Tuesday, June 1, 2010

Learn Regular Expression (Regex) syntax with C# and .NET

What are Regular Expressions?
Regular Expressions are a powerful pattern matching language that is part of many
modern programming languages. Regular Expressions allow you to apply a
pattern to an input string and return a list
of the matches within the text. Regular expressions also allow text to be replaced
using replacement patterns. It is a very powerful version of find and replace.
There are two parts to learning Regular Expressions:
  • learning the Regex syntax
  • learning how to work with Regex in your programming language
This article introduces you to the Regular Expression syntax. After learning the
syntax for Regular Expressions you can use it in many different languages, as the syntax
is fairly similar between languages.
Microsoft's .NET Framework contains a set of classes for working with Regular Expressions
in the System.Text.RegularExpressions namespace.

Download the Regular Expression Designer

When learning Regular Expressions, it helps to have a tool that you can use to test
Regex patterns. Rad Software has a Free Regular Expression Tool
available for download that will help as you go through the article.

The basics - Finding text

Regular Expressions are similar to find and replace in that ordinary characters
match themselves. If I want to match the word "went" the Regular Expression pattern
would be "went".
Text:    Anna Jones and a friend went to lunch
Regex:   went
Matches: Anna Jones and a friend went to lunch
went
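In C# the same search might look like the sketch below. It assumes the static Regex.Match method from System.Text.RegularExpressions and is written with top-level statements (C# 9 or later); the variable names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "Anna Jones and a friend went to lunch";

// Ordinary characters match themselves, so the pattern is just the word.
Match match = Regex.Match(text, "went");

Console.WriteLine(match.Success); // True
Console.WriteLine(match.Value);   // went
Console.WriteLine(match.Index);   // 24 (zero-based position of the match)
```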
The following are special characters when working with Regular Expressions. They
will be discussed throughout the article.
. $ ^ { [ ( | ) * + ? \

Matching any character with dot

The full stop or period character (.) is known as dot. It is a wildcard that will match
any character except a new line (\n). For example, the following Regular Expression
matches the 'a' character followed by any two characters.
Text:    abc def ant cow
Regex:   a..
Matches: abc def ant cow
abc
ant
If the Singleline option is enabled, a dot matches any character including the new line
character.
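A small C# sketch of the dot, with and without Singleline, could look like this. It assumes Regex.Matches, Regex.IsMatch and RegexOptions.Singleline from System.Text.RegularExpressions, plus top-level statements (C# 9 or later); the two-line test string "a\nb" is made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// 'a' followed by any two characters.
string found = string.Join(", ",
    Regex.Matches("abc def ant cow", "a..").Cast<Match>().Select(m => m.Value));
Console.WriteLine(found); // abc, ant

// By default the dot does not match \n, so "a.b" finds nothing in "a\nb".
bool defaultDot = Regex.IsMatch("a\nb", "a.b");
// With RegexOptions.Singleline the dot matches \n as well.
bool singlelineDot = Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline);

Console.WriteLine(defaultDot);    // False
Console.WriteLine(singlelineDot); // True
```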

Matching word characters

Backslash and a lowercase 'w' (\w) is a character class that will match any word
character. The following Regular Expression matches 'a' followed by two word characters.
Text:    abc anaconda ant cow apple
Regex:   a\w\w
Matches: abc anaconda ant cow apple
abc
ana
ant
app
Backslash and an uppercase 'W' (\W) will match any non-word character.
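To try \w and \W from C#, something like the following should work. It is only a sketch, assuming Regex.Matches and top-level statements (C# 9 or later). Note the verbatim string (@"...") so the backslash reaches the Regex engine unchanged.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "abc anaconda ant cow apple";

// 'a' followed by two word characters.
string words = string.Join(", ",
    Regex.Matches(text, @"a\w\w").Cast<Match>().Select(m => m.Value));
Console.WriteLine(words); // abc, ana, ant, app

// \W matches the non-word characters, here the four spaces between the words.
int nonWordCount = Regex.Matches(text, @"\W").Count;
Console.WriteLine(nonWordCount); // 4
```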

Matching white-space

White-space can be matched using \s (backslash and 's').
The following Regular Expression matches the letter 'a' followed
by two word characters then a white space character.
Text:    "abc anaconda ant"
Regex:   a\w\w\s
Matches: 
"abc "
Note that "ant" was not matched as it is not followed by a white space character.
White-space is defined as the space character, new line (\n), form feed (\f),
carriage return (\r), tab (\t) and vertical tab (\v).
Be careful using \s as it can lead to unexpected behaviour by matching line breaks
(\n and \r).
Sometimes it is better to explicitly specify the characters to match instead of
using \s, e.g. to match Tab and Space use
[\t\x20]
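The difference between \s and an explicit character set can be checked with a sketch like this one. It assumes Regex.Matches and top-level statements (C# 9 or later); the set [\t\x20] is tab and space written with a hex escape, and the "mixed" test string is invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// 'a', two word characters, then one white-space character.
string found = string.Join("|",
    Regex.Matches("abc anaconda ant", @"a\w\w\s").Cast<Match>().Select(m => m.Value));
Console.WriteLine($"[{found}]"); // [abc ]

// A string containing \r, \n, a tab and a space.
string mixed = "one\r\ntwo\tthree four";

int anyWhiteSpace = Regex.Matches(mixed, @"\s").Count;    // counts \r, \n, tab and space
int tabOrSpace = Regex.Matches(mixed, @"[\t\x20]").Count; // counts only tab and space

Console.WriteLine(anyWhiteSpace); // 4
Console.WriteLine(tabOrSpace);    // 2
```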

Matching digits

The digits zero to nine can be matched using \d (backslash and lowercase 'd'). For
example, the following Regular Expression matches any three digits in a row.
Text:    123 12 843 8472
Regex:   \d\d\d
Matches: 123 12 843 8472
123
843
847
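As a quick check in C#, the digit example could be run like this (a sketch assuming Regex.Matches and top-level statements, C# 9 or later):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Three digits in a row. "12" is too short, and "8472" yields only "847".
string digits = string.Join(", ",
    Regex.Matches("123 12 843 8472", @"\d\d\d").Cast<Match>().Select(m => m.Value));

Console.WriteLine(digits); // 123, 843, 847
```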

Matching sets of single characters

The square brackets are used to specify a set of single characters to match. Any
single character within the set will match. For example, the following Regular Expression
matches any three characters where the first character is either 'd' or 'a'.
Text:    abc def ant cow
Regex:   [da]..
Matches: abc def ant cow
abc
def
ant

The caret (^) can be added to the start of the set of characters to specify that none
of the characters in the character set should be matched. The following Regular
Expression matches any three characters where the first character is not 'd' and not 'a'.
Text:    abc def ant cow
Regex:   [^da]..
Matches: 
"bc "
"ef "
"nt "
"cow"

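Both kinds of character set can be tried side by side in C#. The sketch below assumes Regex.Matches and top-level statements (C# 9 or later); Find is a hypothetical helper written for this example, and matches are joined with '|' so the trailing spaces stay visible.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of .NET): joins all matches with '|'.
string Find(string text, string pattern) => string.Join("|",
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value));

// First character is 'd' or 'a', followed by any two characters.
string inSet = Find("abc def ant cow", "[da]..");
// First character is anything except 'd' or 'a'.
string notInSet = Find("abc def ant cow", "[^da]..");

Console.WriteLine(inSet);    // abc|def|ant
Console.WriteLine(notInSet); // bc |ef |nt |cow
```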
Matching ranges of characters

Ranges of characters can be matched using the hyphen (-). The following
Regular Expression matches any three characters where the second character is either
'a', 'b', 'c' or 'd'.
Text:    abc pen nda uml
Regex:   .[a-d].
Matches: abc pen nda uml
abc
nda
Ranges of characters can also be combined together. The following Regular Expression
matches any of the characters from 'a' to 'z' or any digit from '0' to '9' followed
by two word characters.
Text:    abc no 0aa i8i
Regex:   [a-z0-9]\w\w
Matches: abc no 0aa i8i
abc
0aa
i8i
The pattern could be written more simply as
[a-z\d]\w\w
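The range examples could be reproduced in C# roughly as follows. This is a sketch assuming Regex.Matches and top-level statements (C# 9 or later); Find is a hypothetical helper, not a .NET method.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of .NET): joins all matches into one string.
string Find(string text, string pattern) => string.Join(", ",
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value));

// The second character must be in the range 'a' to 'd'.
string secondInRange = Find("abc pen nda uml", ".[a-d].");
// Two ranges combined in one set, then two word characters.
string combined = Find("abc no 0aa i8i", @"[a-z0-9]\w\w");

Console.WriteLine(secondInRange); // abc, nda
Console.WriteLine(combined);      // abc, 0aa, i8i
```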

Specifying the number of times to match with Quantifiers

Quantifiers let you specify the number of times that an expression must match. The
most frequently used quantifiers are the asterisk character (*) and the plus sign (+).
Note that the asterisk (*) is usually called the star when talking about Regular
Expressions.

Matching zero or more times with star (*)

The star tells the Regular Expression to match the character, group, or character
class that immediately precedes it zero or more times. This means that the character,
group, or character class is optional: it can be matched but it does not have to match.
The following Regular Expression matches the character 'a' followed by zero or more
word characters.
Text:    Anna Jones and a friend owned an anaconda
Regex:   a\w*
Options: IgnoreCase
Matches: Anna Jones and a friend owned an anaconda
Anna
and
a
an
anaconda
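Here is how the star example might be run from C#. The sketch assumes Regex.Matches with RegexOptions.IgnoreCase and top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "Anna Jones and a friend owned an anaconda";

// 'a' followed by zero or more word characters, ignoring case.
string found = string.Join(", ",
    Regex.Matches(text, @"a\w*", RegexOptions.IgnoreCase)
         .Cast<Match>().Select(m => m.Value));

Console.WriteLine(found); // Anna, and, a, an, anaconda
```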

Matching one or more times with plus (+)

The plus sign tells the Regular Expression to match the character, group, or character
class that immediately precedes it one or more times. This means that the character,
group, or character class must be found at least once. After it is found once it will
keep matching for as long as it repeats. The following Regular Expression matches the
character 'a' followed by at least one word character.
Text:    Anna Jones and a friend owned an anaconda
Regex:   a\w+
Options: IgnoreCase
Matches: Anna Jones and a friend owned an anaconda
Anna
and
an
anaconda
Note that "a" was not matched as it is not followed by any word characters.
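The same text run through the plus quantifier in C# drops the lone "a". Again this is only a sketch, assuming Regex.Matches with RegexOptions.IgnoreCase and top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "Anna Jones and a friend owned an anaconda";

// 'a' followed by one or more word characters, ignoring case.
string found = string.Join(", ",
    Regex.Matches(text, @"a\w+", RegexOptions.IgnoreCase)
         .Cast<Match>().Select(m => m.Value));

Console.WriteLine(found); // Anna, and, an, anaconda
```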

Matching zero or one times with question mark (?)

To specify an optional match use the question mark (?). The question
mark matches zero or one time.
The following Regular Expression matches the character 'a' optionally followed
by a single 'n'.
Text:    Anna Jones and a friend owned an anaconda
Regex:   an?
Options: IgnoreCase
Matches: Anna Jones and a friend owned an anaconda
An
a
an
a
an
an
a
a
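The question mark example produces quite a few short matches, which is easy to confirm in C#. The sketch assumes Regex.Matches with RegexOptions.IgnoreCase and top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "Anna Jones and a friend owned an anaconda";

// 'a' followed by zero or one 'n', ignoring case.
string found = string.Join(", ",
    Regex.Matches(text, "an?", RegexOptions.IgnoreCase)
         .Cast<Match>().Select(m => m.Value));

Console.WriteLine(found); // An, a, an, a, an, an, a, a
```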

Specifying the number of matches

The exact number of matches required for a character, group, or character class
can be specified with the curly brackets ({n}). A minimum with no upper limit is
written {n,}.
The following Regular Expression matches the character 'a' followed by exactly
two 'n' characters. There must be two 'n' characters for a match to occur.
Text:    Anna Jones and Anne owned an anaconda
Regex:   an{2}
Options: IgnoreCase
Matches: Anna Jones and Anne owned an anaconda
Ann
Ann
A range of matches can be specified by curly brackets with two numbers inside ({n,m}).
The first number (n) is the minimum number of matches required, the second (m) is
the maximum number of matches permitted. This Regular Expression matches the character
'a' followed by a minimum of two 'n' characters and a maximum of three 'n' characters.
Text:    Anna and Anne lunched with an anaconda annnnnex
Regex:   an{2,3}
Options: IgnoreCase
Matches: Anna and Anne lunched with an anaconda annnnnex
Ann
Ann
annn
The Regex stops matching after the maximum number of matches has been found.
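Both curly bracket forms can be tested from C# with a sketch like the one below. It assumes Regex.Matches with RegexOptions.IgnoreCase and top-level statements (C# 9 or later); Find is a hypothetical helper written for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of .NET): case-insensitive matches, joined.
string Find(string text, string pattern) => string.Join(", ",
    Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>().Select(m => m.Value));

// 'a' followed by two 'n' characters.
string twoN = Find("Anna Jones and Anne owned an anaconda", "an{2}");
// 'a' followed by two or three 'n' characters; matching stops after the third 'n'.
string twoToThreeN = Find("Anna and Anne lunched with an anaconda annnnnex", "an{2,3}");

Console.WriteLine(twoN);        // Ann, Ann
Console.WriteLine(twoToThreeN); // Ann, Ann, annn
```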

Matching the start and end of a string

To specify that a match must occur at the beginning of a string use the caret character
(^). For example, the following Regular Expression pattern matches the beginning of the
string followed by the character 'a'.
Text:    an anaconda ate Anna Jones
Regex:   ^a
Matches: an anaconda ate Anna Jones
"a" at position 1
The pattern above only matches the a in "an".
Note that the caret (^) has different behaviour when used inside the square brackets.
If the Multiline option is on, the caret (^) will match the beginning of each line in a
multiline string rather than only the start of the string.
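To specify that a match must occur at the end of a string use the dollar character ($).
If the Multiline option is on then the pattern will match at the end of each line
in a multiline string. This Regular Expression pattern matches the word at the end
of each line in a multiline string.
Text:    "an anaconda
ate Anna
Jones"
Regex:   \w+$
Options: Multiline, IgnoreCase
Matches: 
anaconda
Anna
Jones
Note that the dollar ($) only matches at the end of the string or immediately before a
new line (\n). If the lines are separated by \r\n, the carriage return (\r) is not a
word character, so only "Jones" is matched.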
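The anchors are worth trying in C#, because the result of \w+$ depends on how the lines are separated. The sketch below assumes Regex.Matches with RegexOptions.Multiline and top-level statements (C# 9 or later); LastWords is a hypothetical helper, and the two input strings differ only in using \n or \r\n between lines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// ^ anchors the match to the start of the string: only the 'a' in "an" matches.
MatchCollection starts = Regex.Matches("an anaconda ate Anna Jones", "^a");
Console.WriteLine(starts.Count);    // 1
Console.WriteLine(starts[0].Index); // 0 (Index is zero-based)

// Hypothetical helper (not part of .NET): joins every match of \w+$ in Multiline mode.
string LastWords(string text) => string.Join(", ",
    Regex.Matches(text, @"\w+$", RegexOptions.Multiline)
         .Cast<Match>().Select(m => m.Value));

string unixLines = LastWords("an anaconda\nate Anna\nJones");
string windowsLines = LastWords("an anaconda\r\nate Anna\r\nJones");

Console.WriteLine(unixLines);    // anaconda, Anna, Jones
Console.WriteLine(windowsLines); // Jones ($ matches before \n, and \r is not a word character)
```

If the text may contain \r\n line endings, a pattern such as \w+\r?$ is one way to allow for the carriage return, though the \r then becomes part of the match.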
Microsoft have an online reference for Regex in .NET:
Regular Expression Syntax on MSDN
