With regular expressions, it's also possible to extract some matched (sub-)string from the text or replace the matched text with other text.
The regular expression itself is a character string.
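Extraction and replacement can be sketched as follows. This is a minimal illustration that assumes Python's standard re module (the text does not prescribe a language) and a made-up sample string:

```python
import re

text = "one two three"  # hypothetical sample text

# Extract the matched substring.
match = re.search("two", text)
extracted = match.group(0) if match else None

# Replace the matched text with other text.
replaced = re.sub("two", "2", text)

print(extracted)  # two
print(replaced)   # one 2 three
```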
Some rules
Most characters match themselves
A basic rule of regular expressions is that most characters match themselves.
Regexp | text | matches? |
p | p | ✓ |
p | pear | ✓ |
p | apple | ✓ |
p | banana | ✗ |
ppl | p | ✗ |
ppl | pear | ✗ |
ppl | people | ✗ |
ppl | apple | ✓ |
Note: it is sufficient for the text to contain the regular expression in order to match it.
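A few rows of the table can be checked with a short sketch, again assuming Python's re module. The helper name contains_match is hypothetical. Note that re.search looks for a match anywhere in the text, whereas re.match would only try the beginning of the text.

```python
import re

def contains_match(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

for pattern, text in [("p", "apple"), ("p", "banana"),
                      ("ppl", "people"), ("ppl", "apple")]:
    print(pattern, text, contains_match(pattern, text))
```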
Meta characters
Meta characters are characters that are used to create or specify regular expression rules. Thus, they have a special meaning and don't match themselves.
Some common meta characters are:
- The dot: .
- Caret and dollar sign: ^ and $
- Star, plus and question mark: *, + and ?
- Parentheses, braces and square brackets: ( ), { } and [ ]
- Backslash: \
- Vertical bar: |
In order to match a meta character literally, it needs to be escaped with the meta character backslash: \
Meta character: dot
The dot (.) matches any single character except a new line (\n).
When single line mode is enabled (usually with something like (?s)…), the dot also matches a new line, that is, it matches all characters.
Regexp | text | matches? |
. | q | ✓ |
. | foo | ✓ |
ba. | foo | ✗ |
ba. | bar | ✓ |
ba. | baz | ✓ |
ba. | qabanti | ✓ |
b.r | bar | ✓ |
b.r | baz | ✗ |
... | p | ✗ |
... | pq | ✗ |
... | pqr | ✓ |
... | pqrs | ✓ |
... | pqrst | ✓ |
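The behaviour of the dot, including single line mode, can be tried out with the following sketch. It assumes Python's re module, where the inline flag (?s) corresponds to re.DOTALL. The sample strings are made up.

```python
import re

# Three dots need at least three characters somewhere in the text.
too_short = re.search(r"...", "pq") is not None
long_enough = re.search(r"...", "pqrs") is not None

# By default, the dot does not match a new line ...
default_mode = re.search(r"a.b", "a\nb") is not None
# ... but with the inline flag (?s) it does.
single_line = re.search(r"(?s)a.b", "a\nb") is not None

print(too_short, long_enough, default_mode, single_line)
# False True False True
```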
Meta characters: caret and dollar
The caret (^) and the dollar sign ($) match a position rather than a character: the caret matches the position at the beginning of the string, the dollar sign matches the position at the end of the string.
Regexp | text | matches? |
two | one two three | ✓ |
^two | one two three | ✗ |
two$ | one two three | ✗ |
two$ | one two | ✓ |
^two | one two | ✗ |
^two | two three | ✓ |
^t.. | two three | ✓ |
^t.. | two | ✓ |
^t.. | the good | ✓ |
^t.. | t | ✗ |
^t.. | xyz | ✗ |
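A small sketch of the two anchors, assuming Python's re module (the helper name found is hypothetical):

```python
import re

def found(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"^two", "one two three"))  # False: "two" is not at the beginning
print(found(r"two$", "one two"))        # True: "two" is at the end
print(found(r"^t..", "two three"))      # True: t plus two more characters
print(found(r"^t..", "t"))              # False: nothing follows the t
```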
Meta characters: star, plus, question mark, curly braces
The meta characters *, +, ? and { } are quantifying meta characters. They control how often the preceding character (or atom, to be defined later) is matched:
* | matches 0 or more times. |
+ | matches 1 or more times. |
? | matches 0 or 1 times. |
{n} | matches exactly n times. |
{n,m} | matches between n and m times. |
{n,} | matches at least n times. |
{,m} | matches at most m times. (The .NET regular expression engine does not recognize this variant; {0,m} must be used.) |
Regexp | text | matches? | Comment |
x+ | foo | ✗ | |
x+ | xyz | ✓ | |
x+ | axis | ✓ | |
x* | axis | ✓ | |
x* | apple | ✓ | This is tricky: apple does, in fact, have zero or more x's. |
x? | apple | ✓ | |
x+ | apple | ✗ | |
x{3} | axis | ✗ | |
x{3} | exxon | ✗ | |
x{3} | axxxr | ✓ | |
x{3} | axxxqr | ✓ | |
Aq{2,3}Z | AqZ | ✗ | |
Aq{2,3}Z | AqqZ | ✓ | |
Aq{2,3}Z | AqqqZ | ✓ | |
Aq{2,3}Z | AqqqqZ | ✗ | |
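The tricky x* case becomes easier to see when the matched text itself is printed. The following sketch assumes Python's re module:

```python
import re

# x* may match zero characters, so it succeeds with an empty match
# right at the start of the text.
empty = re.search(r"x*", "apple")
print(repr(empty.group(0)), empty.span())  # '' (0, 0)

# x+ needs at least one x, so there is no match at all.
print(re.search(r"x+", "apple"))  # None

# {2,3} accepts two or three q's between A and Z.
bounded = [re.search(r"Aq{2,3}Z", text) is not None
           for text in ["AqZ", "AqqZ", "AqqqZ", "AqqqqZ"]]
print(bounded)  # [False, True, True, False]
```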
Meta characters: backslash to escape following meta character
One of the uses of the backslash (\) is to escape the following meta character so that it matches the literal character rather than having its special meaning.
Regexp | text | matches? |
a.c | abc | ✓ |
a.c | a.c | ✓ |
a\.c | a.c | ✓ |
a\.c | abc | ✗ |
bla$ | bla bla | ✓ |
bla$ | more bla$ | ✗ |
bla\$ | more bla$ | ✓ |
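Escaping can be sketched as follows, assuming Python's re module. Its re.escape function adds the backslashes automatically for text that is meant to be taken literally.

```python
import re

print(re.search(r"a.c", "abc") is not None)          # True: dot is a meta character
print(re.search(r"a\.c", "abc") is not None)         # False: escaped dot is literal
print(re.search(r"bla\$", "more bla$") is not None)  # True: literal dollar sign

# Let the library escape the meta characters of a literal text.
literal = re.escape("a.c")
print(re.search(literal, "abc") is not None)  # False
print(re.search(literal, "a.c") is not None)  # True
```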
Meta characters: square brackets to define a set of characters
Square brackets ([…]) define a set of characters and match any single character that is in this set. For example, [aeiou] matches (one) lowercase vowel.
Such a set of characters is especially useful when combined with one of the quantifiers.
Regexp | text | matches? |
[aeiou] | xyz | ✗ |
[aeiou] | one | ✓ |
[aeiou] | two | ✓ |
[aeiou]{2} | two | ✗ |
[aeiou]{2} | three | ✓ |
If the first character in the square brackets is a caret, it negates the set of characters: [^aeiou] matches any character that is not a lowercase vowel.
Regexp | text | matches? |
[^aeiou] | X | ✓ |
[^aeiou] | e | ✗ |
[^aeiou] | ef | ✓ |
^[^aeiou]$ | ef | ✗ |
^[^aeiou]$ | i | ✗ |
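Character sets and negated sets can be explored with a sketch like the following, assuming Python's re module:

```python
import re

# Two consecutive lowercase vowels.
vowel_pair = re.search(r"[aeiou]{2}", "three")
print(vowel_pair.group(0))  # ee

# The first character that is not a lowercase vowel.
non_vowel = re.search(r"[^aeiou]", "ef")
print(non_vowel.group(0))  # f

# Anchored: the entire text must be exactly one non-vowel.
print(re.search(r"^[^aeiou]$", "ef") is not None)  # False
print(re.search(r"^[^aeiou]$", "f") is not None)   # True
```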
Meta characters: parentheses to create a sequence
Parentheses create a sequence of characters or embedded sequences. As with square brackets, such sequences are especially useful with quantifiers.
Regexp | text | matches? |
(xyz){2,} | xxyyzz | ✗ |
(xyz){2,} | one xyz | ✗ |
(xyz){2,} | two xyzxyz! | ✓ |
(xyz){2,} | three xyzxyzxyz. | ✓ |
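The following sketch, assuming Python's re module, prints what a quantified group actually matches. The helper name repeated is hypothetical.

```python
import re

def repeated(text):
    """Return the text matched by (xyz){2,}, or None if there is no match."""
    m = re.search(r"(xyz){2,}", text)
    return m.group(0) if m else None

for text in ["xxyyzz", "one xyz", "two xyzxyz!", "three xyzxyzxyz."]:
    print(text, "->", repeated(text))
```

Because quantifiers are greedy by default, the last text yields all three repetitions rather than just two.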
Matching «special» characters
A backslash followed by n, r, t or f matches a special character:
\n | New line |
\r | Carriage return |
\t | Tabulator (ASCII 9) |
\f | Form feed |
Meta characters: backslash for special groups of characters
The backslash followed by some specific character matches some predefined character set.
| is equivalent to |
\d | [1234567890] or [0-9] |
\w | [A-Za-z0-9_] |
\s | [ \t\r\n\f] (that is: any whitespace character) |
If the backslashed character is uppercase, it negates the meaning of its lowercase cousin. Thus \D is equivalent to [^0-9], etc.
Regexp | text | matches? |
\d{3} | 42 | ✗ |
\d{3} | 123 | ✓ |
\D{3} | 42 | ✗ |
\D{3} | ab | ✗ |
\D{3} | abc | ✓ |
\w\d | 4a | ✗ |
\w\d | a4 | ✓ |
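The predefined character sets can be tried out with the sketch below, which assumes Python's re module. One caveat applies to this assumption: for ordinary (str) patterns in Python 3, \d and \w are Unicode-aware, so the equivalence with [0-9] and [A-Za-z0-9_] only holds for ASCII text or when the re.ASCII flag is given.

```python
import re

def found(pattern, text, flags=0):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text, flags) is not None

print(found(r"\d{3}", "123"))  # True
print(found(r"\D{3}", "42"))   # False: no run of three non-digits
print(found(r"\D{3}", "abc"))  # True
print(found(r"\w\d", "a4"))    # True
print(found(r"\w\d", "4a"))    # False: no digit follows a word character

# U+0663 (ARABIC-INDIC DIGIT THREE) counts as a digit unless re.ASCII is set.
print(found(r"\d", "\u0663"))            # True
print(found(r"\d", "\u0663", re.ASCII))  # False
```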
Meta characters: vertical bar
The vertical bar separates two (sub-)regular expressions. At least one of these two regular expressions needs to match in order for the complete regular expression to match.
Because the vertical bar has a low precedence, it is often used within parentheses.
Regexp | text | matches? |
\d{3}|[aeiou]{2} | 12 | ✗ |
\d{3}|[aeiou]{2} | a | ✗ |
\d{3}|[aeiou]{2} | a123 | ✓ |
\d{3}|[aeiou]{2} | ee99 | ✓ |
\d{3}|[aeiou]{2} | ff99 | ✗ |
(\d|[xyz]){2} | 1a | ✗ |
(\d|[xyz]){2} | 1 | ✗ |
(\d|[xyz]){2} | 1x | ✓ |
(\d|[xyz]){2} | 2y | ✓ |
(\d|[xyz]){2} | xy | ✓ |
(\d|[xyz]){2} | ay | ✗ |
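The low precedence of the vertical bar is easiest to see in combination with anchors. The sketch below assumes Python's re module; the cat/dog patterns are made-up examples that do not appear in the table.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# Without parentheses, the bar splits the whole expression: ^cat OR dog$.
print(found(r"^cat|dog$", "catalog"))    # True: begins with cat
print(found(r"^cat|dog$", "hotdog"))     # True: ends with dog

# With parentheses, both anchors apply to either alternative.
print(found(r"^(cat|dog)$", "catalog"))  # False
print(found(r"^(cat|dog)$", "dog"))      # True

# A quantifier after the group applies to the whole alternation.
print(found(r"(\d|[xyz]){2}", "1x"))     # True
print(found(r"(\d|[xyz]){2}", "ay"))     # False
```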
Matching word boundaries
Similar to ^ and $, which match the beginning or end of an entire string, there is a notation to match the beginning or end of a word: \b. Such a word boundary is (usually) between a \w and a \W, ^ or $. Like ^ and $, the word boundary matches a position rather than one (or more) characters. Technically, this is referred to as a zero-width assertion.
Regexp | text | matches? |
\bxyz | xyz one | ✓ |
\bxyz | one xyz | ✓ |
\bxyz | blaxyz | ✗ |
\bxyz | xyzbla | ✓ |
\bxyz\b | blaxyzbla | ✗ |
Note that some regular expression dialects (Vim) use \< and \> to match on the left or right side of a word.
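Word boundaries can be sketched as follows, assuming Python's re module. The patterns are written as raw strings because in an ordinary Python string literal, "\b" stands for the backspace character.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"\bxyz", "one xyz"))        # True
print(found(r"\bxyz", "blaxyz"))         # False: no boundary before xyz
print(found(r"\bxyz", "xyzbla"))         # True
print(found(r"\bxyz\b", "blaxyzbla"))    # False

# Since \b has zero width, the matched words contain no surrounding characters.
words = re.findall(r"\b\w+\b", "one xyz, two")
print(words)  # ['one', 'xyz', 'two']
```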
Hexadecimal notation for characters
At least in .NET's regexp engine, a character can be matched with its hexadecimal notation. Hexadecimal 74 is a t.
PS> 'one two three' -match '\x74'
True
PS> 'four five six' -match '\x74'
False
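The hexadecimal notation is not limited to .NET. As a sketch under the assumption that Python's re module is used, the same two checks look like this:

```python
import re

hex_t = chr(0x74)
print(hex_t)  # t

in_first = re.search(r"\x74", "one two three") is not None
in_second = re.search(r"\x74", "four five six") is not None
print(in_first, in_second)  # True False
```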