Regular expressions

A regular expression (aka regex) is a set of rules that puts any imaginable character string into one of two groups: strings that match the regular expression because they follow the rules, and strings that do not.
Thus, these rules specify a pattern of text. For example, it's possible to define a regex that can be used to determine whether some text is a phone number or an email address.
With regular expressions, it's also possible to extract a matched (sub-)string from the text or to replace the matched text with other text.
The regular expression itself is a character string.
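The three uses mentioned above (testing, extracting, replacing) can be sketched with Python's standard re module. The pattern and the sample text are made up for illustration; the pattern is far too simple for real phone numbers.

```python
import re

# Hypothetical, deliberately simple pattern: three digits, a dash, four digits.
PHONE = r"\d{3}-\d{4}"
text = "Call 555-1234 or 555-9876."

# 1. Test: does the text contain something that matches the pattern?
has_phone = re.search(PHONE, text) is not None

# 2. Extract: collect every matched substring.
numbers = re.findall(PHONE, text)

# 3. Replace: substitute other text for each match.
redacted = re.sub(PHONE, "XXX-XXXX", text)

print(has_phone)
print(numbers)
print(redacted)
```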

Some rules

Most characters match themselves

A basic rule of regular expressions is that most characters match themselves.
Regexp  Text    Matches?
p       p       yes
p       pear    yes
p       apple   yes
p       banana  no
ppl     p       no
ppl     pear    no
ppl     people  no
ppl     apple   yes
Note: it is sufficient for text to contain the regular expression in order to match it.
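In Python's re module, this "contains" notion of matching corresponds to re.search, while re.fullmatch requires the whole text to match. A minimal sketch; the helper names contains and matches_whole are made up for this example:

```python
import re

def contains(pattern, text):
    """True if the pattern matches somewhere in the text (the notion used in these notes)."""
    return re.search(pattern, text) is not None

def matches_whole(pattern, text):
    """True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(contains("ppl", "apple"))       # found inside the text
print(matches_whole("ppl", "apple"))  # the whole text is more than "ppl"
print(matches_whole("ppl", "ppl"))
```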

Meta characters

Meta characters are characters that are used to create or specify regular expression rules. Thus, they have a special meaning and don't match themselves.
Some common meta characters are:
  • The dot: .
  • Caret and dollar sign: ^ $
  • Star, plus and question mark: * + ?
  • Parentheses and braces: ( ) { } [ ]
  • Backslash: \
  • Vertical bar: |
In order to match a meta character literally, it needs to be escaped with the meta character backslash: \

Meta character: dot

The dot (.) matches any single character except a new line (\n).
When single line mode is enabled (usually with something like (?s)…), the dot also matches a new line, that is, it matches all characters.
Regexp  Text     Matches?
.       q        yes
.       foo      yes
ba.     foo      no
ba.     bar      yes
ba.     baz      yes
ba.     qabanti  yes
b.r     bar      yes
b.r     baz      no
...     p        no
...     pq       no
...     pqr      yes
...     pqrs     yes
...     pqrst    yes
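The effect of single line mode on the dot can be checked with a short Python sketch. It assumes Python's re module, where the mode is available both as the inline (?s) prefix and as the re.DOTALL flag:

```python
import re

text = "a\nb"  # an 'a', a new line, a 'b'

# Default: the dot does not match the new line, so there is no match.
default = re.search(r"a.b", text)

# (?s) at the start of the pattern switches single line mode on.
inline = re.search(r"(?s)a.b", text)

# The re.DOTALL flag is the equivalent option in Python.
flagged = re.search(r"a.b", text, re.DOTALL)

print(default, inline is not None, flagged is not None)
```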

Meta characters: caret and dollar

The caret (^) and the dollar sign ($) match a position rather than a character: the caret matches the position at the beginning of the string, the dollar sign matches the position at the end of the string.
Regexp  Text           Matches?
two     one two three  yes
^two    one two three  no
two$    one two three  no
two$    one two        yes
^two    one two        no
^two    two three      yes
^t..    two three      yes
^t..    two            yes
^t..    the good       yes
^t..    t              no
^t..    xyz            no
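That the anchors match a position and not a character shows up in the span of the match, which has zero length. A small sketch using Python's re module; the variable names are illustrative only:

```python
import re

text = "one two three"

starts_with_one = re.search(r"^one", text) is not None
starts_with_two = re.search(r"^two", text) is not None
ends_with_three = re.search(r"three$", text) is not None

# The anchors on their own match an empty string at a position.
start_span = re.search(r"^", text).span()
end_span = re.search(r"$", text).span()

print(starts_with_one, starts_with_two, ends_with_three)
print(start_span, end_span)
```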

Meta characters: star, plus, question mark, curly braces

The meta characters * + ? and { } are quantifying meta characters. They control how often the preceding character (or atom, to be defined later) is matched:
* matches 0 or more times.
+ matches 1 or more times.
? matches 0 or 1 times.
{n} matches exactly n times.
{n,m} matches between n and m times.
{n,} matches at least n times.
{,m} matches at most m times. (The .NET regular expression engine does not recognize this variant; {0,m} must be used.)
Regexp    Text    Matches?  Comment
x+        foo     no
x+        xyz     yes
x+        axis    yes
x*        axis    yes
x*        apple   yes       This is tricky: apple does, in fact, have zero or more x's.
x?        apple   yes
x+        apple   no
x{3}      axis    no
x{3}      exxon   no
x{3}      axxxr   yes
x{3}      axxxqr  yes
Aq{2,3}Z  AqZ     no
Aq{2,3}Z  AqqZ    yes
Aq{2,3}Z  AqqqZ   yes
Aq{2,3}Z  AqqqqZ  no
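To see what a quantifier actually consumes, it helps to look at the matched substring rather than only at yes or no. The sketch below uses Python's re module, which (unlike .NET) accepts the {,m} form; the helper matched is made up for this example.

```python
import re

def matched(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return None if m is None else m.group()

print(repr(matched(r"x+", "axxxr")))           # as many x as possible
print(repr(matched(r"x*", "apple")))           # zero x: an empty match
print(repr(matched(r"x+", "apple")))           # at least one x is required
print(repr(matched(r"Aq{2,3}Z", "AqqqqZ")))    # four q are too many
print(repr(matched(r"q{,2}", "qqq")))          # at most two q
```

An empty string as a result still counts as a match, which is why x* and x? match texts that contain no x at all.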

Meta characters: backslash to escape following meta character

One of the uses of the backslash (\) is to escape the following meta character so that it matches the literal character rather than having its meta character meaning.
Regexp  Text       Matches?
a.c     abc        yes
a.c     a.c        yes
a\.c    a.c        yes
a\.c    abc        no
bla$    bla bla    yes
bla$    more bla$  no
bla\$   more bla$  yes
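When the text to be matched literally is not known in advance, the escaping can be left to a library function. A minimal sketch, assuming Python's re module and its re.escape function; the variable names are illustrative:

```python
import re

user_text = "a.c"               # text that should be matched literally
escaped = re.escape(user_text)  # inserts a backslash before the dot

unescaped_hits_abc = re.search(user_text, "abc") is not None
escaped_hits_abc = re.search(escaped, "abc") is not None
escaped_hits_literal = re.search(escaped, "a.c") is not None

print(escaped)
print(unescaped_hits_abc, escaped_hits_abc, escaped_hits_literal)
```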

Meta characters: square brackets to define a set of characters

Square brackets ([…]) define a set of characters; the set matches any single character that is in it. For example, [aeiou] matches (one) lowercase vowel.
Such a set of characters is especially useful when combined with one of the quantifiers.
Regexp      Text   Matches?
[aeiou]     xyz    no
[aeiou]     one    yes
[aeiou]     two    yes
[aeiou]{2}  two    no
[aeiou]{2}  three  yes
If the first character in the square brackets is a caret, it negates the set of characters: [^aeiou] matches any character that is not a lowercase vowel.
Regexp      Text  Matches?
[^aeiou]    X     yes
[^aeiou]    e     no
[^aeiou]    ef    yes
^[^aeiou]$  ef    no
^[^aeiou]$  i     no
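A character set and its negation split a text into complementary groups of characters, which a short Python sketch (using the standard re module, with illustrative variable names) makes visible:

```python
import re

vowels = re.findall(r"[aeiou]", "three")       # every character in the set
non_vowels = re.findall(r"[^aeiou]", "three")  # every character not in the set
double = re.search(r"[aeiou]{2}", "three").group()

# Anchored on both sides, the set must account for the whole text.
single_non_vowel = re.search(r"^[^aeiou]$", "ef")

print(vowels, non_vowels, double, single_non_vowel)
```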

Meta characters: parentheses to create a sequence

Parentheses create a sequence of characters or embedded sequences. As with square brackets, such sequences are especially useful with quantifiers.
Regexp     Text              Matches?
(xyz){2,}  xxyyzz            no
(xyz){2,}  one xyz           no
(xyz){2,}  two xyzxyz!       yes
(xyz){2,}  three xyzxyzxyz.  yes
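The difference the parentheses make for a quantifier can be sketched in Python's re module: with them, the quantifier repeats the whole sequence; without them, it applies to the last character only.

```python
import re

# With parentheses, {2,} repeats the whole sequence xyz.
m = re.search(r"(xyz){2,}", "three xyzxyzxyz.")
whole = m.group(0)  # everything that was matched
last = m.group(1)   # what the group matched in its last repetition

# Without parentheses, {2,} applies to the z only.
no_group_repeat = re.search(r"xyz{2,}", "xyzxyz")
no_group_zz = re.search(r"xyz{2,}", "xyzz").group()

print(whole, last, no_group_repeat, no_group_zz)
```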

Matching «special» characters

A backslash followed by n, r, t or f matches a special character:
\n New line
\r Carriage return
\t Tabulator (ASCII 9)
\f Form feed

Meta characters: backslash for special groups of characters

The backslash followed by some specific character matches a predefined character set. For ASCII text, the following equivalences hold:
Escape  Is equivalent to
\d      [1234567890] or [0-9]
\w      [A-Za-z0-9_]
\s      [ \t\r\n\f] (that is: any whitespace character)
If the backslashed character is uppercase, it negates the meaning of its lowercase cousin. Thus \D is equivalent to [^0-9], etc.
Regexp  Text  Matches?
\d{3}   42    no
\d{3}   123   yes
\D{3}   42    no
\D{3}   ab    no
\D{3}   abc   yes
\w\d    4a    no
\w\d    a4    yes
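These predefined sets are typically combined with quantifiers to pick apart a text. A sketch using Python's re module with made-up sample strings; note that Python's str patterns are Unicode-aware, so \d and \w match more than the ASCII sets listed above unless the re.ASCII flag is passed.

```python
import re

numbers = re.findall(r"\d+", "Room 42, floor 3")       # runs of digits
words = re.split(r"\s+", "one  two\tthree\nfour")      # split on runs of whitespace
idents = re.findall(r"\w+", "foo_bar baz-9")           # runs of word characters
non_digits = re.findall(r"\D+", "a1bc22d")             # runs of non-digits

print(numbers, words, idents, non_digits)
```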

Meta characters: vertical bar

The vertical bar separates two (sub-)regular expressions. At least one of these two regular expressions needs to match in order for the complete regular expression to match.
Because the vertical bar has a low precedence, it is often used within parentheses.
Regexp             Text  Matches?
\d{3}|[^aeiou]{2}  12    yes
\d{3}|[^aeiou]{2}  a     no
\d{3}|[^aeiou]{2}  a123  yes
\d{3}|[^aeiou]{2}  ee99  yes
\d{3}|[^aeiou]{2}  ff99  yes
(\d|\w){2}         1a    yes
(\d|[xyz]){2}      1     no
(\d|[xyz]){2}      1x    yes
(\d|[xyz]){2}      2y    yes
(\d|[xyz]){2}      xy    yes
(\d|[xyz]){2}      ay    no
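The low precedence of the vertical bar is easiest to see together with anchors. In the following Python sketch (sample words chosen for illustration), the pattern without parentheses reads as "^cat or dog$", while the parenthesized one requires the whole text to be one of the two words:

```python
import re

loose = r"^cat|dog$"    # either "cat" at the start, or "dog" at the end
tight = r"^(cat|dog)$"  # the whole text is "cat" or "dog"

loose_catalog = re.search(loose, "catalog") is not None
tight_catalog = re.search(tight, "catalog") is not None
tight_dog = re.search(tight, "dog") is not None

print(loose_catalog, tight_catalog, tight_dog)
```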

Matching word boundaries

Similarly to ^ and $ matching the beginning or end of an entire string, there is a notation to match the beginning or end of a word: \b. Such a word boundary is (usually) between a \w and a \W or ^ or $. Like ^ and $, the word boundary matches a position rather than one (or more) characters. Technically, this is referred to as a zero-width assertion.
Regexp    Text       Matches?
\bxyz     xyz one    yes
\bxyz     one xyz    yes
\bxyz     blaxyz     no
\bxyz     xyzbla     yes
\bxyz\b   blaxyzbla  no
Note that some regular expression dialects (VIM) use \< and \> to match on the left or right side of a word.
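A common use of \b is whole-word search and replace. The sketch below uses Python's re module with made-up sample text; the patterns are written as raw strings because in an ordinary Python string literal \b would denote a backspace character.

```python
import re

# Only the stand-alone occurrences of xyz are found.
whole_words = re.findall(r"\bxyz\b", "xyz blaxyz xyzbla (xyz)")

# Replace the word cat, but leave concat and cats alone.
replaced = re.sub(r"\bcat\b", "dog", "cat concat cats cat.")

print(whole_words)
print(replaced)
```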

(?…)

(?<name>…), (?'name'…)    Named matched subexpression. In substitution, use ${name} to refer to the named subexpression.
(?:…)    Non-capturing group.
(?imnsx-imnsx:…)    Group options. For example, (?i-s:…) turns case insensitivity on and disables single-line mode. See also flags/options below.
(?<name_cur-name_prev>…), (?'name_cur-name_prev'…)    Balancing group definition. name_cur is optional.
(?=…), (?!…), (?<=…), (?<!…)    Zero-width positive look-ahead, negative look-ahead, positive look-behind and negative look-behind assertions.
(?>…)    Atomic group. Also known as nonbacktracking subexpression, atomic subexpression or once-only subexpression.
(?#…)    Comment. Compare with the x flag in combination with a # that comments to the end of line.
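The notation above follows .NET. As a rough sketch of how some of these constructs look in Python's re module: named groups are spelled (?P<name>…) there and referenced as \g<name> in substitutions, scoped options such as (?i:…) need Python 3.6 or later, and balancing groups are not available. The date pattern and sample texts are made up for illustration.

```python
import re

# Named groups: Python uses (?P<name>...) instead of (?<name>...).
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 2021-07")
year = m.group("year")

# In substitutions, Python refers to a named group with \g<name>.
swapped = date.sub(r"\g<month>/\g<year>", "released 2021-07")

# Non-capturing group: (?:ab) is repeated but not captured, only (c) is.
captured = re.search(r"(?:ab)+(c)", "ababc").groups()

# Scoped option: case is ignored for "hello" only.
scoped_hit = re.search(r"(?i:hello) World", "HELLO World") is not None
scoped_miss = re.search(r"(?i:hello) World", "HELLO world") is not None

print(year, swapped, captured, scoped_hit, scoped_miss)
```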

Lookaround assertions

A lookaround assertion makes sure that a given atom
  • matches (positive lookaround assertion, indicated by a =), or
  • doesn't match (negative lookaround assertion, indicated by a !)
at a given position in the text to be matched, yet without adding the atom to the matched text or consuming the matched part from the text.
The = or ! is preceded by a < if the assertion is to be look-behind.
             Positive   Negative
Look-ahead   (?=ATOM)   (?!ATOM)
Look-behind  (?<=ATOM)  (?<!ATOM)
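A negative look-behind assertion might be used in an SQL file to search for a given pattern that is not in a commented (--) line. Because of the .* inside the assertion, this requires an engine that supports variable-length look-behind, such as .NET's: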
/(?<!--.*)pattern/
Compare with the lookaround assertions in Vim.
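The four kinds of lookaround can be tried out in Python's re module, with one caveat: that module only accepts look-behind assertions of fixed width, so the SQL comment pattern shown above is rejected there. A minimal sketch with made-up sample texts:

```python
import re

# Positive look-behind: digits directly preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost: $25, qty 3, $7")

# Negative look-behind: digits preceded by neither a dollar sign nor a digit.
plain = re.findall(r"(?<![$\d])\d+", "cost: $25, qty 3, $7")

# Positive look-ahead: names followed by .sql (the extension is not consumed).
sql_files = re.findall(r"\w+(?=\.sql)", "load a.sql b.txt c.sql")

# Negative look-ahead: a foo that is not followed by bar.
second_foo = re.search(r"foo(?!bar)", "foobar foobaz").start()

# Variable-length look-behind is an error in Python's re module.
try:
    re.compile(r"(?<!--.*)pattern")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(prices, plain, sql_files, second_foo, variable_width_ok)
```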

Flags / Options

Most implementations allow altering the behaviour of regular expressions with flags. Typically, such flags include
i  Ignore case
m  Multi-line mode: ^ and $ also match at the beginning and end of each line
s  Single-line mode: the dot (.) also matches new lines
x  Ignore unescaped white-space in the regexp; # introduces a comment up to the end of the line
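In Python's re module, these flags are available as re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE. The following sketch shows three of them; the sample text and the simplistic phone pattern are assumptions made for this example.

```python
import re

text = "one\ntwo\nthree"

# Without m, ^ only matches at the very beginning of the text.
first_only = re.findall(r"^t\w+", text)
# With m, ^ also matches at the beginning of each line.
each_line = re.findall(r"^t\w+", text, re.MULTILINE)

# i: ignore case.
ignore_case = re.search(r"TWO", text, re.IGNORECASE).group()

# x: white-space in the pattern is ignored, # starts a comment.
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a dash
    \d{4}   # four digits
""", re.VERBOSE)
verbose_hit = phone.search("call 555-1234").group()

print(first_only, each_line, ignore_case, verbose_hit)
```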

Hexadecimal notation for characters

At least in .NET's regexp-engine, a character can be matched with its hexadecimal notation. Hexadecimal 74 is a t.
PS> 'one two three' -match '\x74'
True

PS> 'four five six' -match '\x74'
False
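
Python's re module accepts the same \xhh notation. A short sketch mirroring the PowerShell session above:

```python
import re

# \x74 is the character with hexadecimal code 74, that is, a t.
has_t = re.search(r"\x74", "one two three") is not None
no_t = re.search(r"\x74", "four five six") is not None

print(has_t, no_t)
```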

Some examples in a few environments

VIM
PCRE (Perl compatible regular expressions)
R functions: regular expressions
Perl module DBD::SQLite - regexp
Python's standard library re
SAS functions for regular expressions: prxchange, prxmatch etc.
See also SAS macros for regular expressions
SQLite function regexp
Regular expressions in PHP
PHP code snippets: regular expressions for SQLite
Oracle functions for regular expressions
Similarly for SQL Server.
Regular expressions in bash
The JavaScript RegExp object and String methods that operate on regular expressions.
The shell command grep
Regular expressions in VBA
PowerShell:
findstr.exe
SQL standard: the features F841, F842, F843, F844, F845 (like_regex, occurrences_regex, position_regex, substring_regex, translate_regex) and feature T581.

See also

The COM object (?) Microsoft VBScript Regular Expressions.
Linux shell: Using grep to find files matching multiple regular expressions on different lines
The .NET
A minimalistic C-Sharp class to remove SQL comments with regular expressions.
When Ken Thompson reimplemented qed, he added regular expressions.
cs.github.com promises to search for code with regular expressions.
This VBA function finds cells in an Excel worksheet whose values match a given regular expression.

Links

Henry Spencer wrote a non-proprietary replacement for regex(3) and made it freely available.
