Regular expressions

A regular expression (aka regex) is a set of rules that put any imaginable character string into one of two groups:

either the string follows the rule set, or
it doesn't.

The string is said to match the regular expression if it follows the rules.

Thus, these rules specify a pattern of text. For example, it's possible to define a regex that can be used to determine if some text is a phone number or an email address.

With regular expressions, it's also possible to extract a matched (sub-)string from the text or to replace the matched text with other text.

The regular expression itself is a character string.
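
As a rough illustration of extracting and replacing matched text, here is a minimal Python sketch using the standard library's re module. The sample text and the phone-number-like pattern are made up for this example; the pattern syntax itself is explained in the sections that follow.

```python
import re

text = "Call 555-0100 or 555-0199."
pattern = r"\d{3}-\d{4}"   # the regular expression is an ordinary string

# Extract the first substring that matches the pattern.
first = re.search(pattern, text)
extracted = first.group(0) if first else None

# Replace every matching substring with other text.
masked = re.sub(pattern, "XXX-XXXX", text)

print(extracted)  # 555-0100
print(masked)     # Call XXX-XXXX or XXX-XXXX.
```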

Some rules

Most characters match themselves

A basic rule of regular expressions is that most characters match themselves.

Regexp	text	matches?
`p`	`p`	✓
`p`	`pear`	✓
`p`	`apple`	✓
`p`	`banana`	✗
`ppl`	`p`	✗
`ppl`	`pear`	✗
`ppl`	`people`	✗
`ppl`	`apple`	✓

Note: in order to match, it is sufficient for the text to contain a substring that matches the regular expression.
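
The table above can be reproduced with a small Python sketch. The helper name matches is hypothetical. It uses re.search, which looks for the pattern anywhere in the text; Python's re.match would only look at the beginning of the text and would therefore not behave like the table.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if some substring of `text` matches `pattern`."""
    return re.search(pattern, text) is not None

for pattern, text in [("p", "apple"), ("p", "banana"),
                      ("ppl", "people"), ("ppl", "apple")]:
    print(pattern, text, matches(pattern, text))
```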

Meta characters

Meta characters are characters that are used to create or specify regular expression rules. Thus, they have a special meaning and don't match themselves.

Some common meta characters are:

The dot: .
Caret and dollar sign: ^ $
Star, plus and question mark: * + ?
Parentheses, braces and square brackets: ( ) { } [ ]
Backslash: \
Vertical bar: |

In order to match a meta character literally, it needs to be escaped with a preceding backslash: \

Meta character: dot

The dot (.) matches any single character except a new line (\n).

When single line mode is enabled (usually with something like (?s)…), the dot also matches a new line, that is, it matches all characters.

Regexp	text	matches?
`.`	`q`	✓
`.`	`foo`	✓
`ba.`	`foo`	✗
`ba.`	`bar`	✓
`ba.`	`baz`	✓
`ba.`	`qabanti`	✓
`b.r`	`bar`	✓
`b.r`	`baz`	✗
`...`	`p`	✗
`...`	`pq`	✗
`...`	`pqr`	✓
`...`	`pqrs`	✓
`...`	`pqrst`	✓
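
A short Python sketch of the dot's behaviour, including single-line mode. In Python's re module this mode is called re.DOTALL; the variable names are only illustrative.

```python
import re

# The dot stands for exactly one arbitrary character.
three_chars = re.search(r"...", "pqrst").group(0)   # "pqr"
too_short = re.search(r"...", "pq")                  # None

# By default, the dot does not match a newline ...
default_mode = re.search(r"a.b", "a\nb")             # None

# ... unless single-line mode is enabled, by flag or inline with (?s).
dotall_flag = re.search(r"a.b", "a\nb", re.DOTALL)
dotall_inline = re.search(r"(?s)a.b", "a\nb")

print(three_chars, too_short, default_mode)
print(dotall_flag is not None, dotall_inline is not None)
```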

Meta characters: caret and dollar

The caret (^) and the dollar sign ($) match a position rather than a character: the caret matches the position at the beginning of the string, the dollar sign matches the position at the end of the string.

Regexp	text	matches?
`two`	`one two three`	✓
`^two`	`one two three`	✗
`two$`	`one two three`	✗
`two$`	`one two`	✓
`^two`	`one two`	✗
`^two`	`two three`	✓
`^t..`	`two three`	✓
`^t..`	`two`	✓
`^t..`	`the good`	✓
`^t..`	`t`	✗
`^t..`	`xyz`	✗
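
Some rows of the table above, checked with Python's re.search. This is only a sketch; the list name checks is made up.

```python
import re

checks = [
    (r"two",  "one two three"),
    (r"^two", "one two three"),
    (r"two$", "one two"),
    (r"^t..", "the good"),
    (r"^t..", "t"),
]

# Each result is True if the pattern matches somewhere in the text.
results = [re.search(pattern, text) is not None for pattern, text in checks]

for (pattern, text), ok in zip(checks, results):
    print(pattern, repr(text), ok)
```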

Meta characters: star, plus, question mark, curly braces

The meta characters * + ? and { } are quantifying meta characters. They control how often the preceding character (or atom, to be defined later) is matched:

`*`	means to match it 0, 1 or more times.
`+`	matches 1 or more times
`?`	matches 0 or 1 times
`{n}`	matches exactly `n` times.
`{n,m}`	matches between `n` and `m` times.
`{n,}`	matches at least `n` times.
`{,m}`	matches at most `m` times. (The .NET regular expression engine does not recognize this variant; `{0,m}` must be used instead.)

Regexp	text	matches?	Comment
`x+`	`foo`	✗
`x+`	`xyz`	✓
`x+`	`axis`	✓
`x*`	`axis`	✓
`x*`	`apple`	✓	This is tricky: apple does, in fact, have zero or more x's.
`x?`	`apple`	✓
`x+`	`apple`	✗
`x{3}`	`axis`	✗
`x{3}`	`exxon`	✗
`x{3}`	`axxxr`	✓
`x{3}`	`axxxqr`	✓
`Aq{2,3}Z`	`AqZ`	✗
`Aq{2,3}Z`	`AqqZ`	✓
`Aq{2,3}Z`	`AqqqZ`	✓
`Aq{2,3}Z`	`AqqqqZ`	✗
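
The tricky x* case becomes clearer when looking at what is actually matched. The following Python sketch (variable names are illustrative) shows that x* succeeds on apple with an empty match, and it checks the bounded quantifier from the table.

```python
import re

# x* may match zero characters, so it succeeds on "apple" with an empty match.
empty = re.search(r"x*", "apple")
print(repr(empty.group(0)))          # ''

# x+ needs at least one x and therefore fails.
plus = re.search(r"x+", "apple")
print(plus)                          # None

# {2,3} accepts two or three q's between A and Z, nothing else.
bounded = {text: re.search(r"Aq{2,3}Z", text) is not None
           for text in ["AqZ", "AqqZ", "AqqqZ", "AqqqqZ"]}
print(bounded)
```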

Meta characters: backslash to escape following meta character

One of the uses of the backslash (\) is to escape the following meta character so that it matches the literal character rather than having its special meaning.

Regexp	text	matches?
`a.c`	`abc`	✓
`a.c`	`a.c`	✓
`a\.c`	`a.c`	✓
`a\.c`	`abc`	✗
`bla$`	`bla bla`	✓
`bla$`	`more bla$`	✗
`bla\$`	`more bla$`	✓
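
When the text to be matched literally comes from somewhere else (user input, for example), escaping by hand is error-prone. Python's re module offers re.escape for this; the sketch below assumes Python and uses made-up variable names.

```python
import re

literal = "a.c"

# Unescaped, the dot is a meta character and also matches the b in "abc".
unescaped = re.search(literal, "abc")

# re.escape puts a backslash in front of the meta characters.
escaped_pattern = re.escape(literal)
escaped_miss = re.search(escaped_pattern, "abc")   # None
escaped_hit = re.search(escaped_pattern, "a.c")

# Escaping by hand works the same way.
dollar = re.search(r"bla\$", "more bla$")

print(escaped_pattern)
print(unescaped is not None, escaped_miss, escaped_hit is not None, dollar is not None)
```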

Meta characters: square brackets to define a set of characters

Square brackets ([…]) define a set of characters and match any single character that is in this set. For example, [aeiou] matches one lowercase vowel.

Such a set of characters is especially useful when combined with one of the quantifiers.

Regexp	text	matches?
`[aeiou]`	`xyz`	✗
`[aeiou]`	`one`	✓
`[aeiou]`	`two`	✓
`[aeiou]{2}`	`two`	✗
`[aeiou]{2}`	`three`	✓

If the first character in the square brackets is a caret, it negates the set of characters: [^aeiou] matches any character that is not a lowercase vowel.

Regexp	text	matches?
`[^aeiou]`	`X`	✓
`[^aeiou]`	`e`	✗
`[^aeiou]`	`ef`	✓
`^[^aeiou]$`	`ef`	✗
`^[^aeiou]$`	`i`	✗
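
A small Python sketch of character sets and negated sets. re.findall returns all non-overlapping matches, which makes it easy to see which characters a set picks out. The variable names are illustrative.

```python
import re

vowels = re.findall(r"[aeiou]", "three")          # each single vowel
double = re.findall(r"[aeiou]{2}", "three")       # two vowels in a row
non_vowels = re.findall(r"[^aeiou]", "three")     # everything but vowels

# Anchored on both sides, the negated set accepts exactly one non-vowel.
single_non_vowel = [w for w in ["X", "e", "ef", "f", "i"]
                    if re.search(r"^[^aeiou]$", w)]

print(vowels, double, non_vowels, single_non_vowel)
```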

Meta characters: parentheses to create a sequence

Parentheses group a sequence of characters (or nested sequences) into a single unit, an atom, to which a quantifier can then be applied as a whole. As with square brackets, such sequences are especially useful with quantifiers.

Regexp	text	matches?
`(xyz){2,}`	`xxyyzz`	✗
`(xyz){2,}`	`one xyz`	✗
`(xyz){2,}`	`two xyzxyz!`	✓
`(xyz){2,}`	`three xyzxyzxyz.`	✓
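
The difference the parentheses make can be seen by comparing the grouped and the ungrouped pattern. This is a Python sketch with made-up variable names: without parentheses, the quantifier only applies to the z.

```python
import re

# With parentheses, {2,} repeats the whole sequence xyz.
grouped = re.search(r"(xyz){2,}", "three xyzxyzxyz.")
print(grouped.group(0))                       # xyzxyzxyz

# Without parentheses, {2,} only repeats the z.
ungrouped = re.search(r"xyz{2,}", "xyzz")
print(ungrouped.group(0))                     # xyzz

# The grouped pattern does not accept that text.
grouped_miss = re.search(r"(xyz){2,}", "xyzz")
print(grouped_miss)                           # None
```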

Matching «special» characters

A backslash followed by n, r, t or f matches a special character:

`\n`	New line
`\r`	Carriage return
`\t`	Tabulator (ASCII 9)
`\f`	Form feed

Meta characters: backslash for special groups of characters

The backslash followed by some specific character matches some predefined character set.

Shorthand	is equivalent to
`\d`	`[1234567890]` or `[0-9]`
`\w`	`[A-Za-z0-9_]`
`\s`	`[ \t\r\n\f]` (that is: any whitespace character)

If the backslashed character is uppercase, it negates the meaning of its lowercase cousin. Thus \D is equivalent to [^0-9], etc.

Regexp	text	matches?
`\d{3}`	`42`	✗
`\d{3}`	`123`	✓
`\D{3}`	`42`	✗
`\D{3}`	`ab`	✗
`\D{3}`	`abc`	✓
`\w\d`	`4a`	✗
`\w\d`	`a4`	✓
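
A Python sketch of the predefined character sets. One assumption to be aware of: for ordinary Python strings, \d, \w and \s also match non-ASCII digits, letters and whitespace unless the re.ASCII flag is given, so the equivalences listed above hold exactly only in ASCII mode.

```python
import re

three_digits = re.search(r"\d{3}", "123")
three_non_digits = re.search(r"\D{3}", "42")    # None: "42" has no non-digits
word_then_digit = re.search(r"\w\d", "a4")
digit_then_letter = re.search(r"\w\d", "4a")    # None: no digit in second place

# \s+ is handy for splitting on any run of whitespace.
fields = re.split(r"\s+", "one  two\tthree\nfour")

print(three_digits is not None, three_non_digits,
      word_then_digit is not None, digit_then_letter)
print(fields)
```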

Meta characters: vertical bar

The vertical bar separates two (sub-)regular expressions. At least one of these two regular expressions needs to match in order for the complete regular expression to match.

Because the vertical bar has a low precedence, it is often used within parentheses.

Regexp	text	matches?
`\d{3}|[^aeiou]{2}`	`12`	✓
`\d{3}|[^aeiou]{2}`	`a`	✗
`\d{3}|[^aeiou]{2}`	`a123`	✓
`\d{3}|[^aeiou]{2}`	`ee99`	✓
`\d{3}|[^aeiou]{2}`	`ff99`	✓
`(\d|\w){2}`	`1a`	✓
`(\d|[xyz]){2}`	`1`	✗
`(\d|[xyz]){2}`	`1x`	✓
`(\d|[xyz]){2}`	`2y`	✓
`(\d|[xyz]){2}`	`xy`	✓
`(\d|[xyz]){2}`	`ay`	✗
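
The low precedence of the vertical bar is easy to get wrong, so here is a Python sketch (variable names are illustrative). Without parentheses, ^a|b$ is read as "^a or b$", not as "a or b between ^ and $".

```python
import re

# Two characters, each of which is a digit or one of x, y, z.
pairs = {text: re.search(r"(\d|[xyz]){2}", text) is not None
         for text in ["1", "1x", "xy", "ay"]}
print(pairs)

# Without parentheses, the anchors belong to one alternative each.
loose = re.search(r"^a|b$", "axx")      # matches because of ^a
tight = re.search(r"^(a|b)$", "axx")    # None: the text is not just a or b
print(loose is not None, tight)
```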

Matching word boundaries

Similarly to ^ and $ matching the beginning or end of an entire string, there is a notation to match the beginning or end of a word: \b. Such a word boundary is (usually) located between a \w and a \W, or between a \w and the beginning or end of the string. Like ^ and $, the word boundary matches a position rather than one (or more) characters. Technically, this is referred to as a zero-width assertion.

Regexp	text	matches?
`\bxyz`	`xyz one`	✓
`\bxyz`	`one xyz`	✓
`\bxyz`	`blaxyz`	✗
`\bxyz`	`xyzbla`	✓
`\bxyz\b`	`blaxyzbla`	✗

Note that some regular expression dialects (for example Vim) use \< and \> to match the beginning or the end of a word.
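
A Python sketch of \b, with made-up variable names. Note that in Python the pattern should be written as a raw string (r"…"), because in an ordinary string literal "\b" is the backspace character.

```python
import re

# Only the free-standing occurrences of xyz are found.
whole_words = re.findall(r"\bxyz\b", "xyz blaxyz xyzbla xyz")

# \bxyz only requires a boundary on the left-hand side.
starts = [text for text in ["xyz one", "one xyz", "blaxyz", "xyzbla"]
          if re.search(r"\bxyz", text)]

# Without the r prefix, "\b" is a backspace character, not a word boundary.
backspace = re.search("\bxyz", "xyz one")

print(whole_words, starts, backspace)
```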

(?…)

`(?<name>…)`, `(?'name'…)`	Named matched subexpression	In substitution, use `${name}` to refer to named subexpression
`(?:…)`	Non-capturing group
`(?imnsx-imnsx:…)`	Group options	For example, `(?i-s:…)` turns case insensitivity on and disables single-line mode. See also flags/options below
`(?<name_prev-name_cur>…)`, `(?'name_prev-name_cur'…)`	Balancing group definition	`name_prev` is optional
`(?=…)`, `(?!…)`, `(?<=…)`, `(?<!…)`	Zero-width positive lookahead, negative lookahead, positive lookbehind and negative lookbehind assertions
`(?>…)`	Atomic group	Aka nonbacktracking subexpression, atomic subexpression or once-only subexpression
`(?#…)`	Comment	Compare with the `x` flag in combination with a `#` that comments to the end of line
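
The notation in the table above follows .NET. As a sketch of the same ideas in Python's re module: named groups are spelled (?P<name>…) there and are referred to as \g<name> in the replacement text, while (?:…) works as shown above. The sample text is made up.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})"
text = "released 2024-05"

# Named groups can be read by name after a match.
date = re.search(pattern, text)
print(date.group("year"), date.group("month"))    # 2024 05

# In a substitution, \g<name> refers to the named group.
swapped = re.sub(pattern, r"\g<month>/\g<year>", text)
print(swapped)                                    # released 05/2024

# A non-capturing group is grouped for the quantifier but not captured.
non_capturing = re.search(r"(?:ab)+(c)", "ababc")
print(non_capturing.groups())                     # ('c',)
```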

Lookaround assertions

A lookaround assertion makes sure that a given atom

matches (positive lookaround assertion, indicated by a =), or
doesn't match (negative lookaround assertion, indicated by a !)

at a given position in the text to be matched, yet without adding the atom to the matched text or consuming the matched part from the text.

The = or ! is preceded by a < if the assertion is to be look-behind.

	Positive	Negative
Look-ahead	`(?=ATOM)`	`(?!ATOM)`
Look-behind	`(?<=ATOM)`	`(?<!ATOM)`

A negative look-behind assertion might be used in an SQL file to search for a given pattern that is not in a commented (--) line:

/(?<!--.*)pattern/

Compare with the lookaround assertions in Vim.
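
The four kinds of lookaround can be tried out with the following Python sketch; the sample text and variable names are made up. One caveat: Python's re module only accepts look-behind patterns of fixed width, so the SQL comment example above, which contains .* inside the look-behind, is expected to be rejected there. It needs an engine with variable-width look-behind, such as .NET's.

```python
import re

text = "price: $42, discount: 15%"

after_dollar = re.findall(r"(?<=\$)\d+", text)         # positive look-behind
before_percent = re.findall(r"\d+(?=%)", text)         # positive look-ahead
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)   # negative look-behind
not_before_percent = re.findall(r"\b\d+\b(?!%)", text) # negative look-ahead

print(after_dollar, before_percent, not_after_dollar, not_before_percent)

# Variable-width look-behind: assumed to be unsupported by Python's re.
try:
    re.compile(r"(?<!--.*)pattern")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```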

Flags / Options

Most implementations allow the behaviour of regular expressions to be altered with flags. Typically, such flags include

`i`	ignore case
`m`	match in multi line mode	`^` and `$` also match at the beginning and end of each line
`s`	single line mode	The dot (`.`) also matches new lines
`x`	Ignore unescaped white-space in regexp. `#` introduces comment up to end of line
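
In Python's re module, these four flags are available as re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE. A minimal sketch with made-up sample text:

```python
import re

text = "First line\nsecond line"

# i: ignore case
case_sensitive = re.search(r"first", text)                  # None
ignore_case = re.search(r"first", text, re.IGNORECASE)

# m: ^ also matches at the beginning of each line
line_starts_default = re.findall(r"^\w+", text)
line_starts_multi = re.findall(r"^\w+", text, re.MULTILINE)

# x: white space in the pattern is ignored, # starts a comment
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal dash
    \d{4}   # four digits
""", re.VERBOSE)
number = phone.search("call 555-0100 now").group(0)

print(case_sensitive, ignore_case is not None)
print(line_starts_default, line_starts_multi, number)
```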

Hexadecimal notation for characters

At least in .NET's regexp-engine, a character can be matched with its hexadecimal notation. Hexadecimal 74 is a t.

PS> 'one two three' -match '\x74'
True

PS> 'four five six' -match '\x74'
False
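
The same check can be sketched in Python, whose re module also understands \x followed by two hexadecimal digits. The variable names are illustrative.

```python
import re

# \x74 is the character t.
has_t = re.search(r"\x74", "one two three") is not None
no_t = re.search(r"\x74", "four five six") is not None

print(has_t)  # True
print(no_t)   # False
```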

Some examples in a few environments

VIM

PCRE (Perl compatible regular expressions)

R functions: regular expressions

Perl module DBD::SQLite - regexp

Python's standard library re

SAS functions for regular expressions: prxchange, prxmatch etc.

SQLite function regexp

Regular expressions in PHP

PHP code snippets: regular expressions for SQLite

Oracle functions for regular expressions

Similarly for SQL Server.

Regular expressions in bash

The JavaScript RegExp object and String methods that operate on regular expressions.

The shell command grep

Regular expressions in VBA

PowerShell:

A string can be tested against a regular expression with the -match and -notmatch operators.
-replace replaces the text matched by a regular expression in a string.
In pipelines, objects can be grepped with the Select-String cmdlet.
The PowerShell module regex

findstr.exe

SQL standard: the features F841, F842, F843, F844, F845 (like_regex, occurrences_regex, position_regex, substring_regex, translate_regex) and feature T581.

Links

Henry Spencer wrote a non-proprietary replacement for regex(3) and made it freely available.