Python standard library: re

The re module of Python's standard library provides functions for matching, searching, splitting and replacing text with regular expressions.

`Match`
`Pattern`
`RegexFlag`
`Scanner`
`T`
`TEMPLATE`
`U`
`UNICODE`
`compile()`
`copyreg`
`enum`
`error`
`escape()`
`findall()`	Finds all non-overlapping occurrences of a pattern (unlike `search()`, which finds only the first one) and returns them as a list of strings or tuples.
`finditer()`
`fullmatch()`
`functools`
`match()`	Checks for a match at the beginning of a string. Compare with `search()`.
`purge()`
`search()`	Finds the first location where the regular expression matches, anywhere in a string. Compare with `match()` and `findall()`.
`split()`	Splits a string into a list. The regular expression determines where the string is divided.
`sre_compile`
`sre_parse`
`sub()`	Replaces matched text with a constant string or with the value a function returns.
`subn()`
`template()`
`_MAXCACHE`
`_cache`
`_compile()`
`_compile_repl`
`_expand()`
`_locale()`
`_pickle()`
`_special_chars_map`
`_subx()`
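
Several of the functions listed above have no description. The following is a minimal sketch of how `fullmatch()`, `finditer()`, `escape()` and `subn()` behave; the sample strings and variable names are made up for illustration.

```python
import re

# fullmatch() succeeds only if the whole string matches the pattern.
whole = re.fullmatch(r'\d+', '2024')       # a Match object
partial = re.fullmatch(r'\d+', '2024-01')  # None

# finditer() yields one Match object per occurrence, so positions are available.
positions = [(m.group(), m.start()) for m in re.finditer(r'\d+', 'a 1 bb 22')]

# escape() backslash-escapes regex metacharacters in a literal string.
literal = re.escape('1+1')
found = re.search(literal, 'what is 1+1?')

# subn() works like sub() but also returns the number of replacements made.
new_text, count = re.subn(r'\d+', 'N', 'foo 42 bar 18')

print(positions)        # [('1', 2), ('22', 7)]
print(found.group())    # 1+1
print(new_text, count)  # foo N bar N 2
```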

flags

Some functions in the re module (such as re.compile()) have a flags parameter which specifies further characteristics of the regular expression. The value of this parameter is formed by combining, with the bitwise OR operator |, one or more of the following values:

`re.A`	`re.ASCII`	`(?a)`	256	`\w` etc. match ASCII characters only (by default, `\w` etc. match Unicode characters)
	`re.DEBUG`		128	print debug information about the compiled regular expression
`re.I`	`re.IGNORECASE`	`(?i)`	2	match case insensitively
`re.L`	`re.LOCALE`	`(?L)`	4	`\w` etc. and case-insensitive matching depend on the current locale (bytes patterns only)
`re.M`	`re.MULTILINE`	`(?m)`	8	`^` matches at the start of the string and after each newline (and `$` at the end of the string and before each newline)
`re.S`	`re.DOTALL`	`(?s)`	16	`.` matches every character, including a newline
`re.X`	`re.VERBOSE`	`(?x)`	64	allows writing more readable regular expressions (whitespace and `#` comments in the pattern are ignored)
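
As a brief illustration of the table above, flags can be combined with `|` or written inline at the start of the pattern. The sample text and variable names below are invented for this sketch.

```python
import re

text = "first line\nSECOND line\nthird line"

# With MULTILINE, ^ matches at the start of every line.
starts = re.findall(r'^\w+', text, flags=re.M)

# IGNORECASE and MULTILINE combined with the bitwise OR operator.
second = re.findall(r'^second', text, flags=re.I | re.M)

# The same two flags written inline at the start of the pattern.
second_inline = re.findall(r'(?im)^second', text)

# Without DOTALL, . stops at a newline; with DOTALL, it crosses lines.
no_dotall = re.search(r'first.*third', text)
with_dotall = re.search(r'first.*third', text, flags=re.S)

print(starts)            # ['first', 'SECOND', 'third']
print(second)            # ['SECOND']
print(no_dotall)         # None
print(int(re.I | re.M))  # 10
```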

Simple script

#!/usr/bin/python
import re

# A raw string (r'...') keeps \d from being treated as a string escape sequence.
re_number = re.compile(r'\d+')


for i in ['foo', 'bar 42 baz', 'hello', 'etc', '20' ]:

    if re_number.match(i):
       print (i + " is a number")

    if re_number.search(i):
       print (i + " contains a number")
    
    # bar 42 baz contains a number
    # 20 is a number
    # 20 contains a number


print ("---")

for found in re.findall(r'(\w+)\s+(\d+)', 'foo 42 bar 18 baz 19 x'):
    print (found[0] + ': ' + found[1])
    # foo: 42
    # bar: 18
    # baz: 19

print ("---")

print (re.sub(r'\d+', 'XX', 'foo 42 bar 18 baz 19 x'))
# foo XX bar XX baz XX x

Github repository about-python, path: /standard-library/re/script.py

search

re.search() returns a re.Match object for the first match, or None if the pattern does not match anywhere in the string.

#!/usr/bin/python

import re

if re.search(r'\d\d\d', 'one 234 five six'):
   print ("matched")
   # matched
else:
   print ("didn't match")

Github repository about-python, path: /standard-library/re/search-1.py

#!/usr/bin/python

import re

match = re.search(r'(\d\d\d|\w\w\w)', 'one 234 five s')
if match:
   print (match.group())
   # one
   print (match.group(1))
   # one
else:
   print ("didn't match")

Github repository about-python, path: /standard-library/re/search-2.py

Using if match := re.search(…) combines the if statement and the assignment of re.search()'s return value in one line (the := operator requires Python 3.8 or later):

import re

if match := re.search(r'foo:\s+(\d+),\s+bar:\s+(\d+)', 'hello foo:   42, bar:  1001xyz'):

   print (match.group() )  # foo:   42, bar:  1001
   print (match.group(1))  # 42
   print (match.group(2))  # 1001

Github repository about-python, path: /standard-library/re/search-3.py
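
Besides group(), the re.Match object returned by re.search() exposes the position of the match and all captured groups. The following sketch assumes an illustrative pattern with two named groups, key and value, which do not appear in the scripts above.

```python
import re

# (?P<name>...) gives a group a name in addition to its number.
m = re.search(r'(?P<key>\w+):\s+(?P<value>\d+)', 'hello foo:   42, bar')

if m:
    print(m.group(0))      # foo:   42
    print(m.group('key'))  # foo
    print(m.groups())      # ('foo', '42')
    print(m.span())        # (6, 15)
    print(m.groupdict())   # {'key': 'foo', 'value': '42'}
```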

findall

#!/usr/bin/python

import re

re_numbers = re.compile(r'\d+')

for found in re_numbers.findall('foo 42 bar 18 baz 19 x'):
    print (found)
    # 42
    # 18
    # 19

Github repository about-python, path: /standard-library/re/findall.py

Return a list of tuples

In the following example, the pattern contains two pairs of parentheses (capturing groups). Each match is returned as a tuple whose elements hold the text captured by the respective group.

import re
for pair in re.findall(r'(\w+): (\d+)', 'foo: 42; bar: 99; baz: 0'):
    print(pair[0] + ' = ' + pair[1])
    # foo = 42
    # bar = 99
    # baz = 0
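
What findall() returns depends on the number of capturing groups in the pattern. The following sketch, with made-up variable names, compares the three cases.

```python
import re

text = 'foo: 42; bar: 99'

# No group: a list of the whole matches.
no_group = re.findall(r'\w+: \d+', text)

# One group: a list of the text captured by that group only.
one_group = re.findall(r'\w+: (\d+)', text)

# Two or more groups: a list of tuples, one element per group.
two_groups = re.findall(r'(\w+): (\d+)', text)

print(no_group)    # ['foo: 42', 'bar: 99']
print(one_group)   # ['42', '99']
print(two_groups)  # [('foo', '42'), ('bar', '99')]
```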

search() vs match()

re.search() scans the entire text for the first position where the pattern matches, while re.match() only tries to match at the very start of the text.

Both re.search() and re.match() return a re.Match object on success and None otherwise.
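
A small sketch of the difference, using an invented sample string. It also shows that re.MULTILINE does not change where re.match() starts.

```python
import re

text = 'bar 42 baz'

at_start = re.match(r'\d+', text)   # None: the text does not start with a digit
anywhere = re.search(r'\d+', text)  # finds 42 in the middle of the text
prefix = re.match(r'\w+', text)     # matches the leading word

# Even with re.MULTILINE, match() only tries the very start of the string.
second_line = re.match(r'\d+', 'x\n1', flags=re.M)

print(at_start)          # None
print(anywhere.group())  # 42
print(prefix.group())    # bar
print(second_line)       # None
```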

sub

Replace a range

Replace every character in the range g to p (inclusive) with an asterisk.

Note the unintuitive order of parameters: First the pattern, then the replacement and only then the text on which the replacement is to take place.

#!/usr/bin/python
import re

txt = 'abc defghi jklmn opq rstu vwx yz'

print(re.sub('[g-p]', '*', txt))
#
#  abc def*** ***** **q rstu vwx yz
#

Github repository about-python, path: /standard-library/re/replace-range.py
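
The replacement string can also refer back to captured groups, and the optional count parameter limits the number of replacements. A minimal sketch with an invented sample text:

```python
import re

text = 'foo 42 bar 18 baz 19'

# \1 and \2 in the replacement stand for the text of the first and second group.
swapped = re.sub(r'(\w+) (\d+)', r'\2 \1', text)

# count=1 replaces only the first occurrence.
first_only = re.sub(r'\d+', 'XX', text, count=1)

print(swapped)     # 42 foo 18 bar 19 baz
print(first_only)  # foo XX bar 18 baz 19
```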

Replace with the result of a function

import re
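def double(m):
    # m is the re.Match object of the current match.
    print(type(m.group(0)))  # <class 'str'>: the matched text is a string
    return str(2 * int(m.group(0)))

print(re.sub(r'\d+', double, 'foo 42 bar 99 baz'))
# foo 84 bar 198 baz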

Iterate over words in a text

The following example iterates over the words in a piece of text and skips punctuation:

import re

txt = """\
Foo, bar and baz. Those three words! Do
new lines work, too? Yes: they do.\
"""

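# The final '.' is a separator too and would leave an empty string at the end
# of the list, so empty strings are filtered out.
words = [w for w in re.split('[ .,?;:!\n]+', txt) if w]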

for word in words:
    print(word)

Github repository about-python, path: /standard-library/re/split-text-into-words.py

Extract first line from a text

import re

text = """\
This is the first line.
The second one.
The final one."""

re_1st_line = re.compile('.*')

# By default, . does not match a newline, so .* stops at the end of the first line.
first_line = re_1st_line.match(text)

print(first_line[0])

Github repository about-python, path: /standard-library/re/extract-first-line-from-text.py

Using the returned search() object in an if statement

With the walrus operator (:=), it is possible to assign the object that is returned by search() to a variable within the condition of an if statement:

import re

reNumbers = re.compile(r'(\d+)')

def getNumber(txt):

   if m := reNumbers.search(txt):
      print('The extracted number is ' + m.group(1))

   else:
      print('No number found in ' + txt)


getNumber('hello world')
getNumber('the number is 42, what else?')

Github repository about-python, path: /standard-library/re/search-if.py