Regular expressions

12 Apr 2020

A very useful tool for understanding what a pattern means: https://regexr.com/

`re.search()` pattern

match = re.search(pat, str)

Basic patterns

a, X, 9, < – ordinary characters just match themselves exactly. Meta-characters don’t, because they have special meanings: . ^ $ * + ? { [ ] \ | ( )
. (a period) – matches any single character except newline ‘\n’
\w – (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_].
\W (upper case W) matches any non-word character
\b – boundary between word and non-word
\s – (lowercase s) matches a single whitespace character. \S (upper case S) matches any non-whitespace character
\t, \n, \r – tab, newline, return
\d – decimal digit [0-9]
^ = start, $ = end – match the start or end of the string
\ – inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash
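
The anchors, the word boundary and escaping are not exercised by the examples in these notes, so here is a minimal sketch of them. The sample string and variable names are made up for illustration.

```python
import re

text = 'cat concat cat.'

# \b marks a word boundary, so the 'cat' inside 'concat' is skipped
whole_words = re.findall(r'\bcat\b', text)

# ^ anchors the pattern to the start of the string, $ to the end
starts_with_cat = re.search(r'^cat', text) is not None
ends_with_cat = re.search(r'cat$', text) is not None  # the string ends with '.'

# \. matches a literal period; an unescaped . would match any character
literal_dot = re.search(r'cat\.$', text).group()

print(whole_words)      # ['cat', 'cat']
print(starts_with_cat)  # True
print(ends_with_cat)    # False
print(literal_dot)      # cat.
```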

import re
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
  print('found', match.group())
else:
  print('did not find')

found word:cat

re.search(r'iii', 'piiig').group()

'iii'

re.search(r'..g', 'piiig')

<re.Match object; span=(2, 5), match='iig'>

re.search(r'\d\d\d', 'p123g')

<re.Match object; span=(1, 4), match='123'>

re.search(r'\w\w\w', '@@abcd!!')

<re.Match object; span=(2, 5), match='abc'>

Repetition

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible. In other words, + and * are "greedy".

+ – 1 or more occurrences of the pattern to its left, e.g. ‘i+’ = one or more i’s
* – 0 or more occurrences of the pattern to its left
? – match 0 or 1 occurrences of the pattern to its left
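
A short sketch of how the three quantifiers differ on the boundary cases (zero or one occurrence). The test strings are invented for illustration.

```python
import re

# ? makes the preceding 'u' optional, so both spellings match
spellings = re.findall(r'colou?r', 'color and colour')

# * allows zero occurrences, so 'pg' matches with no i's at all
zero_ok = re.search(r'pi*g', 'pg')

# + needs at least one occurrence, so the same string does not match
one_required = re.search(r'pi+g', 'pg')

print(spellings)        # ['color', 'colour']
print(zero_ok.group())  # pg
print(one_required)     # None
```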

re.search(r'pi+', 'piiig')

<re.Match object; span=(0, 4), match='piii'>

re.search(r'i+', 'piigiiii')

<re.Match object; span=(1, 3), match='ii'>

re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')

<re.Match object; span=(2, 9), match='1 2   3'>

re.search(r'b\w+', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

Square Brackets

[abc] matches ‘a’ or ‘b’ or ‘c’.

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
  print(match.group())

alice-b@google.com
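
Square brackets also accept ranges and can be inverted with a leading ^. The sketch below shows both, along with the fact that a dot inside brackets is a literal dot (which is why `[\w.-]` works in the email pattern). The input strings are made up.

```python
import re

# A range inside brackets: a run of digits or lowercase letters a-f
hex_run = re.search(r'[0-9a-f]+', 'ID=3fa9z').group()

# Inside brackets a dot is literal, and a trailing hyphen is literal too
separators = re.findall(r'[.-]', 'a.b-c')

# A leading ^ inverts the set: match runs of anything except a comma
fields = re.findall(r'[^,]+', 'red,green,blue')

print(hex_run)     # 3fa9
print(separators)  # ['.', '-']
print(fields)      # ['red', 'green', 'blue']
```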

Group Extraction

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
  print(match.group())  ## 'alice-b@google.com' (the whole match)
  print(match.group(1))  ## 'alice-b' (the username, group 1)
  print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com
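
When every group is needed at once, `match.groups()` returns them together as a tuple, which can be unpacked directly. A small sketch using the same pattern (the variable names `user` and `host` are just illustrative):

```python
import re

match = re.search(r'([\w.-]+)@([\w.-]+)', 'purple alice-b@google.com monkey dishwasher')

# groups() returns all captured groups as one tuple
user, host = match.groups()

# group(0) is the whole match, the same as group() with no argument
whole = match.group(0)

print(user)   # alice-b
print(host)   # google.com
print(whole)  # alice-b@google.com
```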

`findall`

re.search() finds only the first match for a pattern. findall() finds all the matches and returns them as a list of strings.

## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
  # do something with each found email string
  print(email)

alice@google.com
bob@abc.com

`findall` with files

# Open the file; the with-block closes it automatically when done
with open('test.txt', 'r') as f:
  # Feed the file text into findall(); it returns a list of all the found strings
  strings = re.findall(r'some pattern', f.read())
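
To try this without an existing test.txt, here is a self-contained sketch that writes a temporary file first and then runs findall() over its contents. The file contents and the email pattern are assumptions chosen for illustration.

```python
import os
import re
import tempfile

# Write a small sample file so the example does not depend on test.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alice@google.com\nnothing here\nbob@abc.com\n')
    path = tmp.name

# The with-block closes the file even if findall() raises
with open(path, 'r') as f:
    emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

os.remove(path)  # clean up the temporary file
print(emails)    # ['alice@google.com', 'bob@abc.com']
```

Reading the whole file with f.read() is fine for small files. For very large files it may be better to loop over the lines and call findall() on each one.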

`findall` and groups

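If the pattern contains two or more parenthesis groups, findall() returns a list of tuples instead of a list of strings. Each tuple represents one match and holds group(1), group(2), and so on.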

str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for user, host in tuples:  ## unpack each tuple; avoids shadowing the built-in tuple
  print(user)  ## username
  print(host)  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com

Use of `?` and `[^...]` for greedy vs non-greedy

str = '<b>foo</b> and <i>so on</i>' 
re.search(r'<.*>', str)

<re.Match object; span=(0, 27), match='<b>foo</b> and <i>so on</i>'>

re.findall(r'<.*?>', str)

['<b>', '</b>', '<i>', '</i>']

re.findall(r'[^>]*', str)

['<b', '', 'foo</b', '', ' and <i', '', 'so on</i', '', '']
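
The bare `[^>]*` above produces empty strings because it can match zero characters at every `>`. Anchoring the negated class between `<` and `>` is one way to get the same tags as the non-greedy version. A sketch comparing the three approaches (the variable `html` is just a renamed copy of the string above):

```python
import re

html = '<b>foo</b> and <i>so on</i>'

# Greedy: .* runs to the last '>' in the string, so there is one big match
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' it can
lazy = re.findall(r'<.*?>', html)

# Negated class: [^>]* cannot cross a '>', so each match ends at the nearest one
negated = re.findall(r'<[^>]*>', html)

print(greedy)   # ['<b>foo</b> and <i>so on</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```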

Substitution

re.sub(pat, replacement, str)

re.sub() searches for all instances of the pattern in the given string and replaces each one, returning a new string. The original string is not modified.

str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
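
A short sketch of two variations on the same call: using both \1 and \2 in the replacement, and limiting the number of replacements with the `count` argument. The replacement text is invented for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pat = r'([\w.-]+)@([\w.-]+)'

# Both groups can be reused in the replacement string
spelled_out = re.sub(pat, r'\1 at \2', text)

# count=1 replaces only the first match and leaves the rest untouched
first_only = re.sub(pat, r'\1@yo-yo-dyne.com', text, count=1)

print(spelled_out)
# purple alice at google.com, blah monkey bob at abc.com blah dishwasher
print(first_only)
# purple alice@yo-yo-dyne.com, blah monkey bob@abc.com blah dishwasher
```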

Non-capturing group

re.findall(r'(?:ha)+', 'hahaha haa hah!')

['hahaha', 'ha', 'ha']

re.findall(r'(ha)+', 'hahaha haa hah!')

['ha', 'ha', 'ha']
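
The two results differ because findall() returns the contents of the group whenever the pattern has a capturing group, and a repeated group only keeps its last repetition. With (?:...) there is no capturing group, so findall() returns the whole match. A sketch of this, including one way to capture the full run while still repeating:

```python
import re

text = 'hahaha haa hah!'

# No capturing group: findall() returns each whole match
whole = re.findall(r'(?:ha)+', text)

# One capturing group: findall() returns the group, i.e. the last 'ha' repeated
last_repeat = re.findall(r'(ha)+', text)

# Capture the whole run by wrapping the non-capturing repetition in a group
outer = re.findall(r'((?:ha)+)', text)

# search() is unaffected: group() is still the whole match
m = re.search(r'(ha)+', text)

print(whole)        # ['hahaha', 'ha', 'ha']
print(last_repeat)  # ['ha', 'ha', 'ha']
print(outer)        # ['hahaha', 'ha', 'ha']
print(m.group())    # hahaha
print(m.group(1))   # ha
```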