Regular expressions

A useful tool for seeing what each part of a pattern means: https://regexr.com/

re.search() pattern

match = re.search(pat, str)

Basic patterns

  • a, X, 9, < – ordinary characters just match themselves exactly. Meta-characters do not, because they have special meanings: . ^ $ * + ? { [ ] \ | ( )

  • . (a period) – matches any single character except newline ‘\n’

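  • \w – (lowercase w) matches a “word” character: a letter, digit, or underscore [a-zA-Z0-9_]. For Python 3 str patterns it also matches non-ASCII letters and digits unless the re.ASCII flag is set.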

  • \W (upper case W) matches any non-word character

  • \b – boundary between word and non-word

  • \s – (lowercase s) matches a single whitespace character. \S (upper case S) matches any non-whitespace character

  • \t, \n, \r – tab, newline, carriage return

  • \d – decimal digit [0-9]

  • ^ = start, $ = end – match the start or end of the string

  • \ – inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash

import re
text = 'an example word:cat!!'  # named text so the built-in str is not shadowed
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:
  print('found', match.group())
else:
  print('did not find')
found word:cat
re.search(r'iii', 'piiig').group()
'iii'
re.search(r'..g', 'piiig')
<re.Match object; span=(2, 5), match='iig'>
re.search(r'\d\d\d', 'p123g')
<re.Match object; span=(1, 4), match='123'>
re.search(r'\w\w\w', '@@abcd!!') 
<re.Match object; span=(2, 5), match='abc'>
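
The examples above do not exercise \b, the ^ and $ anchors, or backslash escaping. A minimal sketch of those three; the sample string is made up for illustration:

```python
import re

# Hypothetical sample text, chosen so each pattern has something to hit or miss.
text = 'cat concatenate 3.14 3x14'

# \b marks a word boundary, so 'cat' inside 'concatenate' is skipped.
whole_word = re.findall(r'\bcat\b', text)

# ^ anchors the match at the start of the string, $ at the end.
starts_with_cat = re.search(r'^cat', text) is not None
ends_with_cat = re.search(r'cat$', text) is not None

# An unescaped . matches any character; \. matches only a literal period.
any_char = re.findall(r'3.14', text)
literal_dot = re.findall(r'3\.14', text)

print(whole_word)       # ['cat']
print(starts_with_cat)  # True
print(ends_with_cat)    # False
print(any_char)         # ['3.14', '3x14']
print(literal_dot)      # ['3.14']
```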

Repetition

The search first finds the leftmost match for the pattern, and then the repetition operators + and * use up as much of the string as possible (they are “greedy”).

  • + – 1 or more occurrences of the pattern to its left, e.g. ‘i+’ = one or more i’s
  • * – 0 or more occurrences of the pattern to its left
  • ? – match 0 or 1 occurrences of the pattern to its left
re.search(r'pi+', 'piiig')
<re.Match object; span=(0, 4), match='piii'>
re.search(r'i+', 'piigiiii')
<re.Match object; span=(1, 3), match='ii'>
re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
<re.Match object; span=(2, 9), match='1 2   3'>
re.search(r'b\w+', 'foobar')
<re.Match object; span=(3, 6), match='bar'>
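
The ? operator and the zero-occurrence case of * are not shown above. A small sketch with made-up strings, which also illustrates that leftmost takes priority over longest:

```python
import re

# ? makes the item to its left optional: 'color' and 'colour' both match,
# 'colr' does not because the 'o' before the optional 'u' is still required.
spellings = re.findall(r'colou?r', 'color colour colr')

# * allows zero occurrences, so 'pg' matches even with no 'i' present.
zero_ok = re.search(r'pi*g', 'pg').group()

# Leftmost wins over longest: the first run of digits is returned,
# even though a longer run appears later in the string.
first_run = re.search(r'\d+', 'a12b34567').group()

print(spellings)  # ['color', 'colour']
print(zero_ok)    # pg
print(first_run)  # 12
```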

Square Brackets

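[abc] matches ‘a’ or ‘b’ or ‘c’. Codes such as \w and \s keep their meaning inside square brackets, but a dot (.) just means a literal dot, and a dash placed last matches a literal dash.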

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
  print(match.group())
alice-b@google.com
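
Square brackets also support ranges and negation. A short sketch of both, using a made-up string:

```python
import re

# Hypothetical sample text for illustration.
text = 'gray grey groy 2024-05-17'

# A set matches exactly one character out of those listed.
variants = re.findall(r'gr[ae]y', text)

# A dash between two characters denotes a range; here, any ASCII digit.
digits = re.findall(r'[0-9]+', text)

# A leading ^ inverts the set: runs of anything except digits and dashes.
non_date = re.findall(r'[^0-9-]+', text)

print(variants)  # ['gray', 'grey']
print(digits)    # ['2024', '05', '17']
print(non_date)  # ['gray grey groy ']
```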

Group Extraction

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
  print(match.group())  ## 'alice-b@google.com' (the whole match)
  print(match.group(1))  ## 'alice-b' (the username, group 1)
  print(match.group(2))  ## 'google.com' (the host, group 2)
alice-b@google.com
alice-b
google.com
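
Besides group(n), the match object offers groups() and span(). A minimal sketch that reuses the email example:

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

if match:
    # groups() returns every captured group as one tuple.
    username, host = match.groups()
    print(username, host)  # alice-b google.com

    # span(n) gives the (start, end) indexes of group n within text.
    print(match.span(1))   # (7, 14)

    # start() and end() without arguments refer to the whole match.
    print(text[match.start():match.end()])  # alice-b@google.com
```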

findall

re.search() finds only the first match for a pattern. findall() finds all the matches and returns them as a list of strings.

## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
  # do something with each found email string
  print(email)
alice@google.com
bob@abc.com

findall with files

# Open the file; the with-block closes it automatically when done
with open('test.txt', 'r') as f:
  # Feed the file text into findall(); it returns a list of all the found strings
  strings = re.findall(r'some pattern', f.read())
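
A self-contained version of the same idea. This is a sketch only: the file contents are invented, and a temporary file stands in for test.txt so the snippet can run anywhere:

```python
import os
import re
import tempfile

# Create a throwaway file with invented contents (stand-in for test.txt).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alice@google.com\nno address here\nbob@abc.com\n')
    path = tmp.name

# Read the whole file into one string and hand it to findall().
with open(path, 'r') as f:
    emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

os.remove(path)  # clean up the temporary file

print(emails)  # ['alice@google.com', 'bob@abc.com']
```

f.read() loads the entire file into memory, which is fine for small files; for very large files, looping over lines and calling findall() on each one is an alternative as long as no match spans a line break.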

findall and groups

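If the pattern contains two or more parenthesis () groups, findall() returns a list of tuples instead of a list of strings. Each tuple holds group(1), group(2), and so on for one match.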

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in pairs:
  print(pair[0])  ## username
  print(pair[1])  ## host
[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
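
The shape of what findall() returns depends on how many groups the pattern has. A small sketch comparing zero, one, and two groups on a shortened sample string:

```python
import re

text = 'alice@google.com, bob@abc.com'

# No groups: each item is the whole match.
no_groups = re.findall(r'[\w.-]+@[\w.-]+', text)

# One group: each item is just that group's text.
one_group = re.findall(r'([\w.-]+)@[\w.-]+', text)

# Two or more groups: each item is a tuple of the groups.
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(no_groups)   # ['alice@google.com', 'bob@abc.com']
print(one_group)   # ['alice', 'bob']
print(two_groups)  # [('alice', 'google.com'), ('bob', 'abc.com')]
```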

Use of ? and ^ for greedy vs non-greedy

text = '<b>foo</b> and <i>so on</i>'
re.search(r'<.*>', text)
<re.Match object; span=(0, 27), match='<b>foo</b> and <i>so on</i>'>
re.findall(r'<.*?>', text)
['<b>', '</b>', '<i>', '</i>']
re.findall(r'[^>]*', text)
['<b', '', 'foo</b', '', ' and <i', '', 'so on</i', '', '']
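
On its own, [^>]* also produces empty matches, as the last result shows. Wrapping the negated set in literal angle brackets is a common way to get the same tags as the non-greedy version. A sketch on the same string:

```python
import re

text = '<b>foo</b> and <i>so on</i>'

# Non-greedy: .*? stops at the first > it can.
tags_lazy = re.findall(r'<.*?>', text)

# Negated set: [^>]* cannot run past a >, so greediness does no harm.
tags_negated = re.findall(r'<[^>]*>', text)

print(tags_lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(tags_negated)  # ['<b>', '</b>', '<i>', '</i>']
```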

Substitution

re.sub(pat, replacement, str)

re.sub() finds every non-overlapping match of the pattern in the given string and returns a new string with each match replaced.

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', text))
purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
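
Two variations on the same substitution, as a sketch: keeping the host with \2 instead of the username, and limiting the number of replacements with the count argument. The '***' mask is an arbitrary choice for illustration:

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pattern = r'([\w.-]+)@([\w.-]+)'

# Keep the host (\2) and replace only the username with a mask.
masked = re.sub(pattern, r'***@\2', text)

# count=1 limits the substitution to the first match.
first_only = re.sub(pattern, r'\1@yo-yo-dyne.com', text, count=1)

print(masked)
# purple ***@google.com, blah monkey ***@abc.com blah dishwasher
print(first_only)
# purple alice@yo-yo-dyne.com, blah monkey bob@abc.com blah dishwasher
```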

Non-capturing group

re.findall(r'(?:ha)+', 'hahaha haa hah!')
['hahaha', 'ha', 'ha']
re.findall(r'(ha)+', 'hahaha haa hah!')
['ha', 'ha', 'ha']
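
The two results differ because (?:...) groups without capturing, so findall() returns the whole match, whereas a capturing group that repeats keeps only its last repetition. A sketch on the same string, plus one way to capture the full run while still grouping 'ha' for the + operator:

```python
import re

text = 'hahaha haa hah!'

# Non-capturing group: findall() returns each whole match.
whole = re.findall(r'(?:ha)+', text)

# Capturing group under +: only the last repetition is kept per match.
last_repetition = re.findall(r'(ha)+', text)

# Outer capturing group around a non-capturing one: the full run is captured.
full_run = re.findall(r'((?:ha)+)', text)

# A pattern with only non-capturing groups has no numbered groups at all.
no_groups = re.search(r'(?:ha)+', text).groups()

print(whole)            # ['hahaha', 'ha', 'ha']
print(last_repetition)  # ['ha', 'ha', 'ha']
print(full_run)         # ['hahaha', 'ha', 'ha']
print(no_groups)        # ()
```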