Python Regular Expression Examples

These examples are adapted from Chapter 11 of the online book Python for Everybody, authored by C. R. Severance. https://www.py4e.com/html3/11-regex

The sample data is a short mailbox file containing email messages. https://www.py4e.com/code3/mbox-short.txt

In [1]:
# Import Python's regular expression module
import re
In [2]:
# Search for lines that contain 'From:'
# Remember to change the file path to the location of mbox-short.txt on your computer
with open('mbox-short.txt') as hand:  # the file is closed automatically when the block ends
    for line in hand:
        line = line.rstrip()  # remove trailing whitespace, including the newline
        if re.search('From:', line):
            print(line)
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
In [3]:
# Search for lines that start with 'From:'
line = "edu From: rjlowe@iupui.edu"
if re.search('^From:', line):
    print("line starts with 'From:'")
else:
    print("line does not start with 'From:'")
line does not start with 'From:'
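A short sketch, not part of the original notebook, comparing three ways to test the same line. It reuses the `line` value from the cell above; the variable names are illustrative only. `re.search` looks anywhere in the string unless the pattern is anchored with `^`, while `re.match` only ever tries at the start of the string.

```python
import re

line = "edu From: rjlowe@iupui.edu"

# Unanchored search: 'From:' may appear anywhere in the line
anywhere = re.search(r'From:', line) is not None

# Anchored search: '^' requires the match to begin at the start of the line
at_start = re.search(r'^From:', line) is not None

# re.match only tries at position 0, so it behaves like the anchored search here
via_match = re.match(r'From:', line) is not None

print(anywhere, at_start, via_match)  # True False False
```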
In [4]:
# Search for lines that start with 'From:' and contain an at sign
line = "From: rjlowe@iupui.edu"
# "." matches any character except a newline, "+" means one or more of the
# preceding item, "*" means zero or more of the preceding item
if re.search('^From:.+@', line):
    print("line starts with 'From:' and has an at sign")
else:
    print("line does not contain the pattern")
line starts with 'From:' and has an at sign
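The difference between "+" and "*" only shows up when nothing sits between 'From:' and the at sign. The sketch below is an illustration under that assumption; the second test string is made up for the purpose and does not come from the sample file.

```python
import re

with_text = "From: rjlowe@iupui.edu"
no_text = "From:@iupui.edu"  # hypothetical line with nothing between 'From:' and '@'

# '.+' needs at least one character before the '@'
plus_hits = [bool(re.search(r'^From:.+@', s)) for s in (with_text, no_text)]

# '.*' also accepts zero characters before the '@'
star_hits = [bool(re.search(r'^From:.*@', s)) for s in (with_text, no_text)]

print(plus_hits)  # [True, False]
print(star_hits)  # [True, True]
```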
In [5]:
# Extract data using a regular expression
# Extract email addresses: a non-whitespace string before @ and another after @
# As a regular expression, the pattern is \S+@\S+
# "@2PM" does not match because no non-whitespace character comes directly before the @

line = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
items = re.findall(r'\S+@\S+', line)  # "\S" matches one non-whitespace character; the r prefix keeps the backslash literal
print(items)
['csev@umich.edu', 'cwen@iupui.edu']
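One limitation of \S+@\S+ is that it also picks up punctuation attached to an address. The sketch below uses a made-up line to show this, along with one possible tighter pattern that requires the match to start with a letter or digit and end with a letter. It is an illustration only, not a complete email validator.

```python
import re

# Hypothetical line with punctuation touching the addresses
line = 'Contact <csev@umich.edu>, or cwen@iupui.edu; thanks'

# The loose pattern keeps the surrounding punctuation
loose = re.findall(r'\S+@\S+', line)

# Start on a letter or digit, end on a letter, non-whitespace in between
tight = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)

print(loose)  # ['<csev@umich.edu>,', 'cwen@iupui.edu;']
print(tight)  # ['csev@umich.edu', 'cwen@iupui.edu']
```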
In [6]:
# Search and extract data using a regular expression
# If you only want the string after @ but need the whole pattern to locate it,
# put parentheses around that part of the pattern: \S+@(\S+).
# re.findall then returns only the parenthesized part
line = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
items = re.findall(r'\S+@(\S+)', line)  # "\S" matches one non-whitespace character
print(items)
['umich.edu', 'iupui.edu']
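When a pattern has more than one pair of parentheses, re.findall returns a tuple per match with one entry per group. A minimal sketch on the same line, with illustrative variable names:

```python
import re

line = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# Two groups: the part before '@' and the part after it
pairs = re.findall(r'(\S+)@(\S+)', line)
print(pairs)  # [('csev', 'umich.edu'), ('cwen', 'iupui.edu')]

# Unpack each tuple to work with the user and domain separately
domains = [domain for user, domain in pairs]
print(domains)  # ['umich.edu', 'iupui.edu']
```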
In [7]:
# Common regular expression elements
# '^' - beginning of the line
# '$' - end of the line
# '.' - any single character (except a newline)
# '\d' - a single digit
# '*' - zero or more occurrences of the preceding item
# '+' - one or more occurrences of the preceding item
# '\S' - a single non-whitespace character
# '[a-z]' - a single lowercase letter
# '[A-Z]' - a single uppercase letter

line = "123abc456DEF"

# find the entire line
items = re.findall('^.*$', line) 
print(items)

# find all runs of digits
items = re.findall(r'(\d+)', line)
print(items)

# find all strings that begin with one or more digits and end with one or more letters
items = re.findall(r'(\d+[a-zA-Z]+)', line)
print(items)
print(items)
['123abc456DEF']
['123', '456']
['123abc', '456DEF']
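The elements listed above each match a single character unless they are followed by "*" or "+". The sketch below, added for illustration, applies a few of them to the same line one at a time; the variable names are not from the original notebook.

```python
import re

line = "123abc456DEF"

# '\d' without '+' matches one digit at a time
single_digits = re.findall(r'\d', line)

# '[a-z]+' matches a run of one or more lowercase letters
lowercase_runs = re.findall(r'[a-z]+', line)

# '[A-Z]+$' matches uppercase letters only at the end of the line
trailing_upper = re.findall(r'[A-Z]+$', line)

print(single_digits)   # ['1', '2', '3', '4', '5', '6']
print(lowercase_runs)  # ['abc']
print(trailing_upper)  # ['DEF']
```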

Exercise

In [12]:
# find all strings that begin with one or more digits and end with one or more lowercase letters
# the answer is '123abc'

line = "123abc456DEF"

# your code starts here
items = re.findall(r'(\d+[a-z]+)', line)
print(items)

# your code ends here
['123abc']
In [15]:
# find the digits at the beginning of the line
# the answer is '123'

line = "123abc456DEF"

# your code starts here
items = re.findall(r'(^\d+)', line)
print(items)

# your code ends here
['123']
In [28]:
# find the digits between letters
# the answer is '456'

line = "123abc456DEF"

# your code starts here
items = re.findall(r'[a-z](\d+)[A-Za-z]', line)  # digits with a lowercase letter before and a letter after
print(items)

# your code ends here
['456']
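An alternative way to express "digits between letters" is with lookaround assertions, which check the neighboring characters without including them in the match. This technique is not used in the notebook above; the sketch is offered only as one possible variation.

```python
import re

line = "123abc456DEF"

# (?<=[A-Za-z]) requires a letter just before the digits,
# (?=[A-Za-z]) requires a letter just after them;
# neither letter becomes part of the returned match
between = re.findall(r'(?<=[A-Za-z])\d+(?=[A-Za-z])', line)
print(between)  # ['456']
```

Because there are no capturing groups here, re.findall returns the whole match, which is just the digits.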