Regular expressions, often abbreviated as regex or regexp, are powerful and concise sequences of characters that define a search pattern. For example, consider a scenario where you want to check whether a given string is a valid email address. Using regex, you can do this as follows:
import re
def is_valid_email(email):
    # Define a RegEx for a simple email validation
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # Use re.match to check if the email matches the pattern
    match = re.match(email_pattern, email)
    # If there is a match, the email is valid
    return bool(match)
# Test the function with some examples
email1 = "user@example.com"
email2 = "invalid.email@com"
email3 = "missing@dotcom"
print(f"{email1} is valid: {is_valid_email(email1)}") # Output: True
## user@example.com is valid: True
print(f"{email2} is valid: {is_valid_email(email2)}") # Output: False
## invalid.email@com is valid: False
print(f"{email3} is valid: {is_valid_email(email3)}") # Output: False
## missing@dotcom is valid: False
In this example, the regular expression `'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'` represents an email validation pattern, and the `match` function from the `re` module checks whether the string matches it. If it does, `is_valid_email` returns `True`; otherwise it returns `False`.
This is a basic illustration of how RegEx works in Python. As we can see, a regular expression lets us represent a possibly infinite set of strings, namely all the strings that match the pattern, in a compact and efficient way.
Formal Definition of RegEx
The formal definition of regular expressions is inductive. Suppose that we have a finite alphabet \(\Sigma\). We start by specifying the following as regular expressions:
- \(\emptyset\): The empty set
- \(\varepsilon\): The set containing only the empty string `""`
- Literal character \(a\): The one-element set \(\{a\}\), for \(a \in \Sigma\)
From these basic expressions, we can build more complex regular expressions using the following three operations:
- Concatenation: If `R` and `S` are regular expressions, `RS` denotes the set of strings that can be formed by concatenating a string from `R` and a string from `S`. For example, if `R` matches `good` and `bad`, and `S` matches `boy` and `girl`, then `RS` matches `goodboy`, `goodgirl`, `badboy`, and `badgirl`.
- Alternation: If `R` and `S` are regular expressions, `R|S` denotes the set of strings that match either `R` or `S`. For example, if `R` matches `good` and `bad`, and `S` matches `boy` and `girl`, then `R|S` matches `good`, `bad`, `boy`, and `girl`.
- Kleene star: If `R` is a regular expression, `R*` denotes the set of strings that can be formed by concatenating any finite number (including zero) of strings from `R`. For example, if `R` matches `good` and `bad`, then `R*` matches the empty string, `good`, `bad`, `goodgood`, `goodbad`, `badgood`, `badbad`, and so on.
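The three operations can be tried out directly in Python. The following is a minimal sketch, not a formal construction: the helper `accepts` and the variable names are made up for illustration, and `fullmatch` is used so that the whole string has to belong to the set described by the pattern.

```python
import re

# R matches "good" or "bad"; S matches "boy" or "girl" (the sets used in the text).
# (?:...) is a non-capturing group, used here only to keep each set together.
R = r"(?:good|bad)"
S = r"(?:boy|girl)"

concatenation = re.compile(R + S)             # RS
alternation = re.compile(R + "|" + S)         # R|S
kleene_star = re.compile("(?:" + R + ")*")    # R*


def accepts(pattern, text):
    """Return True if the whole string belongs to the set described by pattern."""
    return pattern.fullmatch(text) is not None


print(accepts(concatenation, "goodgirl"))     # True
print(accepts(alternation, "boy"))            # True
print(accepts(alternation, "goodboy"))        # False
print(accepts(kleene_star, ""))               # True (zero repetitions)
print(accepts(kleene_star, "goodbadgood"))    # True
```

Note that `R*` accepts the empty string, because "any finite number" of repetitions includes zero.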
The order of precedence among these three operations is as follows: the Kleene star binds most tightly, followed by concatenation, and then alternation.
In addition, if you need a different grouping, you can use parentheses `()`. For example:
- `a|b*`: \(\{\varepsilon, \texttt{a}, \texttt{b}, \texttt{bb}, \texttt{bbb}, \ldots\}\)
- `(a|b)*`: The set of all strings containing only `a` and `b`, \(\{\varepsilon, \texttt{a}, \texttt{b}, \texttt{aa}, \texttt{ab}, \texttt{ba}, \texttt{bb}, \texttt{aaa}, \ldots\}\)
- `ab*(c|ε)`: The set of strings starting with a single `a` followed by zero or more `b`'s, optionally ending with a `c`, \(\{\texttt{a}, \texttt{ac}, \texttt{ab}, \texttt{abc}, \texttt{abb}, \texttt{abbc}, \ldots\}\)
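The effect of precedence and grouping can be checked with a short sketch. The helper `accepts` is hypothetical, and the empty alternative in `(?:c|)` is used here as a stand-in for \(\varepsilon\) (in practice one would usually write `c?`).

```python
import re


def accepts(pattern, text):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# a|b* means a|(b*): either a single "a", or any number of "b"s.
print(accepts(r"a|b*", "bbb"))         # True
print(accepts(r"a|b*", "ab"))          # False

# (a|b)* allows any mix of "a" and "b".
print(accepts(r"(a|b)*", "ab"))        # True

# ab*(c|ε): the empty alternative plays the role of epsilon.
print(accepts(r"ab*(?:c|)", "abbc"))   # True
print(accepts(r"ab*(?:c|)", "a"))      # True
```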
Practical Implementation of RegEx in Python
In practical implementations of regular expressions, many additional symbols and operators exist. However, these are largely shortcuts for common operations that would be cumbersome to express using only the three fundamental operations of the formal definition.
Quantification Operations
- `*`: Same as in the formal definition: zero or more occurrences of the preceding element.
- `?`: Zero or one occurrence of the preceding element. E.g., `colou?r` matches `color` and `colour`.
- `+`: One or more occurrences of the preceding element.
- `{m}`: Exactly `m` occurrences of the preceding element.
- `{m,}`: At least `m` occurrences of the preceding element.
- `{m,n}`: Between `m` and `n` occurrences of the preceding element, inclusive.
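As a quick illustration of the quantifiers, the sketch below tests a few strings against each one. The helper `full` is a hypothetical name; it uses `re.fullmatch` so that the quantifier has to account for the entire string.

```python
import re


def full(pattern, text):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(full(r"colou?r", "color"), full(r"colou?r", "colour"))  # True True
print(full(r"colou?r", "colouur"))                            # False

print(full(r"ab+", "a"), full(r"ab+", "abbb"))                # False True
print(full(r"a{3}", "aaa"), full(r"a{3}", "aaaa"))            # True False
print(full(r"a{2,}", "a"), full(r"a{2,}", "aaaa"))            # False True
print(full(r"a{2,3}", "aaa"), full(r"a{2,3}", "aaaa"))        # True False
```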
Alternative to |
Instead of using `|` and listing every possible character in a regular expression, you can use the following to express the same thing more concisely.
- `[]`: Matches any single character inside the brackets.
- `[^ ]`: Negation, matches any single character except those inside the brackets.
- `.`: Wildcard, matches any single character (in Python, any character except a newline by default).
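A small sketch of these three constructs, using a made-up sample string. `re.findall` returns every non-overlapping match as a list.

```python
import re

text = "cat, cot, cut, c t, c9t"   # made-up sample text

in_set = re.findall(r"c[aou]t", text)       # same as c(a|o|u)t, but shorter
not_in_set = re.findall(r"c[^aou]t", text)  # any character except a, o, u
any_char = re.findall(r"c.t", text)         # any single character

print(in_set)      # ['cat', 'cot', 'cut']
print(not_in_set)  # ['c t', 'c9t']
print(any_char)    # ['cat', 'cot', 'cut', 'c t', 'c9t']
```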
Anchoring
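- `^` (not inside square brackets) means that what comes after must be at the start of a line.
- `$` means that what comes before must be at the end of a line.
- In Python, `^` and `$` anchor to the start and end of the whole string by default (`$` also matches just before a trailing newline); with the `re.MULTILINE` flag they match at the start and end of every line.
- `\<` anchors to the beginning of a word and `\>` anchors to the end of a word in some regex engines. Python's `re` module does not support these two anchors; it uses `\b` to anchor to a word boundary instead.
- Note that when you write such a pattern as an ordinary string literal, you have to escape the `\` itself (`"\\b"`), or use a raw string (`r"\b"`).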
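The sketch below shows anchoring as Python's `re` module implements it, on made-up sample strings. Two points are worth noting: `^` and `$` only match at every line when the `re.MULTILINE` flag is passed, and Python uses `\b` for word boundaries rather than `\<` and `\>`.

```python
import re

text = "first line\nsecond line"   # made-up two-line sample

start_default = re.findall(r"^\w+", text)                 # start of string only
start_multiline = re.findall(r"^\w+", text, re.MULTILINE) # start of every line
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)   # end of every line

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(end_multiline)    # ['line', 'line']

# \b matches at a word boundary, so "lines" and "outline" are skipped.
whole_words = re.findall(r"\bline\b", "line lines outline line")
print(whole_words)      # ['line', 'line']
```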
Greedy Quantification
By default, quantifiers are greedy, meaning they match the longest substring possible. We can make them have the opposite behavior by appending the `?` character to them: in that case, they match the shortest substring possible.
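The difference is easiest to see on a string with several possible end points. This is a small sketch on a made-up HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # made-up sample text

greedy = re.findall(r"<.+>", html)   # .+ grabs as much as it can
lazy = re.findall(r"<.+?>", html)    # .+? stops at the first possible ">"

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

The same `?` suffix works on the other quantifiers as well (`*?`, `??`, `{m,n}?`).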
The re module
In Python, the `re` module provides functions for working with regular expressions.
import re
Let’s delve into some commonly used functions in the module with an example text:
zen_of_python = '''
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
'''
`match.group()`: Retrieves the part of the string that matched a capturing group in a regular expression search. It is a method of the match object returned by `re` module functions such as `re.search()` and `re.match()`, not a function of the `re` module itself.
# "(?P<text>\w+)" searches for one or more word characters (\w+).
# (?P<text> marks the beginning of a named capture group definition.
# The matched characters are captured into a named group called "text".
# ) marks the end of the named capture group definition.
match = re.search(r"(?P<text>\w+)", zen_of_python)
# Access the entire match (no capturing group specified)
entire_match = match.group()
print(entire_match)
## The
To access the captured group content (group number 1 or named group “text”):
matched_string = match.group(1) # Using group number, 1-based indexing
matched_string = match.group("text") # Using named group
print(matched_string)
## The
match = re.search(r"(?P<language>\w+)", zen_of_python) # Capturing group named "language"
# Access the entire match (no capturing group specified)
entire_match = match.group()
print(entire_match)
## The
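When a pattern has more than one group, the same method can pick out each part. The following sketch uses two hypothetical group names, `first` and `second`, on a short sample string; `groups()` and `groupdict()` are other match object methods that return all captured groups at once.

```python
import re

match = re.search(r"(?P<first>\w+) (?P<second>\w+)", "The Zen of Python")

print(match.group(0))         # The Zen  (group 0 is the entire match)
print(match.group("first"))   # The
print(match.group(2))         # Zen
print(match.groups())         # ('The', 'Zen')
print(match.groupdict())      # {'first': 'The', 'second': 'Zen'}
```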
`re.search(pattern, string)`: Finds the first occurrence of the pattern in the string, regardless of position. Returns a match object if found, `None` otherwise.
result = re.search(r"Python", zen_of_python)
print(result.group()) # Output: Python
## Python
`re.match(pattern, string)`: Checks if the pattern matches at the beginning of the string. Returns a match object if found at the start, `None` otherwise.
result = re.match(r"Python", zen_of_python)
print(result.group()) # Raises AttributeError: result is None because the text does not start with "Python"
## AttributeError: 'NoneType' object has no attribute 'group'
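Because both functions return `None` when nothing matches, it is safer to check the result before calling `group()`. A minimal sketch, using a shortened stand-in for the example text and a hypothetical helper named `describe`:

```python
import re

text = "\nThe Zen of Python"   # begins with a newline, like zen_of_python


def describe(result):
    """Format a match object, or report that there was no match."""
    if result is None:
        return "no match"
    return f"found {result.group()!r} at {result.span()}"


print(describe(re.match(r"Python", text)))    # no match
print(describe(re.search(r"Python", text)))   # found 'Python' at (12, 18)
```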
`re.findall(pattern, string)`: Finds all non-overlapping matches in the string. Returns a list of the matching strings in the order they were found, including any duplicates.
result = re.findall(r"\w+", zen_of_python)
print(result) # Output: list of all words, including duplicates
## ['The', 'Zen', 'of', 'Python', 'by', 'Tim', 'Peters', 'Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex', 'Complex', 'is', 'better', 'than', 'complicated', 'Flat', 'is', 'better', 'than', 'nested', 'Sparse', 'is', 'better', 'than', 'dense', 'Readability', 'counts', 'Special', 'cases', 'aren', 't', 'special', 'enough', 'to', 'break', 'the', 'rules', 'Although', 'practicality', 'beats', 'purity', 'Errors', 'should', 'never', 'pass', 'silently', 'Unless', 'explicitly', 'silenced', 'In', 'the', 'face', 'of', 'ambiguity', 'refuse', 'the', 'temptation', 'to', 'guess', 'There', 'should', 'be', 'one', 'and', 'preferably', 'only', 'one', 'obvious', 'way', 'to', 'do', 'it', 'Although', 'that', 'way', 'may', 'not', 'be', 'obvious', 'at', 'first', 'unless', 'you', 're', 'Dutch', 'Now', 'is', 'better', 'than', 'never', 'Although', 'never', 'is', 'often', 'better', 'than', 'right', 'now', 'If', 'the', 'implementation', 'is', 'hard', 'to', 'explain', 'it', 's', 'a', 'bad', 'idea', 'If', 'the', 'implementation', 'is', 'easy', 'to', 'explain', 'it', 'may', 'be', 'a', 'good', 'idea', 'Namespaces', 'are', 'one', 'honking', 'great', 'idea', 'let', 's', 'do', 'more', 'of', 'those']
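Two details of `findall` are worth a short sketch on a made-up two-sentence sample: when the pattern contains capturing groups, the list holds the groups rather than the whole matches, and duplicates have to be removed explicitly (here with a `set`) if they are not wanted.

```python
import re

line = "Beautiful is better than ugly. Explicit is better than implicit."

# No groups: each list item is the whole match.
phrases = re.findall(r"\w+ is better than \w+", line)
print(phrases)
# ['Beautiful is better than ugly', 'Explicit is better than implicit']

# Two capturing groups: each list item is a tuple of the groups.
pairs = re.findall(r"(\w+) is better than (\w+)", line)
print(pairs)
# [('Beautiful', 'ugly'), ('Explicit', 'implicit')]

# findall keeps duplicates; deduplicate explicitly if needed.
words = re.findall(r"\w+", line)
unique_words = set(word.lower() for word in words)
print(len(words), len(unique_words))   # 10 7
```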
`re.finditer(pattern, string)`: Similar to `findall`, but returns an iterator of match objects for better memory efficiency when dealing with large texts.
matches = re.finditer(r"\w+", zen_of_python)
for match in matches:
    print(match.group())
## The
## Zen
## of
## Python
## by
## Tim
## Peters
## Beautiful
## is
## better
## than
## ugly
## Explicit
## is
## better
## than
## implicit
## Simple
## is
## better
## than
## complex
## Complex
## is
## better
## than
## complicated
## Flat
## is
## better
## than
## nested
## Sparse
## is
## better
## than
## dense
## Readability
## counts
## Special
## cases
## aren
## t
## special
## enough
## to
## break
## the
## rules
## Although
## practicality
## beats
## purity
## Errors
## should
## never
## pass
## silently
## Unless
## explicitly
## silenced
## In
## the
## face
## of
## ambiguity
## refuse
## the
## temptation
## to
## guess
## There
## should
## be
## one
## and
## preferably
## only
## one
## obvious
## way
## to
## do
## it
## Although
## that
## way
## may
## not
## be
## obvious
## at
## first
## unless
## you
## re
## Dutch
## Now
## is
## better
## than
## never
## Although
## never
## is
## often
## better
## than
## right
## now
## If
## the
## implementation
## is
## hard
## to
## explain
## it
## s
## a
## bad
## idea
## If
## the
## implementation
## is
## easy
## to
## explain
## it
## may
## be
## a
## good
## idea
## Namespaces
## are
## one
## honking
## great
## idea
## let
## s
## do
## more
## of
## those
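Since `finditer` yields match objects rather than plain strings, each result also carries its position in the text. A small sketch on one line of the example, with `spans` as a made-up variable name:

```python
import re

text = "Now is better than never."

# Collect (start, end, matched text) for every word.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", text)]

for start, end, word in spans:
    print(start, end, word)
# 0 3 Now
# 4 6 is
# 7 13 better
# 14 18 than
# 19 24 never
```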