Regular Expressions

Regular expressions, often abbreviated as regex or regexp, are powerful and concise sequences of characters that define a search pattern. For example, suppose you want to check whether a given string is a valid email address. Using regex, you can do this as follows:

import re

def is_valid_email(email):
    
    # Define a RegEx for a simple email validation
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    
    # Use re.match to check if the email matches the pattern
    match = re.match(email_pattern, email)
    
    # If there is a match, the email is valid
    return bool(match)
  
# Test the function with some examples
email1 = "user@example.com"
email2 = "invalid.email@com"
email3 = "missing@dotcom"

print(f"{email1} is valid: {is_valid_email(email1)}") # Output: True
## user@example.com is valid: True
print(f"{email2} is valid: {is_valid_email(email2)}") # Output: False
## invalid.email@com is valid: False
print(f"{email3} is valid: {is_valid_email(email3)}") # Output: False
## missing@dotcom is valid: False


In this example, the regular expression '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' represents a simple email validation pattern. The match function from the re module[1] tries to match the pattern at the beginning of the string: it returns a match object if the pattern matches and None otherwise, and bool() converts that result to True or False.

This is a basic illustration of how RegEx works in Python. As we can see, a regular expression lets us represent an infinite set of strings that match the pattern in a compact way[2].



Formal Definition of RegEx

The formal definition of regular expressions is inductive. Suppose that we have a finite alphabet \(\Sigma\). We start by specifying the following as regular expressions:

  • \(\emptyset\): The empty set
  • \(\varepsilon\): The set containing the empty string ""
  • Literal character \(a\): The one-element set \(\{a\}\), for \(a \in \Sigma\)

From these basic expressions, we can build more complex regular expressions using the following three operations:

  • Concatenation: If R and S are regular expressions, RS denotes the set of strings that can be formed by concatenating a string from R and a string from S. For example, if R matches good and bad, and S matches boy and girl, then RS matches goodboy, goodgirl, badboy, and badgirl.
  • Alternation: If R and S are regular expressions, R|S denotes the set of strings that match either R or S. For example, if R matches good and bad, and S matches boy and girl, then R|S matches good, bad, boy, and girl.
  • Kleene star: If R is a regular expression, R* denotes the set of strings that can be formed by concatenating any finite number (including zero) of strings from R. For example, if R matches good and bad, then R* matches the empty string, good, bad, goodgood, goodbad, badgood, badbad, and so on.
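The three operations above can be sketched on finite sets of strings. This is only an illustration: the helper names concat, alternate, and star are made up for this example, and star is cut off at a maximum number of repetitions (max_repeats, also an assumption) because the real R* is infinite.

```python
from itertools import product

def concat(R, S):
    """All strings formed by a string from R followed by a string from S."""
    return {r + s for r, s in product(R, S)}

def alternate(R, S):
    """All strings that belong to R or to S."""
    return R | S

def star(R, max_repeats):
    """Kleene star truncated at max_repeats copies (the true R* is infinite)."""
    result = {""}    # zero copies: the empty string
    current = {""}
    for _ in range(max_repeats):
        current = concat(current, R)  # one more copy of a string from R
        result |= current
    return result

R = {"good", "bad"}
S = {"boy", "girl"}

print(sorted(concat(R, S)))
# ['badboy', 'badgirl', 'goodboy', 'goodgirl']
print(sorted(alternate(R, S)))
# ['bad', 'boy', 'girl', 'good']
print(sorted(star(R, 2)))
# ['', 'bad', 'badbad', 'badgood', 'good', 'goodbad', 'goodgood']
```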

The order of precedence among these three operations is as follows: Kleene star binds tightest, followed by concatenation, and then alternation.

In addition, if you need a different grouping for the string sets, you may use the parentheses (). For example:

  • a|b*: \(\{\varepsilon, a, b, bb, bbb, \ldots\}\)
  • (a|b)*: The set of all strings containing only a and b, \(\{\varepsilon, a, b, aa, ab, ba, bb, aaa, \ldots\}\)
  • ab*(c|ε): The set of strings starting with a single a followed by zero or more b’s, optionally ending with a c, \(\{a, ac, ab, abc, abb, abbc, \ldots\}\)
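These three sets can be checked in Python with re.fullmatch, which succeeds only when the whole string matches the pattern. This is a minimal sketch: the helper in_language and the list of candidate strings are made up for this example, and the empty alternative in (c|) stands in for ε.

```python
import re

def in_language(pattern, s):
    """True if the whole string s belongs to the set described by pattern."""
    return re.fullmatch(pattern, s) is not None

candidates = ["", "a", "b", "bb", "ab", "ba", "ac", "abbc"]

for pattern in [r"a|b*", r"(a|b)*", r"ab*(c|)"]:
    members = [s for s in candidates if in_language(pattern, s)]
    print(pattern, members)
# a|b* ['', 'a', 'b', 'bb']
# (a|b)* ['', 'a', 'b', 'bb', 'ab', 'ba']
# ab*(c|) ['a', 'ab', 'ac', 'abbc']
```

Without the parentheses, a|b* is read as a|(b*), so a string such as "ab" is not a member.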



Practical Implementation of RegEx in Python

In practical implementations of regular expressions, many additional symbols and operators exist. However, these are largely shortcuts for common operations that would be cumbersome to express solely using the three fundamental operations in the formal definition.


Quantification Operations

  • *: Same as in the formal definition: zero or more times.
  • ?: Zero or one occurrence of the preceding element. E.g., colou?r matches color and colour
  • +: One or more occurrences of the preceding element
  • {m}: Exactly m occurrences of the preceding element
  • {m,}: At least m occurrences of the preceding element
  • {m,n}: Between m and n occurrences of the preceding element, inclusive.
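The quantifiers above can be tried out with re.fullmatch, which requires the entire string to match. The sample strings below are made up for illustration.

```python
import re

# ? : the "u" is optional
spellings = ["color", "colour", "colouur"]
print([s for s in spellings if re.fullmatch(r"colou?r", s)])
# ['color', 'colour']

# * allows zero occurrences, + requires at least one
print(bool(re.fullmatch(r"ab*", "a")))   # True
print(bool(re.fullmatch(r"ab+", "a")))   # False

# {m}, {m,} and {m,n} count repetitions of the preceding element
digits = ["1", "12", "123", "1234", "12345"]
print([d for d in digits if re.fullmatch(r"\d{3}", d)])    # ['123']
print([d for d in digits if re.fullmatch(r"\d{3,}", d)])   # ['123', '1234', '12345']
print([d for d in digits if re.fullmatch(r"\d{2,4}", d)])  # ['12', '123', '1234']
```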


Alternative to |

Instead of using | to list every possible character in a regular expression, you can use the following constructs to express the same thing more concisely.

  • []: Matches any single character inside the brackets.
  • [^ ]: Negation, matches anything except the set of characters inside the brackets.
  • .: Wildcard, matches any character except a newline (in Python, the re.DOTALL flag makes it match newlines too).
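A short sketch of these three constructs, using a made-up sample string:

```python
import re

text = "cat, cot, cut, c9t, c t"

# [aou] matches exactly one of the listed characters
print(re.findall(r"c[aou]t", text))    # ['cat', 'cot', 'cut']

# [^aou] matches any single character that is NOT listed
print(re.findall(r"c[^aou]t", text))   # ['c9t', 'c t']

# . matches any single character except a newline
print(re.findall(r"c.t", text))        # ['cat', 'cot', 'cut', 'c9t', 'c t']

# With re.DOTALL the wildcard also matches a newline
print(re.findall(r"c.t", "c\nt"))                    # []
print(re.findall(r"c.t", "c\nt", flags=re.DOTALL))   # ['c\nt']
```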


Anchoring

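  • ^ (not inside square brackets) means that what comes after must be at the start of the string, or at the start of each line when the re.MULTILINE flag is set.
  • $ means that what comes before must be at the end of the string (or just before a trailing newline), or at the end of each line when the re.MULTILINE flag is set.
  • \< anchors to the beginning of a word and \> anchors to the end of a word in tools such as GNU grep. Python's re module does not support these two anchors; it uses \b instead, which matches at either boundary of a word.
  • Note that when you write a pattern containing a backslash as an ordinary Python string, you have to escape the \ (for example "\\b"). A raw string such as r"\b" avoids this.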
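The anchors can be sketched in Python as follows. Note that Python's re module spells the word anchor as \b rather than \< and \>, and that ^ and $ only work line by line when re.MULTILINE is passed. The sample text is made up for illustration.

```python
import re

text = "cat concat\ncatalog"

# Without flags, ^ and $ anchor to the start and end of the whole string
print(re.findall(r"^cat", text))                       # ['cat']
print(re.findall(r"cat$", text))                       # []

# With re.MULTILINE they also anchor at each line break
print(re.findall(r"^cat", text, flags=re.MULTILINE))   # ['cat', 'cat']
print(re.findall(r"cat$", text, flags=re.MULTILINE))   # ['cat']

# \b matches at a word boundary; the raw string keeps the backslash intact
print(re.findall(r"\bcat\b", text))                    # ['cat']
print(re.findall(r"\bcat", text))                      # ['cat', 'cat']
```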


Greedy Quantification

By default, quantifiers are greedy, meaning they match the longest substring possible. We can give them the opposite behavior by appending the ? character to the quantifier (*?, +?, ??, {m,n}?): in that case, they are lazy and match the shortest substring possible.
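The difference is easiest to see on a string with several possible end points. The HTML-like sample below is made up for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the LAST ">" in the string
greedy = re.findall(r"<.+>", html)
print(greedy)
# ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the FIRST ">" it can reach
lazy = re.findall(r"<.+?>", html)
print(lazy)
# ['<b>', '</b>', '<i>', '</i>']
```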


The re module

In Python, the re module provides functions for working with regular expressions.

import re


Let’s delve into some commonly used functions in the module with an example text:

zen_of_python = '''
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
'''


match.group(): Retrieves the part of the string matched by the whole pattern or by one of its capturing groups. It is a method of the match object returned by re module functions like re.search() and re.match(), not a function of the re module itself.

# "(?P<text>\w+)" searches for one or more word characters (\w+).
# (?P<text> marks the beginning of a named capture group called "text".
# \w+ is the part of the pattern whose match is captured by that group.
# ) marks the end of the named capture group definition.
match = re.search(r"(?P<text>\w+)", zen_of_python)

# Access the entire match (no capturing group specified)
entire_match = match.group()
print(entire_match) 
## The


To access the captured group content (group number 1 or named group “text”):

matched_string = match.group(1) # Using group number, 1-based indexing
matched_string = match.group("text") # Using named group
print(matched_string)
## The


match = re.search(r"(?P<language>\w+)", zen_of_python) # Capturing group named "language"

# The group name is arbitrary: the same first word is captured under the new name
language_match = match.group("language")
print(language_match)
## The


re.search(pattern, string): Finds the first occurrence of the pattern in the string, regardless of position. Returns a match object if found, None otherwise.

result = re.search(r"Python", zen_of_python)
print(result.group()) # Output: Python
## Python
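Besides group(), the match object returned by re.search also records where the match was found. A minimal sketch on the first line of the text (the variable names title and missing are made up for this example):

```python
import re

title = "The Zen of Python, by Tim Peters"

result = re.search(r"Python", title)
print(result.group())                        # Python
print(result.start(), result.end())          # 11 17
print(result.span())                         # (11, 17)
print(title[result.start():result.end()])    # Python

# When the pattern does not occur anywhere, search returns None
missing = re.search(r"Ruby", title)
print(missing)                               # None
```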


re.match(pattern, string): Checks if the pattern matches at the beginning of the string. Returns a match object if found at the start, None otherwise.

# zen_of_python starts with a newline, so "Python" is not at the beginning
result = re.match(r"Python", zen_of_python)
print(result) # Output: None
## None
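Because re.match returns None when the pattern is not at the very start, it is safer to test the result before calling group() on it. A minimal sketch, using a hypothetical helper first_word_at_start and a short sample string that, like zen_of_python, begins with a newline:

```python
import re

text = "\nThe Zen of Python"   # like zen_of_python, starts with a newline

def first_word_at_start(s):
    """Return the word at the very start of s, or None if there is none."""
    m = re.match(r"\w+", s)
    return m.group() if m else None

print(first_word_at_start(text))           # None
print(first_word_at_start(text.lstrip()))  # The

# re.search is not tied to the start, so it finds the word either way
print(re.search(r"\w+", text).group())     # The
```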


re.findall(pattern, string): Finds all non-overlapping matches in the string. Returns a list of the matching strings in the order they are found, including duplicates.

result = re.findall(r"\w+", zen_of_python)
print(result) # Output: list of all words, including duplicates
## ['The', 'Zen', 'of', 'Python', 'by', 'Tim', 'Peters', 'Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex', 'Complex', 'is', 'better', 'than', 'complicated', 'Flat', 'is', 'better', 'than', 'nested', 'Sparse', 'is', 'better', 'than', 'dense', 'Readability', 'counts', 'Special', 'cases', 'aren', 't', 'special', 'enough', 'to', 'break', 'the', 'rules', 'Although', 'practicality', 'beats', 'purity', 'Errors', 'should', 'never', 'pass', 'silently', 'Unless', 'explicitly', 'silenced', 'In', 'the', 'face', 'of', 'ambiguity', 'refuse', 'the', 'temptation', 'to', 'guess', 'There', 'should', 'be', 'one', 'and', 'preferably', 'only', 'one', 'obvious', 'way', 'to', 'do', 'it', 'Although', 'that', 'way', 'may', 'not', 'be', 'obvious', 'at', 'first', 'unless', 'you', 're', 'Dutch', 'Now', 'is', 'better', 'than', 'never', 'Although', 'never', 'is', 'often', 'better', 'than', 'right', 'now', 'If', 'the', 'implementation', 'is', 'hard', 'to', 'explain', 'it', 's', 'a', 'bad', 'idea', 'If', 'the', 'implementation', 'is', 'easy', 'to', 'explain', 'it', 'may', 'be', 'a', 'good', 'idea', 'Namespaces', 'are', 'one', 'honking', 'great', 'idea', 'let', 's', 'do', 'more', 'of', 'those']
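As the output above shows, the list returned by findall keeps repeated matches. If unique words are wanted, they have to be removed separately. One possible sketch (the dict.fromkeys idiom and the lowercasing are choices made for this example, not part of re):

```python
import re

line = "Now is better than never. Although never is often better than right now."

words = re.findall(r"\w+", line)
print(len(words))
# 13

# dict.fromkeys keeps the first occurrence of each key, in order
unique_words = list(dict.fromkeys(w.lower() for w in words))
print(unique_words)
# ['now', 'is', 'better', 'than', 'never', 'although', 'often', 'right']
```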


re.finditer(pattern, string): Similar to findall, but returns an iterator of match objects for better memory efficiency when dealing with large texts.

matches = re.finditer(r"\w+", zen_of_python)

for match in matches:
    print(match.group()) 
## The
## Zen
## of
## Python
## by
## Tim
## Peters
## Beautiful
## is
## better
## than
## ugly
## Explicit
## is
## better
## than
## implicit
## Simple
## is
## better
## than
## complex
## Complex
## is
## better
## than
## complicated
## Flat
## is
## better
## than
## nested
## Sparse
## is
## better
## than
## dense
## Readability
## counts
## Special
## cases
## aren
## t
## special
## enough
## to
## break
## the
## rules
## Although
## practicality
## beats
## purity
## Errors
## should
## never
## pass
## silently
## Unless
## explicitly
## silenced
## In
## the
## face
## of
## ambiguity
## refuse
## the
## temptation
## to
## guess
## There
## should
## be
## one
## and
## preferably
## only
## one
## obvious
## way
## to
## do
## it
## Although
## that
## way
## may
## not
## be
## obvious
## at
## first
## unless
## you
## re
## Dutch
## Now
## is
## better
## than
## never
## Although
## never
## is
## often
## better
## than
## right
## now
## If
## the
## implementation
## is
## hard
## to
## explain
## it
## s
## a
## bad
## idea
## If
## the
## implementation
## is
## easy
## to
## explain
## it
## may
## be
## a
## good
## idea
## Namespaces
## are
## one
## honking
## great
## idea
## let
## s
## do
## more
## of
## those
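Since finditer yields match objects rather than plain strings, each result also carries its position in the text. A small sketch on two lines of the text (the variable names are made up for this example):

```python
import re

two_lines = "Flat is better than nested.\nSparse is better than dense."

# Each match object knows where it was found
spans = [(m.start(), m.end()) for m in re.finditer(r"better", two_lines)]
print(spans)
# [(8, 14), (38, 44)]

# The iterator is lazy: next() fetches only the first match
first = next(re.finditer(r"\w+", two_lines))
print(first.group(), first.span())
# Flat (0, 4)
```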

  1. re is a module in Python's standard library for working with RegEx.

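  2. A simpler, regex-free way to achieve roughly the same goal would be is_valid_email_simple_lambda = lambda email: '@' in email and '.' in email and email.rindex('.') > email.index('@'). Using rindex finds the last dot, so an address with a dot before the @ (such as first.last@example.com) is still accepted.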
