Python Regular Expressions

In this post, we present a comprehensive article about regular expressions in Python.

1. What is a Regular Expression

If you have ever searched for a file whose name you didn’t exactly remember, you might have used one or more special (or wildcard) characters to stand in for the characters you couldn’t remember. For example, to search for all .txt files that start with b, you may have typed b*.txt in a file manager (Windows Explorer, Nautilus or Finder) or in a Linux shell or DOS window. The expression b*.txt is an example of pattern matching, more specifically called globbing, and it matches the file names b.txt, ba.txt, bla.txt etc.

Common globbing characters:

  • ? matches any single character
  • * matches any number of characters
  • [...] matches any single character in the set, e.g. [a-c] matches only characters a, b and c.
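
The three wildcards above can be tried without touching the file system. The following is a minimal sketch that assumes Python’s standard fnmatch module, which applies shell-style matching to plain strings; the file names and the glob_filter helper are made up for illustration:

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only for illustration
names = ["b.txt", "ba.txt", "bla.txt", "a.txt", "b.csv", "c1.txt"]

def glob_filter(pattern, candidates):
    """Return the candidates that match a shell-style (globbing) pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]

star = glob_filter("b*.txt", names)         # * matches any number of characters
question = glob_filter("b?.txt", names)     # ? matches exactly one character
char_set = glob_filter("[a-c].txt", names)  # [...] matches one character from the set

print(star)      # ['b.txt', 'ba.txt', 'bla.txt']
print(question)  # ['ba.txt']
print(char_set)  # ['b.txt', 'a.txt']
```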

Let’s see some definitions from Wikipedia:

So, in the above example with the file search, b*.txt is a search pattern that is applied to the file names of a directory, for example, and matches the file names b.txt, ba.txt and bla.txt.

In this article we will learn about the search patterns, or regular expressions, that are supported by Python, as well as which functions to call in order to use regexes in your Python programs.

2. Introduction to Regular Expressions in Python

Imagine that you have the task of searching for email addresses in a text, e.g.

>>> text = 'You may reach us at this email address: java.info@javacodegeeks.com. Opening hours: @9am-@17pm'

How would you search for the email address in the above text? Without regular expressions, you might start with:

index = text.find('@')

This finds the index of @; you then need to locate the other parts of the email address (left as an exercise to the reader; please don’t use regular expressions). Using regular expressions you would type something like:

>>> import re
>>> ans = re.search('@', text)

Much simpler, isn’t it? In the following we shall see how we can refine our regular expression string inside the search() method in order to be able to correctly identify the email address.

As another example, let’s see how we could search for the pattern b*.txt in our current folder:

>>> import glob
>>> print(glob.glob('b*.txt'))
['b.txt', 'ba.txt', 'bla.txt']

Not that difficult, was it?

2.1 Python Regular Expression methods

The re module provides a number of pattern matching functions:

Method                             Explanation
match(regex, string)               finds the regex at the beginning of the string and returns a Match object (with start() and end() methods to retrieve the indices), or None if there is no match
search(regex, string)              finds the regex anywhere in the string and returns a Match object (with start() and end() methods to retrieve the indices), or None if there is no match
fullmatch(regex, string)           returns a Match object if the regex matches the string entirely, otherwise it returns None
findall(regex, string)             finds all occurrences of the regex in the string and returns a list of matching strings (or of the captured groups, if the regex contains groups)
finditer(regex, string)            returns an iterator to loop over the regex matches in the string, e.g. for m in finditer(regex, string) where m is of type Match
sub(regex, replacement, string)    replaces all matches of regex in string with replacement and returns a new string with the replacements
split(regex, string)               returns a list of strings that contains the parts of string between all the regex matches in the string
compile(regex)                     compiles the regex into a regular expression object

Similarly, the glob module provides the pattern matching function:

  • glob(pattern) finds the files that satisfy the (globbing) pattern

In the rest of this article we shall focus on regular expressions (not on globbing).

You may test the regular expressions in this article either in your Python environment (e.g. the interactive interpreter) or on one of the online regex testing sites (pick whichever suits your needs or taste).

The simplest regular expression is just a text string without any special characters, i.e. a literal. Literals are the simplest form of pattern matching in regular expressions. They simply succeed whenever that literal is found in the text you search in. E.g. let’s search for the pattern email in the text defined above using the Python API:

>>> regex = 'email'
>>> re.match(regex, text)
>>> re.search(regex, text)
<_sre.SRE_Match object; span=(25, 30), match='email'>
>>> re.findall(regex, text)
['email']

Do you understand the results? re.match() tries to find the pattern at the beginning of text, but text doesn’t start with the word email, so it returns None (which the interpreter doesn’t print). re.search() tries to find the pattern anywhere in text and returns a Match object saying that the pattern was found starting at position 25 and ending just before position 30 of text (numbering starts from 0). Finally, re.findall() returns a list of matching strings; 'email' was found only once. After the above explanation, the following result should now be self-explanatory:

>>> re.findall('@', text)
['@', '@', '@']

Regular expressions can be used to substitute part of a string:

>>> regex = '@'
>>> replacement = '#'
>>> re.sub(regex, replacement, text)
'You may reach us at this email address: java.info#javacodegeeks.com. Opening hours: #9am-#17pm'

Note that text is not altered.

To improve performance, e.g. when we reuse the regular expression in a number of matches, we can compile the regex into a regular expression object:

>>> regex = re.compile('@')
>>> regex.findall(text)
['@', '@', '@']
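
To illustrate the reuse mentioned above, here is a minimal sketch in which one compiled object is applied to several strings. The sample lines and the variable names are assumptions made for this example:

```python
import re

# Compile the pattern once; the resulting object offers the same
# operations as the module-level functions (search, findall, sub, ...).
at_sign = re.compile('@')

# Hypothetical sample lines, used only for illustration
lines = [
    'java.info@javacodegeeks.com',
    'Opening hours: @9am-@17pm',
    'no at sign here',
]

# Reuse the compiled object for every line
counts = [len(at_sign.findall(line)) for line in lines]
replaced = [at_sign.sub('#', line) for line in lines]

print(counts)    # [1, 2, 0]
print(replaced)
```

The re module also keeps an internal cache of recently used patterns, so the gain from compiling by hand tends to be modest; giving the pattern a descriptive name is often the bigger benefit.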

2.2 The Match object

The Match object provides a number of methods:

  • group() returns the part of the string matched by the entire regular expression
  • start() returns the offset in the string of the start of the match (begins from 0)
  • end() returns the offset of the character after the end of the match
  • span() returns a 2-tuple of start() and end()

>>> regex = 'email'
>>> match = re.search(regex, text)
>>> match.group()
'email'
>>> match.start()
25
>>> match.end()
30
>>> match.span()
(25, 30)

A common mistake is the following:

>>> match = re.search("python", text)
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

For that reason, create a helper function like this:

def findMatch(regex, text):
    # re.search() returns None when there is no match, so check before using it
    match = re.search(regex, text)
    if match:
        print(match.group())
    else:
        print("Pattern not found!")

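re.finditer(regex, text) is often more convenient than re.findall(regex, text), and lighter on memory for large inputs: instead of building a list of strings, it returns an iterator that yields one Match object at a time as you loop over the regex matches in the text: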

>>> for m in re.finditer(regex, text):
...    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
... 
25-30: email

The for-loop variable m is a Match object with the details of the current match.

3. Meta-characters

The following characters have special meanings in regular expressions:

Meta-character    Meaning
.                 Any single character (except a newline, by default)
[ ], [^ ]         Any single character in the (character) set, or not (^) in the set (order doesn’t matter)
?                 Quantifier: optional, i.e. zero or one of the preceding regular expression
*                 Quantifier: zero or more of the preceding regular expression
+                 Quantifier: one or more of the preceding regular expression
|                 Or
^                 Anchors the pattern to the beginning of the string (or of each line in multiline mode)
$                 Anchors the pattern to the end of the string (or of each line in multiline mode)
( )               Groups characters
{ }               Quantifier: number of times the preceding regular expression is repeated.
                  {n} means exactly n times
                  {n,m} means between n and m times (inclusive); do not put a space after the comma
                  {n,} means at least n times; {,m} means at most m times
\                 Escapes a meta-character, i.e. the character that follows it is treated as a literal and not as a meta-character.

So if we wish to search for the above meta-characters themselves, we need to escape them by using the escape meta-character (\). The backslash also introduces the escapes for non-printable characters shown in the following table.

Non-printable character    Meaning
\n                         Newline
\r                         Carriage return
\x1b                       Escape (Python’s re module does not support the \e shorthand)
\t                         Tab
\v                         Vertical tab
\f                         Form feed
\uXXXX                     Unicode character, e.g. \u20AC represents the euro sign (€)

Let’s see some examples:

The dot (.) matches a single character, except line break characters.

>>> regex = "gr.y"
>>> text = "gr gry grey gray gryy grrrr graaaay gr%y"
>>> re.findall(regex, text)
['grey', 'gray', 'gryy', 'gr%y']

If we wanted to match any three characters followed by y (note that a space counts as a character):

>>> regex = "...y"
>>> re.findall(regex, text)
[' gry', 'grey', 'gray', ' gry', 'aaay', 'gr%y']
>>> regex = "gr.*y"
>>> re.findall(regex, text)
['gr gry grey gray gryy grrrr graaaay gr%y']
>>> regex = "gr.+y"
>>> re.findall(regex, text)
['gr gry grey gray gryy grrrr graaaay gr%y']
>>> regex = "gr.?y"
>>> re.findall(regex, text)
['gry', 'grey', 'gray', 'gryy', 'gr%y']

.* means any single character zero or more times, .+ means any single character one or more times and .? means any single character zero or one time. You might be surprised by the results of "gr.*y" and "gr.+y" but they are correct; they simply match the whole text because it indeed starts with gr and ends with y (the one of the word gr%y), containing all the other characters between them.

These quantifiers are called greedy, because they try to match as much as possible to have the biggest match result possible. You can convert them to non-greedy by adding an extra question mark to the quantifier; for example, ??, *? or +?. A quantifier marked as non-greedy or reluctant tries to have the smallest match possible.

>>> regex = "gr.*?y"
>>> re.findall(regex, text)
['gr gry', 'grey', 'gray', 'gry', 'grrrr graaaay', 'gr%y']
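
The { } quantifier from the meta-character table is not exercised by the examples above. The following is a small sketch on the same text; the variable names are chosen for this illustration only:

```python
import re

text = "gr gry grey gray gryy grrrr graaaay gr%y"

exactly_four = re.findall(r"gra{4}y", text)    # exactly four a's
one_to_four = re.findall(r"gra{1,4}y", text)   # between one and four a's
at_least_two = re.findall(r"gr{2,}", text)     # g followed by two or more r's
at_most_two = re.findall(r"gra{,2}y", text)    # zero, one or two a's

print(exactly_four)   # ['graaaay']
print(one_to_four)    # ['gray', 'graaaay']
print(at_least_two)   # ['grrrr']
print(at_most_two)    # ['gry', 'gray', 'gry']
```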

If we need to match one or more of the meta-characters in our text, then the escape meta-character shows its use:

>>> text="How is life?"
>>> regex=r"life\?"
>>> findMatch(regex, text)
life?

You can also split a string:

>>> text = "gr, gry, grey, gray, gryy, grrrr, graaaay, gr%y"
>>> regex=", "
>>> re.split(regex, text)
['gr', 'gry', 'grey', 'gray', 'gryy', 'grrrr', 'graaaay', 'gr%y']
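
The separator passed to re.split() does not have to be a literal. As a hedged sketch, the following splits on a comma or a semicolon surrounded by optional whitespace; it uses \s and a character set, both of which are covered in the next section, and the input string is invented for the example:

```python
import re

# Hypothetical input with inconsistent separators
messy = "gr,gry ;  grey,   gray;gryy"

# A comma or semicolon, with optional whitespace on either side
separator = r"\s*[,;]\s*"

parts = re.split(separator, messy)
first_only = re.split(separator, messy, maxsplit=1)  # stop after the first split

print(parts)       # ['gr', 'gry', 'grey', 'gray', 'gryy']
print(first_only)  # ['gr', 'gry ;  grey,   gray;gryy']
```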

4. Character sets or classes

But what if we want to find only the correctly spelled words, i.e. only grey and gray? Character sets (or character classes) to the rescue:

>>> regex = "gr[ae]y"
>>> re.findall(regex, text)
['grey', 'gray']

A character set or class matches only one out of several characters. The order of the characters inside a character class does not matter. A hyphen (-) inside a character class specifies a range of characters. E.g. [0-9] matches a single digit between 0 and 9. You can use more than one range, e.g. [A-Za-z] matches a single letter, either uppercase or lowercase.

How would you match the email address in the first text (repeated here)?

>>> text = 'You may reach us at this email address: java.info@javacodegeeks.com. Opening hours: @9am-@17pm'

Using character classes it shouldn’t be that difficult. In our example, the part of the email address before the @ consists of uppercase or lowercase letters, i.e. [A-Za-z], and maybe a dot (.), i.e. [A-Za-z.] (no need to escape the dot inside a character set). But this matches only one character in the set. To match one or more of these characters we need (you guessed right) [A-Za-z.]+. So you end up with:

>>> regex = r"[A-Za-z.]+@[A-Za-z]+\.[A-Za-z]{2,3}"
>>> re.findall(regex, text)
['java.info@javacodegeeks.com']

Element      Explanation
[A-Za-z.]    matches any Latin letter or the dot
+            one or more times
@            matches the character @
[A-Za-z]     matches any Latin letter
+            one or more times
\.           matches the dot
[A-Za-z]     matches any Latin letter
{2,3}        2 or 3 times

Congratulations! You wrote your first actual regular expression. Note that you need to escape the dot outside a character class. To restrict the match to a specific number of characters use { }. {2,3} means at least 2 and at most 3 characters, as the last part of an email address is usually 2 or 3 characters long, e.g. eu or com. (Of course, nowadays there are many other domains, e.g. info or ac.uk, but this is left as an exercise to the reader.)

We matched the dot in the domain by escaping it:

>>> regex = r"[A-Za-z.]+@[A-Za-z]+\.[A-Za-z]{2,3}"

If we didn’t, it would still work:

>>> regex = r"[A-Za-z.]+@[A-Za-z]+.[A-Za-z]{2,3}"
>>> re.findall(regex, text)
['java.info@javacodegeeks.com']

but

>>> email="java.info@javacodegeeks~com"
>>> re.findall(regex, email)
['java.info@javacodegeeks~com']

because dot (.) matches any character.

If you need to have characters not in the character set, then use the caret ^ as in [^A-Za-z] which means any character that is not a letter.
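
As a brief sketch of ranges and negation side by side (the sample text and variable names are assumptions for this example):

```python
import re

# Hypothetical sample text
text = "Room 42, floor 3B"

digits = re.findall(r"[0-9]", text)             # single digits
letters = re.findall(r"[A-Za-z]+", text)        # runs of letters
non_letters = re.findall(r"[^A-Za-z]+", text)   # runs of anything that is not a letter

print(digits)       # ['4', '2', '3']
print(letters)      # ['Room', 'floor', 'B']
print(non_letters)  # [' 42, ', ' 3']
```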

Since certain character classes are used often, a series of shorthand character classes are available:

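Shorthand    Character set                        Matches
\d (\D)      [0-9] ([^0-9])                       digit (non-digit)
\w (\W)      [A-Za-z0-9_] ([^A-Za-z0-9_])         word character (non-word character)
\s (\S)      [ \t\r\n\f\v] ([^ \t\r\n\f\v])       whitespace (non-whitespace)
\A                                                beginning of string
\Z                                                end of string
\b (\B)                                           word boundary, i.e. the zero-width position between a word character and a non-word character such as a space, comma, colon or hyphen (non-word boundary)

In Python 3, \d, \w and \s also match the corresponding Unicode characters when the pattern is a str; the character sets shown above are their ASCII-only equivalents.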

So, the previous regex can also be written as:

>>> regex = r"[\w.]+@[\w]+\.[\w]{2,3}"
>>> re.findall(regex, text)
['java.info@javacodegeeks.com']

and to avoid matching ".@javacodegeeks.com":

>>> email = ".@javacodegeeks.com"
>>> regex = r"\w[\w.]+@[\w]+\.[\w]{2,3}"
>>> re.findall(regex, email)
[]

To find the first and last word:

>>> regex = r"^\w+"
>>> findMatch(regex, text)
You
>>> regex = r"\w+$"
>>> findMatch(regex, text)
17pm
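
The shorthand classes are handy on their own, too. The following is a minimal sketch with an invented sample string that extracts numbers and words and collapses runs of whitespace:

```python
import re

# Hypothetical sample text; \t inside the normal string literal is a real tab
text = "Call 555-0199 at 9am,\tor   write."

numbers = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)         # runs of word characters
collapsed = re.sub(r"\s+", " ", text)    # any run of whitespace becomes one space

print(numbers)    # ['555', '0199', '9']
print(words)      # ['Call', '555', '0199', 'at', '9am', 'or', 'write']
print(collapsed)  # Call 555-0199 at 9am, or write.
```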

Let’s take a look at another example:

>>> text = "Hello do you want to play Othello?"
>>> regex = "[Hh]ello"
>>> re.findall(regex, text)
['Hello', 'hello']

Where does the second 'hello' come from? From 'Othello'. How can we tell Python that we wish to match whole words only?

>>> regex = r"\b[Hh]ello\b"
>>> re.findall(regex, text)
['Hello']

Please note that we define regex as a raw string (r"..."); in a normal string, \b is the backspace character, so we would have to type:

>>> regex = "\\b[Hh]ello\\b"

Python 3.4 added the re.fullmatch() function, which returns a Match object only if the regex matches the entire string; otherwise it returns None. re.fullmatch(regex, text) behaves like re.search(r"\A(?:regex)\Z", text). If text is an empty string, then fullmatch() succeeds (returns a Match object) for any regex that can find a zero-length match.
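
A short sketch of the difference between fullmatch() and search(), under the assumption of a simple gr[ae]y pattern; the anchored variant wraps the regex in a non-capturing group (?:...), a construct explained in the Grouping section:

```python
import re

regex = r"gr[ae]y"

whole = re.fullmatch(regex, "grey")          # the entire string matches
partial = re.fullmatch(regex, "grey hair")   # extra text, so no match
found = re.search(regex, "grey hair")        # search() accepts a substring

# Anchored search that behaves like fullmatch() for this pattern
anchored = re.search(r"\A(?:" + regex + r")\Z", "grey hair")

# A regex that can match nothing at all also matches the empty string
empty = re.fullmatch(r"\d*", "")

print(whole is not None)   # True
print(partial is None)     # True
print(found.group())       # grey
print(anchored is None)    # True
print(empty.span())        # (0, 0)
```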

Be careful when using the negated shorthands inside square brackets. E.g. [\D\S] is not the same as [^\d\s]. The latter matches any character that is neither a digit nor whitespace. The former, however, matches any character that is either not a digit, or is not whitespace. Because all digits are not whitespace, and all whitespace characters are not digits, [\D\S] matches any character; digit, whitespace, or otherwise.
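
The difference is easy to verify. The following is a minimal sketch on a made-up sample string:

```python
import re

# Hypothetical sample: letters, digits, a space, a tab and a punctuation mark
sample = "a1 b2\t!"

either = re.findall(r"[\D\S]", sample)    # non-digit OR non-whitespace: everything
neither = re.findall(r"[^\d\s]", sample)  # neither a digit nor whitespace

print(either)   # ['a', '1', ' ', 'b', '2', '\t', '!']
print(neither)  # ['a', 'b', '!']
```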

5. Grouping

Imagine that we wish to match only a number of top-level domains (TLDs) of email addresses:

>>> email = "java.info@javacodegeeks.net"
>>> regex = r"\w[\w.]+@[\w]+\.com|net|org|edu"
>>> re.findall(regex, email)
['net']

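Apparently, the | (or) meta-character doesn’t do what we want here: it has the lowest precedence, so the regex is read as four alternatives (\w[\w.]+@[\w]+\.com, net, org or edu). We need to group the TLDs: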

>>> regex = r"\w[\w.]+@[\w]+\.(com|net|org|edu)"
>>> findMatch(regex, email)
java.info@javacodegeeks.net

This can also be useful if we wish to match the name and the domain of the email address, e.g.

>>> regex = r"(\w[\w.]+)@([\w]+\.[\w]{2,3})"
>>> match = re.search(regex, email)
>>> match
<_sre.SRE_Match object; span=(0, 27), match='java.info@javacodegeeks.net'>
>>> match.group()
'java.info@javacodegeeks.net'
>>> match.groups()
('java.info', 'javacodegeeks.net')
>>> match.group(1)
'java.info'
>>> match.group(2)
'javacodegeeks.net'

match.group() or match.group(0) returns the whole match. To push the example a bit further (nested groups):

>>> regex = r"(\w[\w.]+)@([\w]+\.(com|net|org|edu))"
>>> match = re.search(regex, email)
>>> match
<_sre.SRE_Match object; span=(0, 27), match='java.info@javacodegeeks.net'>
>>> match.group()
'java.info@javacodegeeks.net'
>>> match.groups()
('java.info', 'javacodegeeks.net', 'net')
>>> match.group(1)
'java.info'
>>> match.group(2)
'javacodegeeks.net'
>>> match.group(3)
'net'

If you pay attention to the parentheses groups, you will see 3 groups:

(\w[\w.]+)
([\w]+\.(com|net|org|edu))
(com|net|org|edu)

The results of match.group() should now be obvious.

If you do not need the group to capture its match, use a non-capturing group with the syntax (?:regex). For example, if we don’t wish to capture the TLD as a separate group:

>>> regex = r"(\w[\w.]+)@([\w]+\.(?:com|net|org|edu))"
>>> match = re.search(regex, email)
>>> match
<_sre.SRE_Match object; span=(0, 27), match='java.info@javacodegeeks.net'>
>>> match.group()
'java.info@javacodegeeks.net'
>>> match.groups()
('java.info', 'javacodegeeks.net')
>>> match.group(1)
'java.info'
>>> match.group(2)
'javacodegeeks.net'
>>> match.group(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

Python was the first programming language to introduce named capturing groups. The syntax (?P<name>regex) captures the match of regex into the group called name. name must be a valid Python identifier (letters, digits and underscores, not starting with a digit). You can retrieve the contents of the group with match.group("name"), and reference it in the replacement string of re.sub() with \g<name>.

>>> regex = r"(\w[\w.]+)@([\w]+\.(?P<TLD>com|net|org|edu))"
>>> match = re.search(regex, email)
>>> match.group("TLD")
'net'

Python does not allow multiple groups to use the same name. Doing so will give a regex compilation error.
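
The following sketch pulls the points about named groups together. The group names user, domain and tld are chosen for this illustration and are not part of the original example:

```python
import re

email = "java.info@javacodegeeks.net"

# Group names (user, domain, tld) are hypothetical, picked for this sketch
regex = r"(?P<user>\w[\w.]+)@(?P<domain>\w+\.(?P<tld>com|net|org|edu))"

match = re.search(regex, email)
parts = match.groupdict()   # all named groups as a dictionary

# \g<name> refers to a named group inside a replacement string
swapped = re.sub(regex, r"\g<domain> (\g<user>)", email)

# Reusing a group name is rejected when the pattern is compiled
try:
    re.compile(r"(?P<tld>com)|(?P<tld>net)")
    duplicate_error = None
except re.error as exc:
    duplicate_error = exc

print(parts)    # {'user': 'java.info', 'domain': 'javacodegeeks.net', 'tld': 'net'}
print(swapped)  # javacodegeeks.net (java.info)
print(duplicate_error is not None)  # True
```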

As an exercise, write the regular expression of an address in the Netherlands, e.g.

text = 'George Maduroplein 1, 2584 RZ, The Hague, The Netherlands'

Make sure that you can return independent matches of the street and house number, the zip code, the city or the country.

5.1 Backreferences

Backreferences match the same text as previously matched by a capturing group. Perhaps the best known example is the regex to find duplicated words.

>>> text = "hello hello world"
>>> regex = r"(\w+) \1" 
>>> re.findall(regex, text) 
['hello']

In the above example we’re capturing a group made up of one or more alphanumeric characters, after which the pattern tries to match a whitespace, and finally we have the \1 backreference, meaning that it must match exactly the same thing as the first group (\w+). Also, note the use of raw strings to avoid typing

>>> regex = "(\\w+) \\1"

Backreferences can be used with the first 99 groups. Named groups, which we saw earlier, can help reduce the complexity when the regular expression contains many groups. To backreference a named group use the syntax (?P=name):

>>> regex = r"(?P<word>\w+) (?P=word)"
>>> re.findall(regex, text)
['hello']
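
Backreferences also work in the replacement string of re.sub(). As a hedged sketch, the following removes doubled words; the \b anchors are an addition made here so that a match cannot start or end in the middle of a word (without them, "this is" would yield the false match "is is"), and the sample text is invented:

```python
import re

# Hypothetical sample text with two doubled words
text = "hello hello world, this is is a test"

# \b keeps the match from starting or ending inside a longer word
regex = r"\b(\w+) \1\b"

duplicates = re.findall(regex, text)   # the words that are repeated
deduped = re.sub(regex, r"\1", text)   # keep a single copy of each

print(duplicates)  # ['hello', 'is']
print(deduped)     # hello world, this is a test
```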

6. Matching modes

search(regex, string, modes) and match(regex, string, modes) accept an optional third parameter, the matching modes (also called flags).

Matching mode            Grouping letter    Explanation
re.I or re.IGNORECASE    i                  ignores case
re.S or re.DOTALL        s                  makes the dot (.) match newlines
re.M or re.MULTILINE     m                  makes ^ and $ match after and before line breaks
re.L or re.LOCALE        L                  makes \w match all characters that are considered letters given the current locale settings (bytes patterns only in Python 3)
re.U or re.UNICODE       u                  treats all letters from all scripts as word characters (the default for str patterns in Python 3)

>>> text = "gry Grey grey gray gryy grrrr graaaay"
>>> regex="gr[ae]y"
>>> re.search(regex, text, re.I)
<_sre.SRE_Match object; span=(4, 8), match='Grey'>

Use the bitwise OR operator (|) to specify more than one matching mode, e.g. re.I | re.M.

Or you can use the grouping letter mentioned in the above table:

>>> text = "gry Grey grey gray gryy grrrr graaaay"
>>> regex=r"(?i)gr[ae]y"
>>> re.search(regex, text)
<_sre.SRE_Match object; span=(4, 8), match='Grey'>
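
A minimal sketch of combining two modes, first with the | operator and then with the letters inside the pattern; the multi-line sample text is an assumption for this example:

```python
import re

# Hypothetical three-line sample
text = "Grey skies\ngray clouds\nGRAY seas"

# Ignore case AND let ^ match at the start of every line
line_starts = re.findall(r"^gr[ae]y", text, re.I | re.M)

# With re.I alone, ^ only matches at the very start of the string
only_first = re.findall(r"^gr[ae]y", text, re.I)

# The same two modes given inside the pattern
inline = re.findall(r"(?im)^gr[ae]y", text)

print(line_starts)  # ['Grey', 'gray', 'GRAY']
print(only_first)   # ['Grey']
print(inline)       # ['Grey', 'gray', 'GRAY']
```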

7. Unicode

Since version 3.3, Python provides good support for Unicode regex pattern matching. As mentioned above, a Unicode character can be written with the \uXXXX syntax (XXXX being four hexadecimal digits). For example, to match one or more digits followed by the euro sign (€):

>>> text = 'This item costs 33€.'
>>> regex = r"\d+\u20AC"
>>> re.findall(regex, text)
['33€']
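
As a small sketch of how Unicode affects matching in Python 3: the escape and the literal character are interchangeable in a pattern, and \w covers non-ASCII letters unless the re.ASCII flag is given. The second sample string is an assumption, written with escapes so that the accented letters are unambiguous:

```python
import re

text = 'This item costs 33\u20AC.'

escaped = re.findall(r"\d+\u20AC", text)  # the escape is interpreted by the re module
literal = re.findall("\\d+€", text)       # the character typed directly

# Hypothetical sample: "naive cafe" with an i-diaeresis and an e-acute
sample = "na\u00efve caf\u00e9"

words = re.findall(r"\w+", sample)                 # Unicode-aware by default
ascii_words = re.findall(r"\w+", sample, re.ASCII)  # restricted to [A-Za-z0-9_]

print(escaped == literal)  # True
print(words)
print(ascii_words)         # ['na', 've', 'caf']
```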

8. Summary

In this tutorial we provided an overview of regular expressions and saw how to use them in Python, which provides the re module for this job. We saw plenty of examples that you can use in your real-life projects. I hope that after this tutorial you will be a bit less scared of regexes. This article is by no means exhaustive. The interested reader should look at the references for more in-depth knowledge of regular expressions. To quiz yourself, what is the difference between the characters [], () and {} in regular expressions?

9. References

  1. https://www.regular-expressions.info/
  2. Friedl J.E.F. (2006), Mastering Regular Expressions, 3rd Ed., O’Reilly.
  3. Krasnov A. (2017), “Python Regular Expression Tutorial”, WebCodeGeeks.
  4. Lopez F. & Romero V. (2014), Mastering Python Regular Expressions, Packt.

10. Download the source code

You can download the full source code of this article here: Python Regular Expressions

Ioannis Kostaras

Software architect awarded the 2012 Duke's Choice Community Choice Award and co-organizing the hottest Java conference on earth, JCrete.