Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text with complex patterns of characters.
In Python, all functionality related to regular expression is in re
module.
re.compile(pattern[, flags])
: compiles a RegEx into an object. It is recommended to use raw Python strings r'someString'
for pattern
because it doesn’t require escaping. It can contain an optional prefix ?P<PATTERNNAME>
to assign a name to the pattern.
There are some useful flags to customize the behavior of various methods.
re.IGNORECASE
orre.I
enables case-insensitive text operations;re.M
orre.MULTILINE
enables multi-line environment so that^
matches the beginning of the string and the beginning of the newline, and$
matches the end of the string and the end of each line.re.A
orre.ASCII
Make\w
,\W
,\b
,\B
,\d
,\D
,\s
and\S
perform ASCII-only matching instead of full Unicode matching defaulted in Python 3.re.S
orre.DOTALL
let.
match all characters including newline.re.X
orre.VERBOSE
enables verbose regular expressions. In verbose regular expression, whitespace (spaces, tabs, carriage returns) and comments (everything follows a#
) are ignored. To escape a whitespace, a\
is required. We can write verbose regular expression as a multiline string, and comment every part of it in a separated line. For example,a
andb
are functionally the same:a = re.compile(r"""\d + # the integral part \. # the decimal point \d * # some fractional digits """, re.X) b = re.compile(r"\d+\.\d*")
Parentheses in pattern string define groups, which can be extracted by group(id)
later. For example, ^(\d{3})-(\d{3,8})$
defines two groups. group(0)
is the whole match, and group(1)
and group(2)
are the two matching groups.
By default, regex match is greedy, that is, it will match as many characters as possible. By adding a ?
to a pattern`, it becomes non-greedy, i.e. match as few characters as possible.
Functions for Regex
There are two ways to use functions in re
module:
re.match(pattern, string[, flags])
RegexObject.match(string,[ pos[, endpos]])
, andRegexObject
is precompiled withre.compile
.
If a match is found, a match object is returned; otherwise, None
is returned. Match object is always evaluated to True
, so we can do this:
if re.search(regexp, exr):
doSomething()
match
findspattern
from the beginning of the string.RegexObject.match(string[, pos[, endpos]])
will match frompos
instead of the beginning.fullmatch
performs a match on the whole string.search
: finds the first location ofpattern
in a string.split
splits string by the occurrences of pattern and return a list. It takes an optionalmaxsplit
parameter, default to 0. It is more flexible thanstr.split
method.findall
returns a list of all the substrings that matched the pattern. Note: it doesn’t return overlapping matches.finditer
returns an iterator of all matching substrings, instead of a list.sub
: replaces ALL occurrences of pattern in string with replacement. If no occurrences are found, it acts like a no-op.replacement
can be a function that takes a single match object and returns the replacement string.subn
returns the number of matches too.
Matching Object
group(id)
returns one or more subgroups of the match. By default,0
is the whole match,1
,2
, … are matching groups defined by parentheses. If?P<name>...
syntax (to define group name) is used,name
can be passed in togroup
to get the match. It also takesgroup(id1, id2, ...)
to return a tuple of matching groups.groups
returns a tuple containing all the subgroups of the match, from 1 up. it takes adefault
parameter to define unfound match value (default toNone
).groupdict
returns a dictionary containing all the named (?P<name>...
) subgroups of the match, keyed by the subgroup name.start
andend
returns the indices of the start or end of the match. It takes the optional group ID.span
returns the indices of the start and end of the match in a tuple.