Escaping Special Characters in Python Regular Expressions
Let's say we want to make a special character represent itself. To do this, we need to escape it with a backslash. Let's look at some examples.
Example
In the following example, the author of the regular expression wanted the search pattern to be: the letter 'a', then a plus '+', then the letter 'x'. However, the author did not escape the '+' character, so the pattern actually means: the letter 'a' one or more times, then the letter 'x':
import re
txt = 'a+x ax aax aaax'
res = re.sub('a+x', '!', txt)
print(res)
Result of code execution:
'a+x ! ! !'
Example
Now the author has escaped the plus with a backslash, so the search pattern looks as it should: the letter 'a', then a plus '+', then the letter 'x':
import re
txt = 'a+x ax aax aaax'
res = re.sub(r'a\+x', '!', txt)
print(res)
Result of code execution:
'! ax aax aaax'
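Because the escape relies on a backslash, it can help to see how Python string literals treat that character before the regex engine ever sees it. The sketch below is an illustration, not part of the original example: it assumes the common convention of writing patterns as raw strings (the r prefix), which keeps every backslash exactly as typed.

```python
import re

# In a raw string the backslash is kept as-is: 'a', '\', '+', 'x'.
pattern = r'a\+x'
print(len(pattern))  # 4

# The same pattern spelled with a doubled backslash in a normal string.
same_pattern = (pattern == 'a\\+x')
print(same_pattern)  # True

# In a normal string '\n' is a single newline character, while r'\n'
# is a backslash followed by the letter 'n'.
print(len('\n'), len(r'\n'))  # 1 2

txt = 'a+x ax aax aaax'
res = re.sub(pattern, '!', txt)
print(res)  # ! ax aax aaax
```

With a raw string, what you type is what the regex engine receives, so the only escaping left to think about is the regex-level one.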
Example
In this example, the pattern looks like this: the letter 'a', then a dot '.', then the letter 'x':
import re
txt = 'a.x abx azx'
res = re.sub(r'a\.x', '!', txt)
print(res)
Result of code execution:
'! abx azx'
Example
In the following example, the author forgot to escape the dot, so every substring matched the regular expression, since an unescaped dot denotes any character:
import re
txt = 'a.x abx azx'
res = re.sub('a.x', '!', txt)
print(res)
Result of code execution:
'! ! !'
Comment
Note that if you forget the backslash before a dot that is meant to be a literal dot, you might not even notice:
res = re.sub('a.x', '!', 'a.x')
print(res)  # prints '!', just as we wanted
At first glance it works correctly (since the dot represents any character, including a literal dot '.'). But if we change the string in which the substitution is made, the error shows up:
res = re.sub('a.x', '!', 'a.x abx azx')
print(res)  # prints '! ! !', but '! abx azx' was expected
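One way to catch this kind of mistake early is to try the pattern against strings it should not match, not only against the one it should. The following is a minimal sketch of that check; the list name samples and the use of re.fullmatch are choices made for this illustration.

```python
import re

# One string that should match and two that should not.
samples = ['a.x', 'abx', 'azx']

# Unescaped dot: matches any character between 'a' and 'x'.
unescaped = [s for s in samples if re.fullmatch('a.x', s)]

# Escaped dot: matches only a literal '.' between 'a' and 'x'.
escaped = [s for s in samples if re.fullmatch(r'a\.x', s)]

print(unescaped)  # ['a.x', 'abx', 'azx']
print(escaped)    # ['a.x']
```

The unescaped pattern accepts all three strings, which makes the missing backslash visible immediately.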
List of special and ordinary characters
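If you escape an ordinary punctuation character, nothing bad will happen: it will still denote itself. The exceptions are digits and ASCII letters. A backslash followed by a digit is read as a backreference to a group, and a backslash followed by a letter is either a special sequence (for example, \d for a digit) or an error.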
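The behaviour of escaped ordinary characters can be checked directly. The sketch below is only an illustration under the assumption of a current Python 3 interpreter; the variable names are made up for the example. It also shows that in Python's re module a backslash before an ASCII letter or a digit is not a plain escape.

```python
import re

# Escaping ordinary punctuation is harmless: '\@' still matches '@'.
ordinary = "@:,'\"-_=<>%#~`&!"
still_literal = all(re.fullmatch('\\' + ch, ch) for ch in ordinary)
print(still_literal)  # True

# A backslash before a letter is different: '\d' means "any digit" ...
digits = re.findall(r'\d', 'a1b2')
print(digits)  # ['1', '2']

# ... and a letter with no assigned meaning is rejected with an error.
try:
    re.compile(r'\q')
    letter_ok = True
except re.error:
    letter_ok = False
print(letter_ok)  # False

# A backslash before a digit is read as a backreference to a group.
repeated = re.findall(r'(a)\1', 'aa ab')
print(repeated)  # ['a']
```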
There is often doubt about whether a given character is special. Some people go so far as to escape every suspicious character. However, this is bad practice, because it clutters the regular expression with backslashes.
Special characters: $ ^ . * + ? \ {} [] () |
Not special characters: @ : , ' " - _ = < > % # ~ ` & ! /
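When the text to search for is known only as a literal string, the standard re module also offers re.escape(), which adds the backslashes for you. This function is not discussed in the text above; the sketch below simply applies it to the strings from the earlier examples.

```python
import re

# re.escape() returns the text with regex-special characters backslashed.
literal = 'a+x'
pattern = re.escape(literal)
print(pattern)  # a\+x

txt = 'a+x ax aax aaax'
res = re.sub(pattern, '!', txt)
print(res)  # ! ax aax aaax

dot_pattern = re.escape('a.x')
print(dot_pattern)  # a\.x
```

This is handy when the literal text comes from a variable, where escaping by hand is not possible.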
Practical tasks
Given a string:
txt = 'a.a aba aea'
Write a regular expression that will find the string 'a.a' without matching the rest.
Given a string:
txt = '2+3 223 2223'
Write a regular expression that will find the string '2+3' without matching the rest.
Given a string:
txt = '23 2+3 2++3 2+++3 345 567'
Write a regular expression that will find the strings '2+3', '2++3', '2+++3' without matching the rest (there can be any number of '+' characters).
Given a string:
txt = '23 2+3 2++3 2+++3 445 677'
Write a regular expression that will find the strings '23', '2+3', '2++3', '2+++3' without matching the rest.
Given a string:
txt = '*+ *q+ *qq+ *qqq+ *qqq qqq+'
Write a regular expression that will find the strings '*q+', '*qq+', '*qqq+' without matching the rest.
Given a string:
txt = '[abc] {abc} abc (abc) [abc]'
Write a regular expression that will find the strings in square brackets and replace them with '!'.