Greediness Limitation in Python Regular Expressions
Repetition operators in regular expressions are greedy by default: they match as many characters as possible. Let's look at an example. Suppose we have the following string:
txt = 'aeeex zzz x kkk'
In this string we want to find the substring 'aeeex' using the following pattern: the letter 'a', then any character one or more times, then the letter 'x':
import re
res = re.sub('a.+x', '!', txt)
print(res)
Although we expected the string '! zzz x kkk', the output is '! kkk'. The reason is that our regular expression matches everything from the letter 'a' to the letter 'x', but the string contains two letters 'x'. Because the repetition operator is greedy, the match extends to the very last 'x' and captures more than we wanted.
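To see exactly what the greedy pattern captured, the match itself can be inspected with re.search. This is only a small sketch; the variable name greedy is chosen here for illustration:

```python
import re

txt = 'aeeex zzz x kkk'

# Greedy: '.+' expands as far as it can, so the match ends at the last 'x'
greedy = re.search('a.+x', txt)
print(greedy.group())  # aeeex zzz x
```

The matched text runs from the first 'a' to the second 'x', which is why the whole span is replaced by a single '!'.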
Of course, this is often the behavior we want. But in this particular case, we need to turn off greediness and tell the regular expression to stop at the first 'x'. To do this, put a question mark after the repetition operator:
res = re.sub('a.+?x', '!', txt)
print(res)  # prints '! zzz x kkk'
Greediness can be limited in the same way for the other repetition operators: *, ?, and {} become *?, ??, and {}?.
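The following sketch compares the greedy and non-greedy forms of these operators side by side. The sample strings and variable names are made up for illustration and do not come from the lesson:

```python
import re

# '*' versus '*?': zero or more repetitions
star_greedy = re.findall('<.*>', '<a><b>')
star_lazy = re.findall('<.*?>', '<a><b>')
print(star_greedy)  # ['<a><b>']
print(star_lazy)    # ['<a>', '<b>']

# '?' versus '??': zero or one repetition
opt_greedy = re.search('ab?', 'ab').group()
opt_lazy = re.search('ab??', 'ab').group()
print(opt_greedy)  # ab
print(opt_lazy)    # a

# '{2,4}' versus '{2,4}?': from two to four repetitions
range_greedy = re.search('a{2,4}', 'aaaaa').group()
range_lazy = re.search('a{2,4}?', 'aaaaa').group()
print(range_greedy)  # aaaa
print(range_lazy)    # aa
```

In each pair the non-greedy form stops at the shortest text that still lets the whole pattern match.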
Given a string:
txt = 'aba accca azzza wwwwa'
Write a regular expression that finds all substrings with the letter 'a' at both edges and replaces each of them with '!'. Between the letters 'a' there can be any characters (except 'a').