I understand how to make matching case-insensitive in Python, and I understand how to use lookaheads / lookbehinds, but how do I combine the two?

For instance, my text is

mytext = 'I LOVE EATING popsicles at home.'

I want to extract popsicles from this text (my target food item). This regex works great:

import re
regex = r'(?<=I\sLOVE\sEATING\s)[a-z0-9]*(?=\sat\shome)'
re.search(regex, mytext)

However, I'd like to account for the scenario where someone writes

i LOVE eating apples at HOME.

That should match. But "I LOVE eating Apples at home" should NOT match, since Apples starts with an uppercase letter.

Thus, I'd like local case insensitivity in my lookahead (?=\sat\shome) and lookbehind (?<=I\sLOVE\sEATING\s) groups. I know I can use the re.IGNORECASE flag for global case insensitivity, but I only want the lookahead/lookbehind groups to be case-insensitive, not my actual target expression.

Traditionally, I'd write (?i:I LOVE EATING) to create a case-insensitive non-capturing group that can match both I LOVE EATING and I love eating. However, if I try to combine the two:

(?i:<=I\sLOVE\sEATING\s)

I get no matches, since (?i: now opens a case-insensitive non-capturing group and the <= after it is treated as literal text to match rather than as the start of a lookbehind. Is there a way to combine lookaheads/lookbehinds with case insensitivity?

Edit: I don't think this is a duplicate of the marked question. That question asks about part of a group; I'm asking about a specific subset: lookaheads and lookbehinds. The syntax is different here, and the answers in that other post do not directly apply. As the answers on this post suggest, you need workarounds to achieve this functionality that don't apply to the supposed duplicate SO post.

Yu Chen
  • Hmmm... given the conversation that's likely been moved to chat, I think you should change the title to something like, **`Use case insensitivity exclusively inside of lookarounds.`** I can only imagine that you're bound to get several answers where people will misunderstand your question. – FailSafe Mar 17 '19 at 15:43
  • Good point, will do so now. – Yu Chen Mar 17 '19 at 15:44
  • @wiktor-stribiżew got a request for re-opening this question in my queue. Seems the asker got an answer worthy of acceptance and another participant gave a comment suggesting to leave this open after reformulating the question. As you seem to be an expert in RegEx, thought it would be good to ping this back to you: Can this be re-opened? – Ida Mar 18 '19 at 08:19
  • @IdaEbkes These are the same issues. Lookbehinds *are* part of regex patterns. The answers [there](https://stackoverflow.com/questions/1455160/how-to-set-ignorecase-flag-for-part-of-regular-expression-in-python) fully address the issue, show workarounds for older versions and provide the solutions for Python versions from 3.6 up. – Wiktor Stribiżew Mar 18 '19 at 08:21
  • @WiktorStribiżew thanks for your feedback, leaving closed as dup. Suppose more answers may be considered to be added to the dup, then. – Ida Mar 18 '19 at 08:28

3 Answers

In Python 3.6 and later, you can make the regex case-insensitive globally with (?i) and switch a single group back to case-sensitive with (?-i:groupcontent):

regex = r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'

Instead of (?i), you can also pass re.I to the search call. The following is equivalent to the regex above:

regex = r'(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
re.search(regex, mytext, re.I)
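
As a quick check of this approach, here is a minimal sketch that runs the pattern against the three sentences from the question. It assumes Python 3.6 or later, where the re module accepts scoped flag groups such as (?-i:...); the extract_food helper name is only illustrative and not part of the original answer.

```python
import re

# Global (?i) makes the lookarounds case-insensitive;
# (?-i:...) turns case sensitivity back on for the target only.
# Assumes Python 3.6+ (scoped inline flags).
PATTERN = re.compile(r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)')

def extract_food(text):
    """Hypothetical helper: return the lowercase food item, or None."""
    match = PATTERN.search(text)
    return match.group() if match else None

print(extract_food('I LOVE EATING popsicles at home.'))  # popsicles
print(extract_food('i LOVE eating apples at HOME.'))     # apples
print(extract_food('I LOVE eating Apples at home.'))     # None
```

The third call returns None because the capitalized Apples cannot be matched by the case-sensitive [a-z0-9]* group, while the surrounding text is still matched regardless of case.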
Endre Both
  • No, it won't work in `python 2`. `re.compile(r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)')` will give errors. – anubhava Mar 17 '19 at 15:48
  • @anubhava In Python 2 you can use the [regex](https://pypi.org/project/regex/) library (tested with 2.7.8). – Endre Both Mar 17 '19 at 15:55

Unfortunately, the re module in Python 2 (and in Python 3 before 3.6) doesn't allow inline mode modifiers to be applied to only part of a regex.

As a workaround, you may use this regex:

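import re

# Every letter in the lookarounds is spelled out as a two-case character class,
# so no flags are needed; the target [a-z0-9]* stays case-sensitive.
reg = re.compile(r'(?<=[Ii]\s[Ll][Oo][Vv][Ee]\s[Ee][Aa][Tt][Ii][Nn][Gg]\s)[a-z0-9]*(?=\s[Aa][Tt]\s[Hh][Oo][Mm][Ee])')

# print() calls as in Python 3; the pattern itself does not depend on the Python version.
print("Case 1: ", reg.findall('I LOVE Eating popsicles at HOME.'))
print("Case 2: ", reg.findall('I LOVE EATING popsicles at home.'))
print("Case 3: ", reg.findall('I LOVE Eating Popsicles at HOME.'))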

Output:

Case 1:  ['popsicles']
Case 2:  ['popsicles']
Case 3:  []
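
Writing the character classes by hand is error-prone, so the same workaround can be generated. The following is only a sketch: the ci helper is a hypothetical name, and it assumes the literal text contains only ASCII letters, whitespace, and characters that re.escape can handle.

```python
import re
import string

def ci(literal):
    """Hypothetical helper: expand each ASCII letter into a two-case
    character class, e.g. 'at' -> '[Aa][Tt]'. Whitespace becomes \\s."""
    parts = []
    for ch in literal:
        if ch in string.ascii_letters:
            parts.append('[{}{}]'.format(ch.upper(), ch.lower()))
        elif ch.isspace():
            parts.append(r'\s')
        else:
            parts.append(re.escape(ch))
    return ''.join(parts)

# Each generated piece matches exactly one character, so the lookbehind
# keeps the fixed width that the re module requires.
pattern = re.compile(
    r'(?<=' + ci('I LOVE EATING ') + r')[a-z0-9]*(?=' + ci(' at home') + r')'
)

print(pattern.findall('I LOVE Eating popsicles at HOME.'))  # ['popsicles']
print(pattern.findall('I LOVE Eating Popsicles at HOME.'))  # []
```

Because no flags are involved, this variant does not rely on scoped inline modifiers.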
anubhava

Using (?i:...) you can set a regex flag (in this case i) locally (inline) for just part of the regex. This syntax requires Python 3.6 or later.

Such a local flag setting is also allowed inside a lookbehind or lookahead, while the rest of the regex stays free of any flag.

I modified your code so that it compiles the regex once and then calls it twice, on different strings:

import re

mytext1 = 'i LOVE eating Apples at HOME.'
mytext2 = 'i LOVE eating apples at HOME.'
pat = re.compile(r'(?<=(?i:I\sLOVE\sEATING\s))[a-z0-9]+(?=(?i:\sAT\sHOME))')
m = pat.search(mytext1)
print('1:', m.group() if m else '** Not found **')
m = pat.search(mytext2)
print('2:', m.group() if m else '** Not found **')

It prints:

1: ** Not found **
2: apples

so the match is only for the second source string.
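
If the surrounding phrases vary, the same idea can be wrapped in a small builder. This is a sketch under assumptions: between_ci is a hypothetical helper, it requires Python 3.6+ for (?i:...), and it escapes the phrases with re.escape, so spaces are matched literally rather than with \s.

```python
import re

def between_ci(before, after, target=r'[a-z0-9]+'):
    """Hypothetical helper: build a pattern whose lookarounds ignore case
    while the target expression stays case-sensitive."""
    return re.compile(
        r'(?<=(?i:{}))'.format(re.escape(before))
        + target
        + r'(?=(?i:{}))'.format(re.escape(after))
    )

pat = between_ci('I LOVE EATING ', ' at home')

m = pat.search('i LOVE eating apples at HOME.')
print(m.group() if m else '** Not found **')   # apples

m = pat.search('i LOVE eating Apples at HOME.')
print(m.group() if m else '** Not found **')   # ** Not found **
```

The escaped phrase has a fixed length, so it remains valid inside the lookbehind.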

Valdi_Bo