I want to read in an XML file, find all sentences that contain both the markup <emotion> and the markup <LOCATION>, and then print each of those sentences on its own line. Here is a sample of the code:
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard."
out = open('out.txt', 'w')
# The pattern has a single capturing group, so each match is the sentence as a string
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
The regex here grabs all sentences with "wonderful" and "omaha" in them, and returns:
Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.
This is perfect, but what I really want is to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion", the regex returns no matches. The following code yields no result:
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
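To narrow this down, here is a minimal sketch that isolates just the keyword part of the pattern. The variable names (fragment, keyword_lookahead, hit_wonderful, hit_emotion) are made up for illustration. It suggests the trailing lookahead (?=\s|\.|$) may be the culprit: "wonderful" is followed by a space, but "emotion" inside the tag is followed by ">", which that lookahead does not allow.

```python
import re

# A fragment of the sample text; the names here are illustrative only.
fragment = "Cello is a <emotion> wonderful </emotion> parakeet"

# The trailing lookahead used after each keyword in the full regex:
# the keyword must be followed by whitespace, a period, or end of string.
keyword_lookahead = r'(?=\s|\.|$)'

hit_wonderful = re.search(r'\bwonderful' + keyword_lookahead, fragment, flags=re.I)
hit_emotion = re.search(r'\bemotion' + keyword_lookahead, fragment, flags=re.I)

print(hit_wonderful is not None)  # prints True: "wonderful" is followed by a space
print(hit_emotion is not None)    # prints False: "emotion" is followed by ">"
```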
My question is: How can I modify my regular expression in order to grab only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.
(For what it's worth, I'm working on parsing my text in BeautifulSoup as well, but wanted to give regular expressions one last shot before throwing in the towel.)
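For reference, one variation that might be worth trying is to put the literal tags into the two lookaheads instead of bare keywords followed by (?=\s|\.|$). This is only a sketch under the assumption that the tags always appear exactly as <emotion> and <LOCATION> (no attributes) and that sentences end with a period followed by whitespace or the end of the text. The names SENTENCE_WITH_BOTH_TAGS and sentences_with_both_tags are hypothetical.

```python
import re

# Assumption: tags appear literally as <emotion> and <LOCATION>, with no attributes.
SENTENCE_WITH_BOTH_TAGS = re.compile(
    r'(?:(?<=\.)\s+|^)'                         # start of text, or whitespace after a period
    r'('
    r'(?=(?:(?!\.(?:\s|$)).)*?<emotion>)'       # <emotion> occurs before the sentence ends
    r'(?=(?:(?!\.(?:\s|$)).)*?<LOCATION>)'      # <LOCATION> occurs before the sentence ends
    r'.*?\.(?=\s|$)'                            # the sentence itself, up to its final period
    r')',
    flags=re.I,
)

def sentences_with_both_tags(text):
    """Return the sentences that contain both an <emotion> and a <LOCATION> tag."""
    return SENTENCE_WITH_BOTH_TAGS.findall(text)

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

for sentence in sentences_with_both_tags(text):
    print(sentence)
```

On the sample text this should print only the first sentence, since the second one has <emotion> but no <LOCATION>.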