Python regex to print all sentences that contain two identified classes of markup

Question

I wish to read in an XML file, find all sentences that contain both the <emotion> markup and the <LOCATION> markup, then print each of those sentences on its own line. Here is a sample of the code:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

The regex here grabs all sentences with "wonderful" and "omaha" in them, and returns:

Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.

Which is perfect, but I really want to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion," the regex fails to return any output. So the following code yields no result:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

My question is: How can I modify my regular expression in order to grab only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.

(For what it's worth, I'm working on parsing my text in BeautifulSoup as well, but wanted to give regular expressions one last shot before throwing in the towel.)

If you're working with markup, XML parsers or more specifically, looking into [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) could prove to be a lifesaver. — jedwards, Jun 19 '13 at 23:11
Indeed I am working with xml files, and am parsing with the BeautifulSoup package. I wanted to cut some corners with regular expressions, but that might prove impossible. Either way, thank you for the tip. — duhaime, Jun 19 '13 at 23:25
At a glance, the thing that jumps out at me is that you have a lookahead looking for a space (`\s`) right after "emotion". The cheap solution might be to add a `>` so that it can find its space, like: `... *?\bemotion>(?=\s|\ ... `. That's untested, so other problems may yet exist. Frankly, your regex is kinda incomprehensible, and I'd say it's probably not a very effective corner to cut. — femtoRgon, Jun 19 '13 at 23:30
Thank you, femtoRgon! You were spot on! I'd uprate your comment if I had the power to do so. — duhaime, Jun 19 '13 at 23:40
Yes, I just tested it as well, and it seems that really was the only problem, so I've posted it as an answer instead. I'm glad that's worked out. — femtoRgon, Jun 19 '13 at 23:45

score 1 · Accepted Answer · answered Jun 19 '13 at 23:43

Your problem appears to be that your regex is expecting a space (\s) to follow the matching word, as seen with:

emotion(?=\s|\.|$)

Inside a tag, however, emotion is followed by a > rather than a space, so the lookahead fails and no match is found. To fix it, you can just add the > after emotion, like:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)

Upon testing, this seems to solve your problem. Be sure to treat "LOCATION" the same way:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
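
To see the lookahead behaviour in isolation, here is a minimal sketch that tests only the word-plus-lookahead fragment against the sample text from the question. The variable names (strict, with_bracket, strict_hits, bracket_hits) are made up for illustration and are not part of the original code.

```python
import re

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

# Fragment from the question: the word must be followed by
# whitespace, a period, or the end of the string.
strict = re.compile(r'\bemotion(?=\s|\.|$)', re.I)

# Adjusted fragment: consume the tag's closing '>' before the lookahead runs.
with_bracket = re.compile(r'\bemotion>(?=\s|\.|$)', re.I)

strict_hits = strict.findall(text)         # 'emotion' is always followed by '>'
bracket_hits = with_bracket.findall(text)  # '>' is followed by a space in this text

print(len(strict_hits), len(bracket_hits))
```

On this sample the first pattern finds nothing, while the second matches every opening and closing emotion tag. Note that it only works here because each tag is followed by a space; a tag immediately followed by other text would still fail the lookahead.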

score 0 · Answer 2 · answered Jun 19 '13 at 23:26

If I understand correctly, you are trying to remove the <emotion> </emotion> and <LOCATION> </LOCATION> tags?

If that is what you want, you can do this:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

def remove_xml_tags(xml):
    content = re.compile(r'<.*?>')
    return content.sub('', xml)

data = remove_xml_tags(text)

out.write(data + '\n')

out.close()

Victor Castillo Torres

Thank you, Victor. Unfortunately, I do not want to remove the markup. I want instead to find all sentences that contain both the <emotion> and <LOCATION> markup. Then I want to print each of those sentences (in their entirety) on its own line. Nonetheless, I thank you for your suggested code! – duhaime Jun 19 '13 at 23:29

score 0 · Answer 3 · answered Jun 22 '13 at 15:53

I have just discovered that the regex may be bypassed altogether. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it might help others who find themselves where I found myself, I'll post my code:

# read in your file
f = open('sampleinput.txt', 'r')

# use read method to convert the read data object into string
readfile = f.read()

#########################
# now use the replace() method to clean data
#########################

# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')

# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')

# replace all ? with .
noquestions = nocommas.replace('?', '.')

# replace all ! with .
noexclamations = noquestions.replace('!', '.')

# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')

######################
# now use replace() to get rid of periods that don't end sentences
######################

# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr') 

#replace 'Mrs.' with 'Mrs' etc. 

cleantext = nomisters

#now, having cleaned the input, find all sentences that contain your two target words. To find markup, just replace "Toby" and "pipe" with <markupclassone> and <markupclasstwo>

periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        print(x)
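
The same split-and-filter idea can be applied directly to the markup from the question. The following is only a sketch: the function name sentences_with_tags and its tags parameter are hypothetical, and it assumes that every sentence ends with a period followed by whitespace (abbreviations such as "Mr." would still need the cleanup shown above).

```python
import re

def sentences_with_tags(text, tags=("<emotion>", "<LOCATION>")):
    """Return the sentences of `text` that contain every tag in `tags`."""
    # Split on whitespace that follows a period, so each sentence keeps its period.
    sentences = re.split(r'(?<=\.)\s+', text.strip())
    return [s for s in sentences if all(tag in s for tag in tags)]

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

result = sentences_with_tags(text)
for sentence in result:
    print(sentence)
```

Unlike the replace() chain, this keeps the sentence-ending period and leaves commas intact, and the tag check is case-sensitive, so "<location>" would not count as "<LOCATION>".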
