I have a pandas DataFrame with multiple columns. I am working on a specific column named "Text_annotated", whose contents look like this:

Text_annotated
<html> Lorem ipsum dolor sit amet, <phrase>consectetur adipiscing elit</phrase>, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <phrase>Ut enim ad minim veniam</phrase>, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</html>
<html> Faucibus vitae aliquet nec ullamcorper sit amet risus nullam. Pellentesque sit amet porttitor eget dolor morbi. <phrase>Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.</phrase> Tellus at urna condimentum mattis.</html>
<html>Pulvinar etiam non quam lacus. Amet purus gravida quis blandit. Scelerisque eu ultrices vitae auctor eu augue ut. Tincidunt lobortis feugiat vivamus at augue eget arcu dictum varius. Pellentesque adipiscing commodo elit at imperdiet.</html>

I want to extract only the text between the <phrase></phrase> tags, so I decided to use PyQuery. So far I have tried:

from pyquery import PyQuery as pq

text_phrases = df['Text_annotated'].tolist()
doc = pq(f"{text_phrases}")
phrase_macro = doc.find("phrase").text()

which returns a pyquery.pyquery.PyQuery where each line contains only one result, e.g.:

consectetur adipiscing elit
Ut enim ad minim veniam
Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.

Thus, my question is whether it is possible to group the results per row of the df, separated by commas, e.g.:

consectetur adipiscing elit, Ut enim ad minim veniam
Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.

(I have also tried iterating over the objects with phrases_res = [h.text() for h in doc('phrase').items()], which didn't work.)

Any help/suggestion is much appreciated.

PS. Each row is just wrapped in an <html> tag, without any other particular structure.

EDIT: I also tried to "separate" the rows by the html tag, but this returned the same result as before.

rows = doc('html')
for row in rows.text():
    phrase_res = doc.find("phrase").text()
new_df['Phrases_res'] = phrase_res
new_df.head(5)
Marrluxia

1 Answer

You can use pandas.Series.str.findall with a regular expression to get, for each row, a list of all the strings found between the two delimiters.

Try this:

import pandas as pd

pd.options.display.max_colwidth = None

data = ['<html> Lorem ipsum dolor sit amet, <phrase>consectetur adipiscing elit</phrase>, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <phrase>Ut enim ad minim veniam</phrase>, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</html>',
        '<html> Faucibus vitae aliquet nec ullamcorper sit amet risus nullam. Pellentesque sit amet porttitor eget dolor morbi. <phrase>Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.</phrase> Tellus at urna condimentum mattis.</html>',
        '<html>Pulvinar etiam non quam lacus. Amet purus gravida quis blandit. Scelerisque eu ultrices vitae auctor eu augue ut. Tincidunt lobortis feugiat vivamus at augue eget arcu dictum varius. Pellentesque adipiscing commodo elit at imperdiet.</html>']

df = pd.DataFrame(data, columns=['Text_annotated'])

df['Phrases'] = df['Text_annotated'].str.findall(r"<phrase>(.*?)</phrase>")

>>> display(df)
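
Since the question asks for one comma-separated string per row rather than a list, the lists produced by findall can be joined with pandas.Series.str.join. The following is a minimal sketch on shortened sample rows (the shortened strings are an assumption made for brevity, not the original data); rows without any <phrase> tag end up as an empty string.

```python
import pandas as pd

# Shortened stand-ins for the rows in the question (illustrative data only).
df = pd.DataFrame({
    "Text_annotated": [
        "<html>Lorem <phrase>consectetur adipiscing elit</phrase>, sed "
        "<phrase>Ut enim ad minim veniam</phrase>, quis.</html>",
        "<html>Faucibus <phrase>Tincidunt praesent semper.</phrase> Tellus.</html>",
        "<html>Pulvinar etiam non quam lacus.</html>",
    ]
})

# findall yields one list of captured groups per row;
# str.join then collapses each list into a single comma-separated string.
df["Phrases"] = (
    df["Text_annotated"]
    .str.findall(r"<phrase>(.*?)</phrase>")
    .str.join(", ")
)

print(df["Phrases"].tolist())
```

Note that this pattern only matches a bare <phrase> opening tag, so it would need adjusting if the tags carry attributes.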


Timeless
    Thank you for your answer. Tbh some tags have also attributes and other nested tags, which complicates the regex. (I put the simplest example in the df above, sorry.) + I am curious to see whether it's possible with pyquery. – Marrluxia Sep 14 '22 at 20:47