I have a pandas DataFrame with multiple columns. I am working on a specific column named "Text_annotated", whose structure looks like this:
Text_annotated |
---|
<html> Lorem ipsum dolor sit amet, <phrase>consectetur adipiscing elit</phrase>, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <phrase>Ut enim ad minim veniam</phrase>, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</html> |
<html> Faucibus vitae aliquet nec ullamcorper sit amet risus nullam. Pellentesque sit amet porttitor eget dolor morbi. <phrase>Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.</phrase> Tellus at urna condimentum mattis.</html> |
<html>Pulvinar etiam non quam lacus. Amet purus gravida quis blandit. Scelerisque eu ultrices vitae auctor eu augue ut. Tincidunt lobortis feugiat vivamus at augue eget arcu dictum varius. Pellentesque adipiscing commodo elit at imperdiet.</html> |
and I want to extract only the text between the <phrase></phrase> tags. For this reason, I decided to use PyQuery. So far I have tried:
from pyquery import PyQuery as pq

text_phrases = df['Text_annotated'].tolist()
doc = pq(f"{text_phrases}")
phrase_macro = doc.find("phrase").text()
which returns a pyquery.pyquery.PyQuery where each "newline" contains only one result, e.g.
consectetur adipiscing elit
Ut enim ad minim veniam
Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.
Thus, my question is whether it's possible to group the results for each row in the df, separated by a comma, e.g.
consectetur adipiscing elit, Ut enim ad minim veniam
Tincidunt praesent semper feugiat nibh sed pulvinar. Lobortis elementum nibh tellus molestie nunc non blandit.
(I have also tried to iterate over the objects with phrases_res = [h.text() for h in doc('phrase').items()], which didn't work.)
Any help/suggestion is much appreciated.
PS. Each row is just wrapped in an <html> tag, without any other particular structure.
EDIT: I also tried to "separate" the rows according to the html tag, but this returned the same result as before.
rows = doc('html')
for row in rows.text():
    phrase_res = doc.find("phrase").text()
new_df['Phrases_res'] = phrase_res
new_df.head(5)
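For reference, the per-row grouping described above could be sketched as follows. This is only a minimal illustration, not a PyQuery solution: it swaps in the standard library's html.parser so that it runs without extra dependencies, and the names PhraseCollector, extract_phrases, rows and results are hypothetical. The key idea is to parse each row separately instead of building one document from the whole list.

```python
from html.parser import HTMLParser


class PhraseCollector(HTMLParser):
    """Collects the text found inside each <phrase>...</phrase> element."""

    def __init__(self):
        super().__init__()
        self.phrases = []     # finished phrases, in document order
        self._buffer = None   # text chunks of the open phrase, None when outside one

    def handle_starttag(self, tag, attrs):
        if tag == "phrase":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "phrase" and self._buffer is not None:
            self.phrases.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        # Only keep text while a <phrase> element is open.
        if self._buffer is not None:
            self._buffer.append(data)


def extract_phrases(row_html, sep=", "):
    """Return the phrases of a single row joined by `sep` (empty string if none)."""
    parser = PhraseCollector()
    parser.feed(row_html)
    parser.close()
    return sep.join(parser.phrases)


# Shortened sample rows in the same shape as the "Text_annotated" column.
rows = [
    "<html> Lorem ipsum, <phrase>consectetur adipiscing elit</phrase>, sed do. "
    "<phrase>Ut enim ad minim veniam</phrase>, quis nostrud.</html>",
    "<html>Pulvinar etiam non quam lacus.</html>",
]
results = [extract_phrases(row) for row in rows]
```

With a dataframe, the same function could presumably be applied per row with df['Text_annotated'].apply(extract_phrases) and the result assigned to a new column. Rows without any <phrase> tag yield an empty string here. The same per-row idea should carry over to PyQuery by creating one document per row rather than one for the stringified list.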