I have this list of XML files, and I have to filter some labels out of them. The problem is the text: it contains a lot of HTML markup and URLs, and I need plain text. I would like to remove these elements in a loop and then append the cleaned text to my new list. This is what I have so far:
import re

data = []
# URL pattern rewritten as a valid Python regex (the original used PHP-style !...!i delimiters)
pattern = re.compile(
    r'\b((ht|f)tps?://)?(www\.)?[a-z0-9-]+(\.[a-z0-9-]+)+(:[0-9]+)?'
    r'(/[a-z0-9.,;?\\+&%$#=~_-]*)*', re.IGNORECASE)
for conv in root.findall('./conversations/conversation'):
    text = conv.text or ''
    # re.sub does not change the string in place; keep the result and append it
    data.append(pattern.sub(' ', text))
I can't find the right regex to remove leftover markup like this: br />;<br />
and URLs like this: http://neocash43.blog.com/2011/07/26/psp-sport-assessment-neopets-the-wand-of-wishing/</a>
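To make it concrete, this is roughly the kind of helper I am trying to write (untested; the patterns and the clean_text name are just my own guesses, not a verified solution):

import re
import html

def clean_text(text):
    # drop anything that looks like an HTML/XML tag, e.g. <br /> or </a>
    text = re.sub(r'<[^>]+>', ' ', text)
    # drop bare URLs such as http://neocash43.blog.com/...
    text = re.sub(r'\b(?:(?:ht|f)tps?://|www\.)\S+', ' ', text, flags=re.IGNORECASE)
    # decode leftover entities like &amp; and collapse repeated whitespace
    return re.sub(r'\s+', ' ', html.unescape(text)).strip()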
The second problem is that, with this XML root structure, I don't know how to append the cleaned conversation text to my new list.
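For what it's worth, this is roughly how I imagined the loop once the cleaning works. I'm assuming itertext() is the right way to also pick up text nested inside child elements like the <a> links, and clean_text is the guessed helper from above:

data = []
for conv in root.findall('./conversations/conversation'):
    # itertext() gathers the element's own text plus the text of any nested children
    raw = ''.join(conv.itertext())
    data.append(clean_text(raw))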
For example, input: "; ;If you havent heard of Neopets than I have to significantly wonder what planet you arrive from, you surely dont hail from Neopia." Output: "Zafaras really have the finest hearing out of any other pet in Neopia.If you havent heard of Neopets than I have to significantly wonder what planet you arrive from, you surely dont hail from Neopia." – Bambi Apr 12 '17 at 15:45