
I have a list of XML files and need to filter some labels out of them. The problem is the text: it contains a lot of HTML markup and URLs, and I need plain text. I would like to remove these elements in a loop and then append the cleaned text to a new list. This is what I have so far:

    data = []
    for conv in root.findall('./conversations/conversation'):
        pattern = re.compile( r'!\b(((ht|f)tp(s?))\://)?(www.|[a-z].)[a-z0-9\-\.]+\.)(\:[0-9]+)*(/($|[a-z0-9\.\,\;\?\\\\\\\+&%\$#\=~_\-]+))*\b!i')
        if pattern.search(conv.text):
           re.sub(pattern, ' ')
           data.append(conv.text)    

I can't find the right regex to remove things like this: br />;<br /> and URLs like this: http://neocash43.blog.com/2011/07/26/psp-sport-assessment-neopets-the-wand-of-wishing/</a>

The second problem is that, with this XML root structure, I don't know how to append the cleaned conversation text to my new list.

Bill Bell
Bambi

2 Answers


You could try http://pyparsing.wikispaces.com/file/view/htmlStripper.py/591745692/htmlStripper.py, which uses the pyparsing library. I just ran this script on my machine with Python 3.4.
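
If installing pyparsing is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is not the htmlStripper.py script, only an illustration of tag stripping; the names TextExtractor and strip_html are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())


print(strip_html('Hello<br />world <a href="http://x.com/">link</a>'))
```

Note that a stray fragment such as br />; has no opening <, so the parser treats it as ordinary text and leaves it in place; it would need separate handling.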

Bill Bell

The pattern.web Python module has an HTML-to-text function called plaintext. By default, this function removes all HTML tags. For URLs, use the existing regex.
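
One caveat about the regex in the question: the !...!i wrapper is PHP-style delimiter syntax, which Python's re module does not understand (the i flag corresponds to re.IGNORECASE), and the pattern also appears to contain an unbalanced parenthesis. In addition, re.sub needs the input string as its third argument and returns a new string rather than changing conv.text. Below is a minimal sketch of the loop under those corrections. The two patterns are deliberately simple stand-ins, not the question's regex and not pattern.web's plaintext, and the sample XML and the clean helper are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns (assumptions): not a full URL or HTML grammar.
URL_RE = re.compile(r"(?:https?|ftp)://\S+|www\.\S+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean(text):
    """Hypothetical helper: drop tags first, then URLs, then extra spaces."""
    text = TAG_RE.sub(" ", text)   # sub() returns a new string
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


# Made-up sample with the same ./conversations/conversation layout.
sample = """<root><conversations>
<conversation>Look here:&lt;br /&gt;http://example.com/a/b/&lt;/a&gt; thanks</conversation>
<conversation>no markup</conversation>
</conversations></root>"""

root = ET.fromstring(sample)

data = []
for conv in root.findall("./conversations/conversation"):
    # conv.text is None for empty elements, so fall back to "".
    data.append(clean(conv.text or ""))

print(data)
```

Appending the return value of the cleaning step, rather than conv.text itself, is what gets the cleaned text into the new list.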

Panagiotis Simakis
  • Yes, I wanted to try that, but it is not compatible with Python 3.6. So I'm still stuck. – Bambi Apr 13 '17 at 10:19