How to remove HTML and URLs from text with Python

Question

I have a list of XML files and need to filter some labels out of them. The problem is the text: it contains a lot of HTML markup and URLs, and I need plain text. I would like to remove these elements in a loop and then append the cleaned text to my new list. This is what I have so far.

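    import re

    # Python's re has no !...!i delimiters; pass re.IGNORECASE instead.
    url_pattern = re.compile(r'\b(?:(?:https?|ftp)://|www\.)[^\s<>"]+', re.IGNORECASE)
    tag_pattern = re.compile(r'<[^>]+>')  # crude match for tags such as <br />

    data = []
    for conv in root.findall('./conversations/conversation'):
        # re.sub returns a new string, so keep the result and append that
        text = tag_pattern.sub(' ', conv.text or '')
        data.append(url_pattern.sub(' ', text).strip())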

I can't find the right regex to remove things like <br />;<br /> and URLs like this: http://neocash43.blog.com/2011/07/26/psp-sport-assessment-neopets-the-wand-of-wishing/</a>

The second problem is that, with this XML root structure, I don't know how to append the cleaned conversation text to my new list.

I would suggest looking into beautifulsoup4, "a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree." — Max Power, Apr 12 '17 at 15:35
Are you confident of that URL? When I try to load it I get a 502 Bad Gateway. For my clarification, do you want to remove all of the HTML tags from a string that you have recovered from some xml? — Bill Bell, Apr 12 '17 at 15:35
@BillBell I'm so sorry, The Url was an example of an url I want to remove — Bambi, Apr 12 '17 at 15:37
@Szalbolcs Input text: '\n\t\t\tZafaras really have the finest hearing out of any other pet in Neopia.
;
;If you havent heard of Neopets than I have to significantly wonder what planet you arrive from, you surely dont hail from Neopia. , Output: Zafaras really have the finest hearing out of any other pet in Neopia.If you havent heard of Neopets than I have to significantly wonder what planet you arrive from, you surely dont hail from Neopia. — Bambi, Apr 12 '17 at 15:45

score 1 · Accepted Answer · answered Apr 13 '17 at 16:54

You could try http://pyparsing.wikispaces.com/file/view/htmlStripper.py/591745692/htmlStripper.py which uses the pyparsing library. I just used this script on my machine with Python 3.4.
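If installing pyparsing is not an option, a similar effect can be sketched with the standard library's html.parser module. This is not the htmlStripper.py script linked above, only a minimal stand-in; the names TagStripper and strip_html are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are ignored.
        self.chunks.append(data)


def strip_html(text):
    """Return text with HTML tags removed and whitespace collapsed."""
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

Inside the loop from the question, this could be called as data.append(strip_html(conv.text or '')), assuming conv.text holds the raw conversation text.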

Bill Bell

Pyparsing is no longer hosted on wikispaces.com. Go to https://github.com/pyparsing/pyparsing – PaulMcG Aug 27 '18 at 12:53

score 0 · Answer 2 · answered Apr 12 '17 at 15:21

The pattern.web Python module has an HTML-to-text function called plaintext. By default, this function removes all HTML tags. For URLs, use your existing regex.
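For the URL step, a Python-flavoured pattern could look roughly like the sketch below. It uses only the standard re module; URL_RE and remove_urls are illustrative names, and the pattern is a deliberately loose assumption (anything starting with http://, https://, ftp:// or www. up to the next whitespace, quote, or angle bracket), not a full URL grammar.

```python
import re

# Loose URL matcher: scheme or "www." prefix, then everything up to
# whitespace, a quote, or an angle bracket (so a trailing </a> is left alone).
URL_RE = re.compile(r"\b(?:(?:https?|ftp)://|www\.)[^\s<>\"']+", re.IGNORECASE)


def remove_urls(text):
    """Replace each URL with a space, then collapse repeated whitespace."""
    without_urls = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()
```

Running this before or after the HTML stripping step should both work, since the pattern stops at the first angle bracket.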

Panagiotis Simakis

Yes, I wanted to try that, but it is not compatible with Python 3.6. So I'm still stuck. – Bambi Apr 13 '17 at 10:19
