Handle <nobr> tag in Python sgmllib

Question

I'm trying to parse a page with my Python script, but the <nobr> tag together with a bare '&' is giving me trouble. Here is the actual HTML:

<A HREF="http://enpass.in/algo/c12.html" CLASS="style"> <NOBR>Simulation for 1st & 2nd path</NOBR></A>

The handle_data method of my parser (built on sgmllib) does not handle this data the way I expect. Here is the handle_data code:

def handle_data(self, data):
    self.datainfo.append(data)

I expect the datainfo list to have only one element, namely "Simulation for 1st & 2nd path".

However, when I print datainfo, it actually contains 7 elements:

datainfo -> ['', '', 'Simulation for 1st', '&', '2nd path', '', '']

What's happening?

Er, `urllib2` doesn't do any HTML parsing. What are you actually using? — Daniel Roseman, Feb 18 '11 at 08:32
Just for curiosity: you're using urllib2 as a html parser? How? — Herberth Amaral, Feb 18 '11 at 08:32

score 0 · Answer 1 · answered Feb 18 '11 at 08:52

You need to encode the ampersand as &amp; for this to be valid HTML.
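
As a rough illustration (this is not the sgmllib code from the question): parsers of this kind may deliver the text of one element in several pieces, so it is usually safer to join the pieces than to expect a single handle_data call. The sketch below uses Python 3's html.parser as a stand-in, since sgmllib is not available there. TextCollector and anchor_text are made-up names for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text chunk passed to handle_data (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode &amp; into & itself
        super().__init__(convert_charrefs=True)
        self.datainfo = []

    def handle_data(self, data):
        # May be called more than once per element, so just accumulate
        self.datainfo.append(data)


def anchor_text(html):
    """Return the text content of the markup as one stripped string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.datainfo).strip()


# The markup from the question, with a bare '&' and with an encoded one
raw = ('<A HREF="http://enpass.in/algo/c12.html" CLASS="style"> '
       '<NOBR>Simulation for 1st & 2nd path</NOBR></A>')
encoded = raw.replace(" & ", " &amp; ")

print(anchor_text(raw))
print(anchor_text(encoded))
```

Both inputs should yield the same text, which is why joining the chunks is less fragile than depending on how a particular parser splits the data.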

Bjorn

Any idea why those empty strings are showing up? – Neo Feb 18 '11 at 09:02
Seems like it's one for each element: [ '', -> A, '', -> NOBR, 'value', '', -> /NOBR, '', -> /A ] – Bjorn Feb 18 '11 at 09:05
Just out of curiosity, why are you using sgmllib? It's deprecated as of 2.6 and removed in 3.0. Why didn't you choose something like BeautifulSoup? Or, if it's just the value (Simulation ...) you want, why not use a regular expression to strip all HTML? – Bjorn Feb 18 '11 at 09:09
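
For what it's worth, the regular-expression route mentioned above could look roughly like this. It is a crude sketch that assumes simple markup with no '>' inside attribute values, no comments, and no script blocks. strip_tags is a made-up name.

```python
import re

# Anything from '<' up to the next '>' is treated as a tag
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove anything that looks like a tag and trim surrounding whitespace."""
    return TAG_RE.sub("", html).strip()


sample = ('<A HREF="http://enpass.in/algo/c12.html" CLASS="style"> '
          '<NOBR>Simulation for 1st & 2nd path</NOBR></A>')

print(strip_tags(sample))
```

Note that this leaves entities such as &amp; undecoded, so it only suits cases where the raw text is good enough.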
