I'm trying to parse a page with my Python script, but a <nobr> tag combined with '&' is giving me trouble. Here is the actual HTML:

<A HREF="http://enpass.in/algo/c12.html" CLASS="style"> <NOBR>Simulation for 1st & 2nd path</NOBR></A>

The handle_data method of my parser (built on sgmllib) is not handling this data properly. Here is the handle_data code:

def handle_data(self, data):
    self.datainfo.append(data)

I expect the datainfo list to have only one element, namely "Simulation for 1st & 2nd path".

However, when I print datainfo, it actually contains 7 elements:

datainfo -> ['', '', 'Simulation for 1st', '&', '2nd path', '', '']

What's happening?

Neo

1 Answer

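You need to encode the ampersand as &amp; for this to be valid HTML. A bare & is read as the possible start of an entity reference, which is why the parser hands it to handle_data separately from the text around it.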
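
If you cannot change the page itself, one workaround is to stop treating each handle_data call as a complete string: buffer the pieces while inside the link and join them when the link closes. The sketch below is only an illustration under some assumptions. It uses Python 3's html.parser.HTMLParser in place of sgmllib (which no longer exists in Python 3), and the class name LinkTextParser and the _chunks attribute are made up for this example. The same buffering idea should carry over to an sgmllib-based parser.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects the full text of each <a> element."""

    def __init__(self):
        super().__init__()
        self.datainfo = []    # one joined string per link
        self._chunks = None   # pieces of the link being read, or None outside a link

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <A> is reported as "a".
        if tag == "a":
            self._chunks = []

    def handle_data(self, data):
        # A parser may call this several times for one run of text,
        # so only buffer the piece here instead of storing it as a result.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.datainfo.append(text)
            self._chunks = None


page = ('<A HREF="http://enpass.in/algo/c12.html" CLASS="style"> '
        '<NOBR>Simulation for 1st & 2nd path</NOBR></A>')

parser = LinkTextParser()
parser.feed(page)
parser.close()
print(parser.datainfo)
```

Because the pieces are joined only at the closing tag, the result does not depend on how many times the parser chooses to call handle_data for the text in between.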

Bjorn
  • Any idea why those empty strings are showing up? – Neo Feb 18 '11 at 09:02
  • Seems like it's one for each element: [ '', -> A, '', -> NOBR, 'value', '', -> /NOBR, '', -> /A ] – Bjorn Feb 18 '11 at 09:05
  • Just out of curiosity, why are you using sgmllib? It's deprecated in 2.6 and removed in 3.0. Why didn't you choose something like BeautifulSoup? Or, if it's just the value (Simulation ...) you want, why not use a regular expression to strip all HTML? – Bjorn Feb 18 '11 at 09:09
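
For reference, the regular-expression approach mentioned in the last comment could look roughly like the sketch below. The function name strip_tags is hypothetical, and the pattern is deliberately naive: it is fine for a simple snippet like the one in the question, but it will break on markup where a ">" appears inside an attribute value, a comment, or a script.

```python
import re
from html import unescape


def strip_tags(markup):
    """Hypothetical helper: remove anything that looks like a tag, keep the text."""
    # Naive pattern: "<", then any characters that are not ">", then ">".
    text = re.sub(r"<[^>]+>", "", markup)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(unescape(text).split())


snippet = ('<A HREF="http://enpass.in/algo/c12.html" CLASS="style"> '
           '<NOBR>Simulation for 1st & 2nd path</NOBR></A>')
link_text = strip_tags(snippet)
print(link_text)
```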