I'm trying to extract data from an HTML file. The code I'm using runs, but not the way I expect: I can get some elements but not all of them, and I'm wondering whether that has to do with the size of the file I'm reading.

I'm currently trying to parse the source of this webpage.

The page is 4500 lines long, so it is fairly large. I've been using it because I'd like to make sure the code works on large files.

The code I'm using is:

import lxml.html
import urllib2

# Download the page source (Python 2).
url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
webHTML = urllib2.urlopen(url).read()
# Parse the source into an lxml element tree.
tree = lxml.html.fromstring(webHTML)
productDetails = tree.get_element_by_id('productDetails')
# Iterating over an element yields its child elements.
for element in productDetails:
    print element.text_content()

This gives the expected output when I use an id near the top of the page, such as 'mm3', but with the id 'productDetails' I get no output, at least on my current setup.

alecxe
pri0ritize

1 Answer

I'm afraid lxml.html cannot parse this particular HTML source correctly. It parses the h3 tag with id="productDetails" as an empty element (and this is in the default "recover" mode):

<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
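
If you want to check this kind of thing on your own page, one option is to serialize the element that lxml actually built and look at how many children it has. Keep in mind that iterating over an lxml element yields its child elements, so an element parsed as empty makes the loop from the question print nothing. Below is a minimal sketch in Python 3 (not the Python 2 of the question); `SAMPLE_HTML` and `describe_element` are made-up names for illustration, and the sample markup is a cut-down stand-in, not the real page.

```python
import lxml.html

# Hypothetical, cut-down stand-in for the real page: the h3 is empty and
# the description text sits outside of it.
SAMPLE_HTML = (
    '<div>'
    '<h3 class="productDescription2" id="productDetails"></h3>'
    '<p>Description text that ended up outside the h3.</p>'
    '</div>'
)


def describe_element(markup, element_id):
    """Report how lxml.html parsed the element with the given id."""
    root = lxml.html.fromstring(markup)
    element = root.get_element_by_id(element_id)
    return {
        # Number of child elements; a "for child in element" loop runs this many times.
        "children": len(element),
        # Text directly inside the element before its first child (None if absent).
        "text": element.text,
        # The element as lxml sees it, without any trailing text.
        "markup": lxml.html.tostring(element, encoding="unicode", with_tail=False),
    }


info = describe_element(SAMPLE_HTML, "productDetails")
print(info["children"])
print(info["markup"])
```

If the child count is zero and the serialized markup is just an opening and closing tag, the parser has dropped or relocated the content, and no amount of looping over the element will recover it.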

Switch to BeautifulSoup with the html5lib parser (it is extremely lenient):

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')

for element in soup.find(id='productDetails').find_all():
    print element.text

Prints:

Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.

outrunner

...
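
For anyone on Python 3, the same approach can be wrapped in a small helper that works on markup you have already downloaded. This is only a sketch: `descendant_texts` and `SAMPLE` are hypothetical names, the sample markup is invented, and it assumes the `beautifulsoup4` and `html5lib` packages are installed.

```python
from bs4 import BeautifulSoup


def descendant_texts(markup, element_id, parser="html5lib"):
    """Return the text of every descendant tag of the element with this id."""
    soup = BeautifulSoup(markup, parser)
    container = soup.find(id=element_id)
    if container is None:
        # No element with that id: return an empty list instead of raising.
        return []
    # find_all() with no arguments returns all descendant tags, like the loop above.
    return [tag.get_text() for tag in container.find_all()]


# Invented sample markup, used only to exercise the helper.
SAMPLE = '<div id="productDetails"><span>First</span><span>Second</span></div>'
print(descendant_texts(SAMPLE, "productDetails"))
```

To run it against a live page, pass in the bytes returned by `urllib.request.urlopen(url).read()`, which is the Python 3 counterpart of `urllib2.urlopen`.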
alecxe
  • Thank you very much for the help! I'll go ahead and try and use the other answer. I didn't realize that an empty element was the default recover mode. I wish I had read a little deeper and known that prior to spending a few hours trying to solve it myself! – pri0ritize Dec 26 '14 at 07:33
  • @pri0ritize sure, thanks. FYI, I mentioned the `recover` mode just to point out that `lxml.html` uses it by default and there is no easy way to tell it to be more lenient. – alecxe Dec 26 '14 at 07:34
  • I understand completely. I just didn't catch that in the documentation. That's a huge help because I was seeing this empty element quite a bit and couldn't figure it out. – pri0ritize Dec 26 '14 at 07:35