html5lib makes BeautifulSoup miss an element

Question

Continuing my attempt to pull transcripts from the Presidential debates, I've now started using html5lib as a parser with BeautifulSoup.

But now, when I run my (previously working) code to find the element holding the actual transcript, it errors out because no such span is found.

Here's the code:

from bs4 import BeautifulSoup
import html5lib
import urllib

file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
soup = BeautifulSoup(file, "html5lib")
transcript = soup.find_all("span", class_="displaytext")[0]

And here's the error:

IndexError                                Traceback (most recent call last)
<ipython-input-5-2c227e8c4a25> in <module>()
  1 file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
  2 soup = BeautifulSoup(file, "html5lib")
----> 3 transcript = soup.find_all("span", class_="displaytext")[0]

IndexError: list index out of range

And here's the relevant part of the page I'm requesting, which shows that I'm not crazy: there is a span with class 'displaytext'.

 <span class="displaytext">
           <b>
            PARTICIPANTS:
           </b>
           <br/>
           Former Governor Jeb Bush (FL);

What am I missing? If I drop the "html5lib" argument from the BeautifulSoup call, it works fine (but I then get later errors caused by spurious tags that have no corresponding closing tag).
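One way to narrow this down is to run the same lookup through each available parser and compare the results. The sketch below is only an illustration: `SAMPLE` is a hand-written stand-in for the page fragment quoted above (not the live page), `count_displaytext` is a hypothetical helper name, and it assumes Python 3 with `beautifulsoup4` installed. Parsers that are not installed are reported as `None` instead of raising.

```python
# Stand-in markup modeled on the fragment quoted in the question (an assumption,
# not the real page, which may contain the malformed tags mentioned above).
SAMPLE = (
    '<html><body><span class="displaytext"><b>PARTICIPANTS:</b><br/>'
    'Former Governor Jeb Bush (FL);</span></body></html>'
)


def count_displaytext(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of matching spans, or None if unavailable."""
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        # bs4 itself is missing, so no parser can be tried.
        return {name: None for name in parsers}

    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This particular parser backend is not installed.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("span", class_="displaytext"))
    return counts


print(count_displaytext(SAMPLE))
```

Passing the real page's HTML instead of `SAMPLE` would show whether only the html5lib tree loses the span, which is the symptom described here.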

Can you print soup and see what you actually got? Because I ran your exact code and got what you are after without an error. — durdenk, May 05 '16 at 22:56
Can you add the details of your setup; OS being used, python distribution (i.e. is it anaconda or anything like that)? — bmcculley, May 06 '16 at 17:55
Executing the same exact code and not getting an error. What `html5lib` and `beautifulsoup4` versions are you using? Thanks. — alecxe, May 10 '16 at 02:01
I'll answer as best I can, but this is fairly new to me: I'm running on wakari.io for web-based IPython Notebooks. My environment is np18py27-1.9, and I pip installed both packages very recently. FWIW, I was able to get my code to work by NOT using `html5lib` and instead stripping the tag out at the file level: ```file = urllib.urlopen(debate_links[i]).read() file_stripped = file.replace('', '') file_stripped = file_stripped.replace('', '') soup = BeautifulSoup(file_stripped) ``` — ScottieB, May 17 '16 at 21:22
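The version question raised in the comments can be answered directly from the interpreter. This is a minimal sketch assuming Python 3.8 or later (the question itself uses Python 2.7, where `importlib.metadata` is not available); `installed_version` is a hypothetical helper name, and the distribution names `beautifulsoup4` and `html5lib` are the ones mentioned in the comments.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages the commenters asked about.
for name in ("beautifulsoup4", "html5lib"):
    print(name, installed_version(name))
```

Including these two values in the question would let others check whether the behavior depends on a particular combination of versions.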

