
I'm pulling text from the presidential debates. I got to one transcript that has an issue: the page erroneously turns every mention of the word "debate" into a <debate> tag. Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing?

Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case that trips me up, because <debate> becomes a child of a <p> and the closing </debate> is added all the way at the end, nesting the rest of the debate inside that tag.

How do I tell BeautifulSoup to either ignore or remove <debate>? Or alternatively, how do I add a closing tag immediately after? I've tried unwrap, but by the time I can call it, BS has already set up the closing tag at the end, and thus made following paragraphs children rather than siblings.

Here's how I'm set up:

from bs4 import BeautifulSoup
import urllib

bad_debate = 'http://www.presidency.ucsb.edu/ws/index.php?pid=111395'
file = urllib.urlopen(bad_debate)
soup = BeautifulSoup(file)

My hunch is that I need to insert something between the URL call and BeautifulSoup, but for the life of me I can't figure out how to modify the file contents.

ScottieB

1 Answer


The html5lib parser does a better job than lxml or html.parser of handling the debate element in this case:

soup = BeautifulSoup(file, "html5lib")

Here is how it handles the mentioned part of the debate:

<p>
    <b>
     BARTIROMO:
    </b>
    Welcome back to the Republican presidential
    <debate>
     here in North Charleston. Right back to the questions. [
     <i>
      applause
     </i>
     ]
    </debate>
</p>
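
If the nested <debate> element still gets in the way, another option is the asker's own hunch: clean the raw HTML before it reaches BeautifulSoup. The snippet below is only a minimal sketch of that idea. It assumes the page's source contains a literal <debate> tag in place of the word, as the question describes, and restore_debate_word is a hypothetical helper name, not part of BeautifulSoup.

```python
import re

# Hypothetical helper (not from the original post or from bs4).
# Assumption: the raw page contains a literal <debate> tag where the
# word "debate" should be, as described in the question.
_OPEN_TAG = re.compile(r"<debate\s*>", re.IGNORECASE)
_CLOSE_TAG = re.compile(r"</debate\s*>", re.IGNORECASE)


def restore_debate_word(html):
    """Replace each <debate> tag with the plain word and drop closing tags."""
    html = _CLOSE_TAG.sub("", html)
    return _OPEN_TAG.sub("debate", html)


# Small sample built from the transcript fragment quoted above.
sample = (
    "<p><b>BARTIROMO:</b> Welcome back to the Republican presidential "
    "<debate> here in North Charleston. Right back to the questions.</p>"
)
cleaned = restore_debate_word(sample)
print(cleaned)
```

With the real page, the same helper would presumably be applied to the string returned by file.read(), and the cleaned string passed to BeautifulSoup in place of the file object, so the parser never sees the stray tag.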
alecxe
  • It doesn't error out, but now lines that worked fine on other transcripts fail. E.g. transcript = soup.find_all("span", class_="displaytext")[0] raises an out-of-bounds error. When I run print soup.prettify() I can see the span I'm trying to grab, but find_all won't return it. – ScottieB May 04 '16 at 13:33
  • @ScottieB could you please create a separate question providing the code you have so far and describing the symptoms? Throw me a link here. Thanks. – alecxe May 04 '16 at 13:35
  • Thanks @alecxe for the suggestion, posted: http://stackoverflow.com/questions/37052097/html5lib-makes-beautifulsoup-miss-an-element – ScottieB May 05 '16 at 13:36