Questions tagged [html5lib]

html5lib is a library for parsing and serializing HTML documents and fragments in Python, with ports to Dart, PHP, and Ruby.

html5lib is an open-source HTML parser for Python, based on the HTML specification. There are ports for PHP and Ruby (both unmaintained), as well as a third-party one for Dart.
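
As a minimal sketch of the parse-and-serialize workflow described above (assuming the `html5lib` package is installed; the sample markup and variable names are invented for illustration), the parser can repair broken input into a plain `xml.etree.ElementTree` tree, which also covers the common "get the content of `body`" case:

```python
import html5lib

# Deliberately broken markup: no <html> or <body>, and the <p> tags are never closed.
broken = "<title>Demo</title><p>Hello<p>World"

# namespaceHTMLElements=False keeps tag names plain ("body") instead of
# prefixing them with the XHTML namespace, which makes find() calls simpler.
root = html5lib.parse(broken, treebuilder="etree", namespaceHTMLElements=False)

# The parser has added the missing <html>, <head> and <body> elements.
title = root.find("head/title").text
body = root.find("body")
paragraphs = [p.text for p in body.findall("p")]

# Serialize the repaired tree back to markup, keeping optional tags explicit.
markup = html5lib.serialize(root, tree="etree", omit_optional_tags=False)
print(title, paragraphs)
print(markup)
```

Other tree builders (for example `"lxml"` or `"dom"`) can be selected with the `treebuilder` argument, provided the corresponding library is available.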

107 questions
2
votes
1 answer

Skip sanitization for videos in html5lib

I am using a wmd-editor in Django, much like this one in which I am typing. I would like to allow the users to embed videos in it. For that I am using the Markdown video extension here. The problem is that I am also sanitizing user input using…
Ali
  • 1,256
  • 3
  • 15
  • 31
2
votes
2 answers

BeautifulSoup doesn't find correctly parsed elements

I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing. The HTML comes from this page: http://www.wvdnr.gov/ It contains multiple errors, like multiple , outside the…
Tags: python, html, beautifulsoup, html-parsing, html5lib
asked Nov 12 '14 at 21:06
Mikk
  • 804
  • 8
  • 23
2
votes
1 answer

Grabbing different elements with BeautifulSoup: avoid duplicating in nested elements

I want to grab different content (classes) from a locally saved website (the Python documentation) using BeautifulSoup4, so I use this code to do that (index.html is the saved website: https://docs.python.org/3/library/stdtypes.html) from bs4…
Tags: python, beautifulsoup, html5lib
asked Apr 19 '14 at 19:05
svenwildermann
  • 631
  • 6
  • 20
2
votes
1 answer

How to install html5lib-0.90 library for Python on Windows?

I'm using Windows, and trying to install the html5lib-0.90 library for Python: C:\>python C:\Users\Junior\Downloads\Python\html5lib-0.90\setup.py install Traceback (most recent call last): File "C:\Users\Junior\Downloads\Python\html5lib-0.90\setup.py",…
Tags: python, html5lib
asked Feb 17 '10 at 22:58
Junior Mayhé
  • 16,144
  • 26
  • 115
  • 161
2
votes
2 answers

using html5lib with xml.etree.ElementTree

I need a way to use the html5lib parser to generate a real xml.etree.ElementTree. (lxml is not an option for portability reasons.) ElementTree.parse can take a parser as an optional parameter xml.etree.ElementTree.parse(source, parser=None) but…
Tags: python, xhtml, elementtree, html5lib
asked Dec 26 '13 at 15:10
Arithmomaniac
  • 4,604
  • 3
  • 38
  • 58
2
votes
1 answer

Beautifulsoup functionality not working properly in specific scenario

I am trying to read in the following url using urllib2: http://frcwest.com/ and then search the data for the meta redirect. It reads the following data in: <!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0…
Tags: python, beautifulsoup, urllib2, html5lib
asked Apr 21 '13 at 17:57
bmiskie
  • 617
  • 8
  • 22
2
votes
1 answer

Pub install fail, Package "html5lib" doesn't have a pubspec.yaml file

I've created a simple project. This is my pubspec.yaml name: testapp description: test application dependencies: html5lib: 0.0.12 And now I get this error: Pub install fail, Resolving dependencies... Package "html5lib" doesn't have a…
Tags: dart, html5lib, dart-pub
asked Oct 17 '12 at 04:44
Juri Krainjukov
  • 732
  • 8
  • 27
2
votes
1 answer

html5lib with lxml treebuilder doesn't parse namespaces correctly

I'm trying to parse some HTML content with html5lib using the lxml treebuilder. Note: I'm using the requests library to grab the content and the content is HTML5 (tried with XHTML - same result). When I simply output the HTML source, it looks…
Tags: python, lxml, html5lib
asked Sep 03 '12 at 20:41
Alexei
  • 672
  • 1
  • 5
  • 13
2
votes
1 answer

Which revision of html5lib is stable?

html5lib notes that its latest release (0.11) is somewhat old. Using the Python portion, I have recursion problems as noted in Issue 70 and Issue 59 but can't find a recent Mercurial revision that is stable. The latest tip is no good, I got the…
Tags: python, html5lib
asked Jul 13 '09 at 22:42
Mat
  • 82,161
  • 34
  • 89
  • 109
1
vote
1 answer

What is going on with this html5lib script?

Trying to process a very simple html5 script and render it using html5lib import html5lib html = '''<!DOCTYPE html> <html lang="en"> <head> <title>Hi…
schwa
  • 11,962
  • 14
  • 43
  • 54
1
vote
1 answer

Xpath doesn't match

I'm trying to get some elements from a page. Unfortunately it results with an empty list. The pretty-printed tree includes this element: ... However when I do this on the same…
viraptor
  • 33,322
  • 10
  • 107
  • 191
1
vote
1 answer

How to parse HTML with source mapping?

I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup. For example, given the HTML markup (with \n EOL chars)
midrare
  • 2,371
  • 28
  • 48
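
For the source-mapping question above, one possible approach can be sketched with the standard library's `html.parser` rather than html5lib itself. This is a swapped-in technique: `html.parser` does not apply the HTML specification's error recovery, so the sketch assumes reasonably well-formed markup. The class name `ElementSpans` and the sample input are hypothetical.

```python
from html.parser import HTMLParser


class ElementSpans(HTMLParser):
    """Record (tag, start, end) character offsets of elements in the source.

    Assumes the whole document is fed in one call and that no attribute
    value contains a literal ">" character.
    """

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Absolute offset at which each line starts, used to convert the
        # (line, column) pairs reported by getpos() into flat offsets.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.open_tags = []  # stack of (tag, start_offset)
        self.spans = []      # (tag, start_offset, end_offset), end exclusive

    def _offset(self):
        line, col = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # getpos() points at the "<" of the closing tag; the element ends
        # just after the next ">".
        end = self.source.index(">", self._offset()) + 1
        # Close the nearest matching open tag, discarding unclosed children.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                start = self.open_tags[i][1]
                del self.open_tags[i:]
                self.spans.append((tag, start, end))
                break


source = "<div>\n  <p>Hi</p>\n</div>\n"
parser = ElementSpans(source)
parser.feed(source)
parser.close()
for tag, start, end in parser.spans:
    print(tag, start, end, repr(source[start:end]))
```

Slicing the original string with each recorded span returns the unmodified markup for that element, which is the property the question asks for.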
1
vote
2 answers

Scraping a table from website using python and trying to get the hyperlink of content with text

I am learning Python and trying to scrape a table from the https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html website. In this table you can see there are 4 columns: "CIN", "Company Name", "Roc" and "Status". As you can see…
1
vote
1 answer

How can I get the content of body element by using html5lib in Python?

How can I get the content of the body element by using html5lib in Python? Example input data: xxxyyy Expected output: xxxyyy It should work even if the HTML is broken (unclosed tags, ...).
sorin
  • 161,544
  • 178
  • 535
  • 806
1
vote
0 answers

Error running beautifulsoup (module 'html5lib.treebuilders' has no attribute '_base')

I am new to programming and Python. I am trying to install BeautifulSoup on Python 3 to learn web scraping (using Jupyter Notebooks as an IDE) for a MOOC. When I run from bs4 import BeautifulSoup I receive the following error: AttributeError …