Questions tagged [html5lib]

html5lib is a library for parsing and serializing HTML documents and fragments in Python, with ports to Dart, PHP, and Ruby.

html5lib is an open-source HTML parser for Python, based on the HTML specification. There are ports for PHP and Ruby (both unmaintained), as well as a third-party one for Dart.

107 questions
1
vote
1 answer

How to check which line from HTML triggers error?

I have the following code that removes duplicate paragraphs from an HTML file. from bs4 import BeautifulSoup fp = open("Input.html", "rb") soup = BeautifulSoup(fp, "html5lib") elms = [] for elem in soup.find_all('font'): if elem not in elms: …
Ger Cas
1
vote
1 answer

What exactly is a BS4 'element', how are elements counted, which parser gets to decide? Obviously confused

I am now confused by something I thought I understood, but turns out I've been taking for granted. One frequently encounters this type of for loop: from bs4 import BeautifulSoup as bs mystring = 'some string' soup = bs(mystring,'html.parser') for…
Jack Fleeting
1
vote
0 answers

Parsing html using lxml and html5lib, getting "TypeError: insertDoctype() takes exactly 4 arguments (2 given)"

I'm getting the error TypeError: insertDoctype() takes exactly 4 arguments (2 given) when using lxml and html5lib together. It seems that the insertDoctype method in lxml.html._html5builder.TreeBuilder (link) takes 4 args, while the html5lib code…
lmz
1
vote
2 answers

incompatible numpy and html5lib for tensorflow

tensorflow 1.7.0 has requirement numpy>=1.13.3, but you'll have numpy 1.11.0 which is incompatible. tensorboard 1.7.0 has requirement html5lib==0.9999999, but you'll have html5lib 0.999 which is incompatible. tensorboard 1.7.0 has requirement…
Sew
1
vote
1 answer

How to import from html5lib.sanitizer

I am trying to import module HTMLSanitizerMixin from module html5lib.sanitizer in Python. After searching the web, I see that in the update for html5lib they removed the sanitizer package, but I can't seem to get it now even when I try to import it…
1
vote
1 answer

Error using pip - module 'pip._vendor.html5lib' has no attribute 'parse'

This error popped up today while trying to install some packages using pip. Python version - 3.5.4 pip install pytesseract It gives the following exception: Collecting pytesseract Exception: Traceback (most recent call last): File…
msr
1
vote
2 answers

AttributeError: 'ResultSet' object has no attribute 'find_all' - pd.read_html

I am trying to extract the data from a table from a webpage, but keep receiving the above error. I have looked at the examples on this site, as well as others, but none deal directly with my problem. Please see code below: from bs4 import…
aLoHa
1
vote
0 answers

Needed: Example of replacing html5lib sanitizer

djangocms_text_ckeditor references the html5lib sanitizer function, which has been deprecated. I expect that there is a way to rewrite this code without sanitizer. from html5lib import…
RandO
1
vote
1 answer

Can anyone explain why I am getting this error [ImportError: lxml not found, please install it]

I am trying to use the .read_html() function in the pandas library and keep getting this error when I run the code in the shell. I saw that you need to install lxml, so I did that using apt-get. But afterwards when I tried to run it again I…
Mark
1
vote
0 answers

html5lib makes BeautifulSoup miss an element

Continuing my attempt to pull transcripts from the Presidential debates, I've now started using html5lib as a parser with BeautifulSoup. But now, when I run (previously working) code to find the element with the actual transcript, it errors out and…
ScottieB
1
vote
5 answers

html5lib/lxml examples for BeautifulSoup users?

I'm trying to wean myself from BeautifulSoup, which I love but which seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't seem to figure out how to use the "find" and "findall" operators. By looking at the docs…
Chris Curvey
1
vote
1 answer

How to remove namespace value from inside lxml.html.html5parser element tag

Is it possible to avoid adding a namespace to the tag when using html5parser from the lxml.html package? Example: from lxml import html print(html.parse('http://example.com').getroot().tag) # You will get 'html' from lxml.html import…
Renat
1
vote
2 answers

disable comments check for '--' in lxml

Use case: parsing https://www.banca-romaneasca.ro/en/tools-and-resources/ with lxml fails. ... /opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment self.tree.insertComment(token,…
Andrei.Danciuc
1
vote
1 answer

Make a Django model form safe

I am building a high performance API. I have been using Tastypie for ages and sometimes I just need more simplicity. For this API I have decided to use Django Simple Rest (https://github.com/croach/django-simple-rest). It provides the base of what…
Rich
1
vote
2 answers

Validate an HTML fragment using html5lib

I'm using Python and html5lib to check if a bit of HTML code entered on a form field is valid. I tried the following code to test a valid fragment but I'm getting an unexpected error (at least for me): >>> import html5lib >>> from html5lib.filters…
hvelarde