Questions tagged [html5lib]

html5lib is a library for parsing and serializing HTML documents and fragments in Python, with ports to Dart, PHP, and Ruby.

html5lib is an open-source HTML parser for Python, based on the HTML specification. There are ports for PHP and Ruby (both unmaintained), as well as a third-party one for Dart.
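
As a rough illustration of the parsing described above, here is a minimal sketch, assuming the `html5lib` package is installed. The sample markup and the variable names (`messy`, `root`, `paragraphs`, `xml_text`) are made up for this example; the XML output comes from the standard library's `xml.etree.ElementTree`, not from html5lib's own serializer.

```python
import xml.etree.ElementTree as ET

import html5lib

# Hypothetical, deliberately sloppy markup: unclosed <p> and <b> tags,
# and no explicit <html>, <head>, or <body> elements.
messy = "<title>Demo</title><p>First<p>Second <b>bold"

# Build an xml.etree.ElementTree tree; namespaceHTMLElements=False keeps
# tag names free of the XHTML namespace prefix.
root = html5lib.parse(messy, treebuilder="etree", namespaceHTMLElements=False)

# Collect the text of every <p> element in the repaired tree.
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
print(paragraphs)

# Dump the repaired tree as XML using the standard library.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The parser fills in the missing structure the way a browser would, so the result is a well-formed tree even when the input is not.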

107 questions

vote

1 answer

parse any HTML to XML using html5lib

I need to tidy up HTML pages and convert them to XML in Python; losing some "bad" parts if needed. I used TagSoup for some time, but it doesn't understand new "article", "footer" tags, and doesn't like "meta" when they are not in the head; making…

python xml html5lib

asked Nov 03 '14 at 15:27

alex29

vote

2 answers

bypassing a specific HTML sanitization in html5lib / bleach

I'm using bleach, which uses html5lib, to clean user-generated content consisting of HTML fragments designed as dust.js templates. Everything has worked fine, except for this situation. Input: {#loop} …

python html5lib

asked May 07 '14 at 22:00

Jonathan Vanasco

vote

1 answer

Beautifulsoup lost nodes

I am using Python and Beautiful Soup to parse HTML data and get p tags out of RSS feeds. However, some URLs cause problems because the parsed soup object does not include all nodes of the document. For example, I tried to parse…

python beautifulsoup html5lib

asked May 01 '13 at 10:53

Martin Golpashin

votes

2 answers

How to parse HTML tables using html5lib and Beautiful Soup in Jupyter?

I'm getting a ValueError when trying to parse a page with BeautifulSoup and html5lib in Jupyter: import pandas as pd import requests import html5lib url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries" r =…

python parsing beautifulsoup jupyter html5lib

asked May 23 '23 at 21:21

Eugene

votes

2 answers

How can I get data from a div class using BeautifulSoup

I am learning BS4. I parsed some div classes, but I want to get the data inside the div. [ …

python-3.x beautifulsoup lxml html5lib

asked Nov 01 '22 at 14:31

Baran Üyükuş

votes

0 answers

BeautifulSoup Returning Empty Brackets

from bs4 import BeautifulSoup import requests import html5lib url = 'https://twitter.com/st3phensparkman' result = requests.get(url) doc = BeautifulSoup(result.text, 'html5lib') followers = doc.find_all(text='Followers') print(followers) For…

parsing web beautifulsoup screen-scraping html5lib

asked Mar 24 '22 at 23:00

Stephen

votes

0 answers

missing in Selenium Python page_source

I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a tag. The unit test checks with…

python django selenium html5lib

asked Aug 25 '21 at 14:41

Deepstop

votes

2 answers

xml.etree.ElementTree: How to replace like "innerHTML"?

I want to replace the

tag of a html page. But the content of the heading can be HTML (not just a string). I want to insert foo bold bar input: start

bar italic

end Desired output: start

python html-parsing html5lib

asked Jul 09 '21 at 10:13

guettli

votes

1 answer

How to replace the innerHTML of all tags with html5lib?

How to replace the innerHTML of all tags with html5lib? input: foo Moonlight bar Desired output: foo Sunshine bar I would like to use html5lib, since it is already a dependency.

python html-parsing html5lib

asked Jul 09 '21 at 08:29
guettli

votes

1 answer

"ValueError: No tables found matching regex '.+'" at random times when scraping large amounts of data

This is my first project with pandas and Selenium, so I may be making a dumb mistake. I've written this function to go through a list of NBA players and scrape their game logs into data frames. It all works well but occasionally when I'm going…

python pandas selenium-chromedriver lxml html5lib

asked Feb 09 '21 at 21:22

Arslan Amir

votes

1 answer

Glitch in html5lib?

I'm getting this error. Is it a bug or is it a code error? What does it mean? Traceback (most recent call last): File "isc.py", line 8, in import requests, os, sys, bs4 File…

python html5lib

asked Dec 17 '20 at 14:50

Mayank

votes

3 answers

How to scrape the different content with the same html attributes and values?

I'm able to scrape a bunch of data from a webpage, but I'm struggling with extracting the specific content from subsections that have the exact same attributes and values. Here is the html:

     Relationship Issues
     …

python html web-scraping beautifulsoup html5lib

asked Oct 23 '20 at 05:21

Tom

votes

2 answers

Scraping multiple URLs using BeautifulSoup

I am trying to scrape a website, but I was unable to complete the code so that I could insert several URLs at once. Currently the code works with one URL at a time. The current code is: import requests from bs4 import…

python beautifulsoup html5lib

asked Sep 18 '20 at 14:46

Sergio Curitiba

votes

1 answer

I am trying to click on expand button and then scrape the table

I am scraping a website table from https://csr.gov.in/companyprofile.php?year=FY+2015-16&CIN=L00000CH1990PLC010573 but I am not getting the exact result I am looking for. I want 11 columns from this link, "company name", "Class", "State", "Company…

python selenium-webdriver beautifulsoup html5lib

asked Jul 17 '20 at 16:48

Sumit Jha

votes

1 answer

Remove a bad tag completely with html5lib.sanitizer

I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs. The problem is I want to remove bad tags completely and not just escape them (which seems like a bad idea anyway). The workaround suggested in the patch here doesn't…

python tokenize html-sanitizing html5lib sanitizer

asked May 17 '11 at 14:35

letoosh
