Questions tagged [html5lib]

html5lib is a library for parsing and serializing HTML documents and fragments in Python, with ports to Dart, PHP, and Ruby.

html5lib is an open-source HTML parser for Python, based on the HTML specification. There are ports for PHP and Ruby (both unmaintained), as well as a third-party one for Dart.
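
As a rough illustration of the "parsing and serializing" described above, here is a minimal sketch. It assumes html5lib is installed and uses its default `etree` tree builder; the input string and the variable names are made up for the example.

```python
import html5lib

# Namespace that html5lib assigns to HTML elements by default.
XHTML = {"h": "http://www.w3.org/1999/xhtml"}

# Deliberately sloppy input: no <html>/<body>, unclosed <b> and <p>.
broken = "<p>Hello <b>world"

# parse() builds a full document tree, adding the missing elements.
root = html5lib.parse(broken)  # an xml.etree.ElementTree element

paragraph = root.find("h:body/h:p", XHTML)
bold = root.find("h:body/h:p/h:b", XHTML)

# serialize() turns the tree back into markup; keep optional tags
# so the repaired structure is visible in the output.
html = html5lib.serialize(root, omit_optional_tags=False)
print(html)
```

For fragments rather than whole documents, `html5lib.parseFragment` is the counterpart to `parse`.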

107 questions
1
vote
1 answer

parse any HTML to XML using html5lib

I need to tidy up HTML pages and convert them to XML in Python, losing some "bad" parts if needed. I used TagSoup for some time, but it doesn't understand the new "article" and "footer" tags, and doesn't like "meta" tags when they are not in the head; making…
alex29
  • 73
  • 7
1
vote
2 answers

bypassing a specific HTML sanitization in html5lib / bleach

I'm using bleach, which uses html5lib, to clean user-generated content consisting of HTML fragments designed as dust.js templates. Everything has worked fine except for this situation. Input: {#loop} …
Jonathan Vanasco
  • 15,111
  • 10
  • 48
  • 72
1
vote
1 answer

Beautifulsoup lost nodes

I am using Python and Beautiful Soup to parse HTML data and get p tags out of RSS feeds. However, some URLs cause problems because the parsed soup object does not include all nodes of the document. For example I tried to parse…
Martin Golpashin
  • 1,032
  • 9
  • 28
0
votes
2 answers

How to parse HTML tables using html5lib and Beautiful Soup in Jupyter?

I'm getting a ValueError when trying to parse a page with BeautifulSoup and html5lib in Jupyter: import pandas as pd import requests import html5lib url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries" r =…
Eugene
  • 13
  • 6
0
votes
2 answers

How can I get data from a div class using BeautifulSoup

I am learning BS4. I parsed some div class. But I want to get data in div code. ` [
0
votes
0 answers

BeautifulSoup Returning Empty Brackets

from bs4 import BeautifulSoup import requests import html5lib url = 'https://twitter.com/st3phensparkman' result = requests.get(url) doc = BeautifulSoup(result.text, 'html5lib') followers = doc.find_all(text='Followers') print(followers) For…
0
votes
0 answers

missing in Selenium Python page_source

I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a tag. The unit test checks with…
Deepstop
  • 3,627
  • 2
  • 8
  • 21
0
votes
2 answers

xml.etree.ElementTree: How to replace like "innerHTML"?

I want to replace the tag of an HTML page. But the content of the heading can be HTML (not just a string). I want to insert foo bold bar input: start bar italic end Desired output: start
guettli
  • 25,042
  • 81
  • 346
  • 663
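
For the question above, one possible approach with only the standard library is sketched below. The `h1` element, the `set_inner_xml` helper name, and the sample markup are assumptions for illustration (the tags in the excerpt did not survive), and the replacement markup must be well-formed XML for this to work.

```python
import xml.etree.ElementTree as ET


def set_inner_xml(element, markup):
    """Replace the text and children of `element` with parsed `markup`."""
    # Wrap the fragment so mixed text and elements parse as one XML tree.
    wrapper = ET.fromstring(f"<wrapper>{markup}</wrapper>")
    tail = element.tail            # text after the element must survive
    attrib = dict(element.attrib)  # so must its attributes
    element.clear()                # drops children, text, tail, attributes
    element.attrib.update(attrib)
    element.tail = tail
    element.text = wrapper.text    # leading text of the fragment
    element.extend(list(wrapper))  # child elements, with their tails


# Hypothetical input: a heading with mixed content between two text runs.
root = ET.fromstring("<div>start<h1>bar <i>italic</i></h1>end</div>")
set_inner_xml(root.find("h1"), "foo <b>bold</b> bar")
result = ET.tostring(root, encoding="unicode")
print(result)  # <div>start<h1>foo <b>bold</b> bar</h1>end</div>
```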
0
votes
1 answer

How to replace the innerHTML of all tags with html5lib?

How to replace the innerHTML of all tags with html5lib? input: foo Moonlight bar Desired output: foo Sunshine bar I would like to use html5lib, since it is already a dependency.
guettli
  • 25,042
  • 81
  • 346
  • 663

0
votes
1 answer

"ValueError: No tables found matching regex '.+'" at random times when scraping large amounts of data

This is my first project with pandas and Selenium, so I may be making a dumb mistake. I've written this function to go through a list of NBA players and scrape their game logs into data frames. It all works well but occasionally when I'm going…
0
votes
1 answer

Glitch in html5lib?

I'm getting this error. Is it a bug or is it a code error? What does it mean? Traceback (most recent call last): File "isc.py", line 8, in import requests, os, sys, bs4 File…
Mayank
  • 1
  • 5
0
votes
3 answers

How to scrape the different content with the same html attributes and values?

I'm able to scrape a bunch of data from a webpage, but I'm struggling with extracting the specific content from subsections that have the exact same attributes and values. Here is the html:
  • Relationship Issues …
Tom
  • 196
  • 1
  • 10
0
votes
2 answers

Scraping multiple URLs using BeautifulSoup

I am trying to scrape a website; however, I was unable to complete the code so that I could insert several URLs at once. Currently the code is functional with one URL at a time. The current code is: import requests from bs4 import…
0
votes
1 answer

I am trying to click on expand button and then scrape the table

I am scraping a website table from https://csr.gov.in/companyprofile.php?year=FY+2015-16&CIN=L00000CH1990PLC010573 but I am not getting the exact result I am looking for. I want 11 columns from this link, "company name", "Class", "State", "Company…
0
votes
1 answer

Remove a bad tag completely with html5lib.sanitizer

I'm trying to use html5lib.sanitizer to clean user input as suggested in the docs. The problem is I want to remove bad tags completely and not just escape them (which seems like a bad idea anyway). The workaround suggested in the patch here doesn't…
letoosh
  • 511
  • 2
  • 6
  • 13