Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically, for example to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
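
To illustrate what "converting a document to a tree" can look like, here is a minimal sketch using only Python's standard-library `html.parser` module. It is not the WHATWG algorithm and does not recover from malformed markup the way browsers do. The `Node`, `TreeBuilder`, `parse_html`, and `find_all` names are hypothetical helpers introduced for this example, not part of any library.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical, very small stand-in for a DOM element."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the events that HTMLParser emits."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every node named `tag` at or below `node`, in document order."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found
```

A possible usage, with made-up markup: `find_all(parse_html('<div><a href="/x">link</a></div>'), "a")` returns a list holding the single `a` node, whose `attrs["href"]` is `"/x"`. For real-world pages, a parser that implements the standard algorithm is usually the safer choice.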

5960 questions
46
votes
8 answers

Read a HTML file into a string variable in memory

If I have an HTML file on disk, how can I read it all at once into a String variable at run time? Then I need to do some processing on that string variable. Some HTML file like this:
Bohn
  • 26,091
  • 61
  • 167
  • 254
44
votes
11 answers

Automatically convert Style Sheets to inline style

Don't have to worry about linked style or hover style. I want to automatically convert files like this

...

to files like…
700 Software
  • 85,281
  • 83
  • 234
  • 341
44
votes
2 answers

Difference between "findAll" and "find_all" in BeautifulSoup

I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup. It is said that the function find_all is the same as findAll. I've tried both of them, but I believe they are different: import urllib, urllib2,…
Oberon
  • 628
  • 1
  • 6
  • 12
43
votes
1 answer

xpath find node that does not contain child

I'm trying to create some xpath that will find all a tags that do not contain img tags, so that something such as link matches, but does not. Of…
Ben K.
  • 1,779
  • 5
  • 18
  • 21
41
votes
1 answer

TagSoup vs. Jsoup vs. HTML Parser vs. HotSax vs

The abundance of HTML parsers to choose from (and stick with) is mind boggling: http://java-source.net/open-source/html-parsers How do I choose one that best suits the following requirements: Mature (fewer bugs than the rest) Live and breathing…
Regex Rookie
  • 10,432
  • 15
  • 54
  • 88
41
votes
3 answers

parse html inside ng-bind using angularJS

I'm having an issue with AngularJS. My application requests some data from the server, and one of the values in the data returned from the server is a string of HTML. I am binding it in my Angular template like this…
Subtubes
  • 15,851
  • 22
  • 70
  • 105
40
votes
5 answers

Web scraping in PHP

I'm looking for a way to make a small preview of another page from a URL given by the user in PHP. I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is…
federico-t
  • 12,014
  • 19
  • 67
  • 111
40
votes
6 answers

Is there a built-in HTML validator in any major browser?

In Firefox, there's an extension called “Html Validator”. It adds a little indicator icon at the bottom right corner of your window. When a page you visit isn't valid, it lights up. You can click on it to see the errors. The really important feature…
Xah Lee
  • 16,755
  • 9
  • 37
  • 43
40
votes
4 answers

How to get node value / innerHTML with XPath?

I have an XPath to select a class I want: //div[@class='myclass']. But it returns the whole div (including the div tag itself), and I would like to return only the contents of this tag without the tag itself. How can I do it?
Tom Smykowski
  • 25,487
  • 54
  • 159
  • 236
38
votes
6 answers

How to parse malformed HTML in python, using standard libraries

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing. I've found plenty of great third-party libraries for this task, but this question is about the Python standard…
bukzor
  • 37,539
  • 11
  • 77
  • 111
38
votes
2 answers

beautiful soup getting tag.id

I'm attempting to get a list of div ids from a page. When I print out the attributes, I get the ids listed. for tag in soup.find_all(class_="bookmark blurb group") : print(tag.attrs) results in: {'id': 'bookmark_8199633', 'role': 'article',…
klreeher
  • 1,391
  • 2
  • 15
  • 27
37
votes
8 answers

How to parse an HTML string in Google Apps Script without using XmlService?

I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it. The main idea is to use: var html =…
37
votes
10 answers

What is the best practice for parsing remote content with jQuery?

Following a jQuery ajax call to retrieve an entire XHTML document, what is the best way to select specific elements from the resulting string? Perhaps there is a library or plugin that solves this issue? jQuery can only select XHTML elements that…
slypete
  • 5,538
  • 11
  • 47
  • 64
36
votes
6 answers

Writing an HTML Parser

I am currently attempting (or planning to attempt) to write a simple (as possible) program to parse an html document into a tree. After googling I have found many answers saying "don't do it it's been done" (or words to that effect); and references…
James
  • 2,483
  • 2
  • 24
  • 31
36
votes
11 answers

Cleaning HTML by removing extra/redundant formatting tags

I have been using the CKEditor WYSIWYG editor for a website where users are allowed to use the HTML editor to add some comments. I ended up having some extremely redundant nested HTML code in my database that is slowing down the viewing/editing of these…
Aziz
  • 20,065
  • 8
  • 63
  • 69