Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically — e.g., in order to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
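
As a rough illustration of turning markup into a tree, here is a minimal sketch built on Python's standard-library `html.parser`. The `Node` class, the `TreeBuilder` class, and the `parse` helper are hypothetical names made up for this example. This is not the algorithm from the HTML specification and it does not produce a real DOM. It only shows the general idea of a tree with parent and child links.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (void elements).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node; not the real DOM interface."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simplified tree from the events HTMLParser emits."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root  # the innermost element still open

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Find the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse("<ul><li class='a'>one</li><li>two<br>more</li></ul>")
ul = root.children[0]
print([child.tag for child in ul.children])
```

Unlike the specification's algorithm, this sketch applies none of the standard error-recovery rules, so for real documents a spec-compliant parser is the safer choice.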


5960 questions
84
votes
6 answers

BeautifulSoup findAll() given multiple classes?

I would like to scrape a list of items from a website, and preserve the order that they are presented in. These items are organized in a table, but they can be one of two different classes (in random order). Is there any way to provide multiple…
sebo
  • 1,584
  • 4
  • 16
  • 19
83
votes
8 answers

Extracting an information from web page by machine learning

I would like to extract a specific type of information from web pages in Python. Let's say a postal address. It has thousands of forms, but still, it is somehow recognizable. As there is a large number of forms, it would probably be very difficult to…
Honza Javorek
  • 8,566
  • 8
  • 47
  • 66
74
votes
7 answers

HTML5: W3C vs WHATWG. Which gives the most authoritative spec?

I'm halfway through an HTML parser and found that HTML5 explicitly defines the rules of thumb for parsing ill-formed HTML. (And I used to infer them from DTDs, sigh.) I love that fact, but I know well that HTML5 isn't finalized yet (also I wonder if it…
ZJR
  • 9,308
  • 5
  • 31
  • 38
69
votes
29 answers

Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries? When answering: Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things. For the…
Chas. Owens
  • 64,182
  • 22
  • 135
  • 226
69
votes
4 answers

jquery-like HTML parsing in Python?

Is there any Python library that allows me to parse an HTML document similar to what jQuery does? I.e., I'd like to be able to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc. The only…
Roy Tang
  • 5,643
  • 9
  • 44
  • 74
64
votes
10 answers

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments. What's a generic way…
kefeizhou
  • 6,234
  • 10
  • 42
  • 55
63
votes
5 answers

HTML Agility pack - parsing tables

I want to use the HTML agility pack to parse tables from complex web pages, but I am somehow lost in the object model. I looked at the link example, but did not find any table data this way. Can I use XPath to get the tables? I am basically lost…
weismat
  • 7,195
  • 3
  • 43
  • 58
62
votes
7 answers

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn…
Monika Sulik
  • 16,498
  • 15
  • 50
  • 52
58
votes
4 answers

How can I get at the matches when using preg_replace in PHP?

I am trying to grab the capital letters of a couple of words and wrap them in span tags. I am using preg_replace for extract and wrapping purposes, but it's not outputting anything. preg_replace("/[A-Z]/", "$1", $str)
Polsonby
  • 22,825
  • 19
  • 59
  • 74
56
votes
0 answers

How to parse HTML with PHP?

Possible Duplicate: How to parse and process HTML with PHP? Suggestion for a reference question. Stack Overflow has dozens of "How to parse HTML" questions coming in every day. However, it is very difficult to close as a duplicate because most…
Pekka
  • 442,112
  • 142
  • 972
  • 1,088
53
votes
4 answers

Web Scraping With Haskell

What is the current state of libraries for scraping websites with Haskell? I'm trying to make myself do more of my quick one-off tasks in Haskell, in order to help increase my comfort level with the language. In Python, I tend to use the excellent…
ricree
  • 35,626
  • 13
  • 36
  • 27
51
votes
8 answers

What is parsing?

Parsing is something I come across a lot in development, but as a junior it is one of those things I assume I will get the hang of at some point, when it is needed. In my current project I've been told to find and use an HTML parser for a certain…
Grace
  • 2,548
  • 5
  • 26
  • 23
46
votes
7 answers

HTML Text with tags to formatted text in an Excel cell

Is there a way to take HTML and import it to excel so that it is formatted as rich text (preferably by using VBA)? Basically, when I paste to an Excel cell, I'm looking to turn this:

This is a test. Will this text be bold or…

Kevin McGovern
  • 591
  • 1
  • 8
  • 25
46
votes
4 answers

How can I use the python HTMLParser library to extract data from a specific div tag?

I am trying to get a value out of an HTML page using the Python HTMLParser library. The value I want to get hold of is within this HTML element: ...
20
... This is my HTMLParser class so far: class…
Martin
  • 10,294
  • 11
  • 63
  • 83
46
votes
5 answers

How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read each character in turn, building up a multidimensional array to store the structure? For…
alex
  • 479,566
  • 201
  • 878
  • 984