Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically — e.g., in order to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
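
To illustrate the idea of turning serialized HTML into a tree you can query, here is a minimal sketch that assumes only Python's standard-library `html.parser` module. The names `Node`, `TreeBuilder`, and `parse` are hypothetical and exist only for this example. It does not implement the error-recovery rules of the standard algorithm, so badly nested markup will not produce the tree a browser would build.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """A hypothetical, minimal tree node (not a real DOM interface)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects and text strings, in document order

    def text(self):
        """Concatenate all text found beneath this node."""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)

    def find_all(self, tag):
        """Yield descendant nodes with the given tag name, in document order."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the start-tag, end-tag, and text events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


doc = parse('<ul><li class="a">one</li><li class="b">two<br></li></ul>')
items = [(li.attrs.get("class"), li.text()) for li in doc.find_all("li")]
print(items)  # [('a', 'one'), ('b', 'two')]
```

For real-world pages, a library that follows the standard algorithm is usually a safer choice than a hand-rolled tree builder like this one.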

BeautifulSoup findAll() given multiple classes?

I would like to scrape a list of items from a website, and preserve the order that they are presented in. These items are organized in a table, but they can be one of two different classes (in random order). Is there any way to provide multiple…

python html beautifulsoup html-parsing

asked Sep 10 '13 at 17:53

sebo

1,584
4
16
19

votes

8 answers

Extracting an information from web page by machine learning

I would like to extract a specific type of information from web pages in Python. Let's say postal address. It has thousands of forms, but still, it is somehow recognizable. As there is a large number of forms, it would be probably very difficult to…

python machine-learning html-parsing web-scraping extract

asked Nov 11 '12 at 23:27

Honza Javorek

8,566
8
47
66

votes

7 answers

HTML5: W3C vs WHATWG. Which gives the most authoritative spec?

I'm halfway through an HTML parser and found that HTML5 explicitly defines the rules of thumb for parsing ill-formed HTML. (And I used to infer them from DTDs, sigh.) I love that fact, but I know well that HTML5 isn't finalized yet (also I wonder if it…

html html-parsing w3c

asked Jul 26 '11 at 05:38

ZJR

9,308
5
31
38

votes

29 answers

Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries? When answering: Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things. For the…

html language-agnostic html-parsing

asked Apr 21 '09 at 15:55

Chas. Owens

64,182
22
135
226

votes

4 answers

jquery-like HTML parsing in Python?

Is there any Python library that allows me to parse an HTML document similar to what jQuery does? i.e. I'd like to be able to use CSS selectors syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc. The only…

python jquery css-selectors html-parsing

asked Jun 16 '10 at 07:12

Roy Tang

5,643
9
44
74

votes

10 answers

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments. What's a generic way…

python web-scraping html-parsing html

asked Jan 12 '11 at 17:46

kefeizhou

6,234
10
42
55

votes

5 answers

HTML Agility pack - parsing tables

I want to use the HTML agility pack to parse tables from complex web pages, but I am somehow lost in the object model. I looked at the link example, but did not find any table data this way. Can I use XPath to get the tables? I am basically lost…

c# html html-parsing html-agility-pack

asked Mar 17 '09 at 19:00

weismat

7,195
3
43
58

votes

7 answers

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn…

python beautifulsoup html-parsing lxml

asked Dec 17 '09 at 14:08

Monika Sulik

16,498
15
50
52

votes

4 answers

How can I get at the matches when using preg_replace in PHP?

I am trying to grab the capital letters of a couple of words and wrap them in span tags. I am using preg_replace for extract and wrapping purposes, but it's not outputting anything. preg_replace("/[A-Z]/", "$1", $str)

php regex html-parsing preg-replace

asked Aug 05 '08 at 00:35

Polsonby

22,825
19
59
74

votes

0 answers

How to parse HTML with PHP?

Possible Duplicate: How to parse and process HTML with PHP? Suggestion for a reference question. Stack Overflow has dozens of "How to parse HTML" questions coming in every day. However, it is very difficult to close as a duplicate because most…

php html regex html-parsing

asked Sep 06 '10 at 08:51

Pekka

442,112
142
972
1,088

votes

4 answers

Web Scraping With Haskell

What is the current state of libraries for scraping websites with Haskell? I'm trying to make myself do more of my quick one-off tasks in Haskell, in order to help increase my comfort level with the language. In Python, I tend to use the excellent…

haskell html-parsing web-scraping

asked Jan 29 '11 at 17:02

ricree

35,626
13
36
27

votes

8 answers

What is parsing?

Parsing is something I come across a lot in development, but as a junior it is one of those things I assume I will get the hang of at some point, when it is needed. In my current project I've been told to find and use an HTML parser for a certain…

c# parsing html-parsing

asked Nov 24 '09 at 09:02

Grace

2,548
5
26
23

votes

7 answers

HTML Text with tags to formatted text in an Excel cell

Is there a way to take HTML and import it to excel so that it is formatted as rich text (preferably by using VBA)? Basically, when I paste to an Excel cell, I'm looking to turn this:

This is a test. Will this text be bold or…

vba excel html-parsing

asked Apr 03 '12 at 19:06

Kevin McGovern

votes

4 answers

How can I use the python HTMLParser library to extract data from a specific div tag?

I am trying to get a value out of a HTML page using the python HTMLParser library. The value I want to get hold of is within this HTML element: ...

... This is my HTMLParser class so far: class…

python html parsing html-parsing

asked Jul 18 '10 at 15:06

Martin

10,294
11
63
83

votes

5 answers

How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read every character one by one, building up a multi-dimensional array to store the structure? For…

html browser parsing html-parsing tokenize

asked Jun 30 '10 at 14:36

alex

479,566
201
878
984
