Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically, for example to extract data from it. It typically involves converting the document into a tree-based Document Object Model (DOM).

The HTML specification defines a standard parsing algorithm, which is implemented in all major browsers: https://html.spec.whatwg.org/multipage/parsing.html#parsing
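As a rough illustration of "serialization in, tree out", here is a minimal sketch using only Python's standard `html.parser` module. It is not the WHATWG algorithm referenced above and does none of its error recovery; the `Node`, `TreeBuilder`, `parse`, and `find_all` names are hypothetical helpers made up for this example, and the void-element list is a simplification.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so the builder must not descend into them.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node: a tag, its attributes, and its children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simple tree from the tokenizer events of html.parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # later content becomes a child of this element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.current.children.append(data)


def parse(html):
    """Parse an HTML string and return the root of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return all descendant elements named `tag`, in document order."""
    found = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                found.append(child)
            found.extend(find_all(child, tag))
    return found


doc = parse('<html><body><p>Hello <a href="/x">link</a><br>bye</p>'
            '<a href="/y"><img src="i.png"></a></body></html>')
hrefs = [a.attrs.get("href") for a in find_all(doc, "a")]
print(hrefs)
```

For real-world pages, a parser that follows the specification's algorithm is usually a safer choice than a hand-rolled tree builder like this one.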

Read an HTML file into a string variable in memory

If I have an HTML file on disk, how can I read it all at once into a string variable at run time? Then I need to do some processing on that string variable. Some HTML file like this:

c# html file-io html-parsing

asked Aug 29 '12 at 18:05

Bohn

11 answers

Automatically convert Style Sheets to inline style

Don't have to worry about linked style or hover style. I want to automatically convert files like this ... to files like…

java css html-parsing

asked Dec 23 '10 at 18:46

700 Software

2 answers

Difference between "findAll" and "find_all" in BeautifulSoup

I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup. It is said that the function find_all is the same as findAll. I've tried both of them, but I believe they are different: import urllib, urllib2,…

python xml-parsing html-parsing beautifulsoup

asked Sep 09 '12 at 13:08

Oberon

1 answer

xpath find node that does not contain child

I'm trying to create some XPath that will find all a tags that do not contain img tags, so that a link containing only text matches, but a link wrapping an img does not. Of…

xpath html-parsing xml-parsing

asked Mar 28 '11 at 19:48

Ben K.

1 answer

TagSoup vs. Jsoup vs. HTML Parser vs. HotSax vs

The abundance of HTML parsers to choose from (and stick with) is mind boggling: http://java-source.net/open-source/html-parsers How do I choose one that best suits the following requirements: Mature (fewer bugs than the rest) Live and breathing…

java android html-parsing

asked Mar 03 '11 at 16:45

Regex Rookie

3 answers

Parse HTML inside ng-bind using AngularJS

I'm having an issue with AngularJS. My application requests some data from the server, and one of the values returned is a string of HTML. I am binding it in my Angular template like this…

javascript angularjs html-parsing

asked Feb 15 '13 at 05:31

Subtubes

5 answers

Web scraping in PHP

I'm looking for a way to make a small preview of another page from a URL given by the user in PHP. I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is…

php html curl html-parsing web-scraping

asked Mar 21 '12 at 21:39

federico-t

6 answers

Is there a built-in HTML validator in any major browser?

In Firefox, there's an extension called “Html Validator”. It adds a little indicator icon at the bottom right corner of your window. When a page you visit isn't valid, it lights up. You can click on it to see the errors. The really important feature…

html html-parsing

asked Apr 10 '11 at 22:13

Xah Lee

4 answers

How to get node value / innerHTML with XPath?

I have an XPath to select a class I want: //div[@class='myclass']. But it returns the whole div (including the div tag itself), and I would like to return only the contents of this tag without the tag itself. How can I do it?

xml parsing xpath html-parsing

asked Jun 05 '12 at 13:16

Tom Smykowski

6 answers

How to parse malformed HTML in python, using standard libraries

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing. I've found plenty of great third-party libraries for this task, but this question is about the Python standard…

python html dom parsing html-parsing

asked Apr 20 '10 at 16:29

bukzor

2 answers

beautiful soup getting tag.id

I'm attempting to get a list of div ids from a page. When I print out the attributes, I get the ids listed. for tag in soup.find_all(class_="bookmark blurb group") : print(tag.attrs) results in: {'id': 'bookmark_8199633', 'role': 'article',…

python html beautifulsoup html-parsing

asked Jul 25 '14 at 18:55

klreeher

8 answers

How to parse an HTML string in Google Apps Script without using XmlService?

I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it. The main idea is to use: var html =…

javascript parsing google-apps-script google-sheets html-parsing

asked Nov 24 '15 at 11:59

user3347814

10 answers

What is the best practice for parsing remote content with jQuery?

Following a jQuery ajax call to retrieve an entire XHTML document, what is the best way to select specific elements from the resulting string? Perhaps there is a library or plugin that solves this issue? jQuery can only select XHTML elements that…

jquery html-parsing

asked Jun 23 '09 at 20:10

slypete

6 answers

Writing an HTML Parser

I am currently attempting (or planning to attempt) to write a simple (as possible) program to parse an html document into a tree. After googling I have found many answers saying "don't do it it's been done" (or words to that effect); and references…

html parsing html-parsing

asked Aug 25 '11 at 14:26

James

11 answers

Cleaning HTML by removing extra/redundant formatting tags

I have been using CKEditor wysiwyg editor for a website where users are allowed to use the HTML editor to add some comments. I ended up having some extremely redundant nested HTML code in my database that is slowing down the viewing/editing of these…

php html dom html-parsing bbcode

asked Apr 20 '12 at 14:26

Aziz
