Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically — e.g., in order to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
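As a rough illustration of turning markup into a tree you can query, here is a minimal sketch using only Python's standard-library `html.parser`. It is not the WHATWG algorithm and does not reproduce browser error recovery; the names `Node`, `TreeBuilder`, `parse`, and `find_all` are made up for this example, and stray end tags are simply ignored here, which is an assumption of the sketch rather than spec behavior.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class Node:
    """A very small stand-in for a DOM element (hypothetical, for illustration)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from the parser's tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent
        # Otherwise the end tag is stray; this sketch just ignores it.

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    """Parse an HTML string and return the root Node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return all descendants of node with the given tag name, in document order."""
    found = []
    for child in node.children:
        if child.tag == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found
```

For example, `find_all(parse(page_source), "a")` would return the link nodes, and each node's `attrs["href"]` and `parent` can then be inspected. For real-world pages, a spec-following parser or one of the libraries mentioned in the questions below is usually the safer choice.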

DOMDocument in php

I have just started reading documentation and examples about DOM, in order to crawl and parse the document. For example I have part of document shown below:

…

php xml-parsing html-parsing domdocument

asked Feb 12 '11 at 18:33

Saikios

3,623
7
37
51

votes

1 answer

beautifulsoup: find_all on bs4.element.ResultSet object or list?

Hi, so I apply find_all on a beautifulsoup object and find something, which is a bs4.element.ResultSet object or a list. I want to further do find_all in there, but it's not allowed on a bs4.element.ResultSet object. I can loop through each…

python html beautifulsoup html-parsing

asked Mar 18 '16 at 04:17

YJZ

3,934
11
43
67

votes

2 answers

Why does a stray </p> end tag generate an empty paragraph?

Apparently, if you have a </p> end tag with no matching start tag within the body element, most if not all browsers will generate an empty paragraph in its place:

Even if any text exists around…

html dom syntax html-parsing

asked Jul 19 '12 at 23:42

BoltClock

700,868
160
1,392
1,356

votes

2 answers

Remove attributes using HtmlAgilityPack

I'm trying to create a code snippet to remove all style attributes regardless of tag using HtmlAgilityPack. Here's my code: var elements = htmlDoc.DocumentNode.SelectNodes("//*"); if (elements!=null) { foreach (var element in elements) { …

html html-parsing html-agility-pack

asked May 01 '11 at 19:19

Ted Nyberg

7,001
7
41
72

votes

1 answer

Get immediate parent tag with BeautifulSoup in Python

I've researched this question but haven't seen an actual solution for it. I'm using BeautifulSoup with Python and what I'm looking to do is get all image tags from a page, loop through each and check each to see if its immediate parent is…

python html beautifulsoup html-parsing

asked Jan 10 '15 at 09:12

stwhite

3,156
4
37
70

votes

1 answer

Selenium: Iterating through groups of elements

I've done this with BeautifulSoup but it's a bit cumbersome, and I'm trying to figure out if I can do it directly with Selenium. Let's say I have the following HTML, which repeats multiple times in the page source with identical elements but…

python html selenium beautifulsoup html-parsing

asked Nov 19 '14 at 00:17

AutomaticStatic

1,661
3
21
42

votes

2 answers

HtmlAgilityPack : illegal characters in path

I'm getting an "illegal characters in path" error in this code. I've mentioned "Error Occuring Here" as a comment in the line where the error is occurring. var document = htmlWeb.Load(searchUrl); var hotels = document.DocumentNode.Descendants("div") …

c# html-parsing html-agility-pack

asked Feb 21 '14 at 07:07

Pranab

votes

1 answer

Symfony DomCrawler: Find element with specific attribute value

I'm using the DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html I'd like to, using the CSS like syntax, get an element with a specific attribute value. Here's the code I'm using: $link = $crawler->filter('#product…

php dom symfony html-parsing

asked Apr 30 '13 at 13:20

user429620

votes

2 answers

PowerShell - HTML parsing: get information from a website

Update: the script is working with PowerShell V3.0, thanks @Doug. I want to use the following PowerShell script to get flight status information from Lufthansa. I can see flight status information in the browser, but I haven't found any way to access…

powershell html-parsing

asked Jan 29 '12 at 13:36

LaPhi

5,675
21
56
78

votes

5 answers

Android ImageGetter images overlapping text

I'm trying to load a block of HTML into a TextView, including images, using URLImageParser p = new URLImageParser(articleBody, this); Spanned htmlSpan = Html.fromHtml(parsedString, p, null); parsedString is the HTML, by the way. Anyway, it loads…

android textview html-parsing spanned

asked Oct 24 '11 at 00:28

Nick

6,900
5
45
66

votes

5 answers

How to safely embed JSON within HTML document?

In a Rails 3.1 app, how can I safely embed some JSON data into an HTML document? Suppose I have this in a controller action: @tags = [ {name:"tag1", color:"green"}, {name:"I can do something bad here", color:"red"} ] And…

ruby-on-rails json ruby-on-rails-3 html-parsing

asked Aug 26 '11 at 14:08

nnc

1,003
2
8
9

votes

3 answers

Nokogiri vs Hpricot?

Which one would you choose? My important attributes are (not in order): Support and future enhancements. Community and general knowledge base (on the Internet). Comprehensive (I.E., proven to parse a wide range of *.*ml pages). Performance. Memory…

ruby nokogiri html-parsing hpricot

asked May 22 '10 at 15:05

roshan

1,323
18
31

votes

2 answers

Set lxml as default BeautifulSoup parser

I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this: soup = bs4.BeautifulSoup(html, 'lxml') but I don't want…

python html beautifulsoup html-parsing lxml

asked Jan 06 '15 at 00:49

Adam Hammes

votes

5 answers

Parse the JavaScript returned from BeautifulSoup

I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to automatically print the menu each day.) I initially approached this using…

javascript python beautifulsoup html-parsing

asked Jan 11 '14 at 23:35

Wade

votes

4 answers

How can I use regular expression to grab an 'img' tag?

I want to grab an img tag from text returned from JSON data like that. I want to grab this from a string:

What is the…

regex image html-parsing

asked Sep 06 '13 at 19:15

eng.ahmed
