Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically — e.g., in order to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document to a tree-based Document Object Model (DOM).

The standard parsing algorithm is defined at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
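
As a rough illustration of turning serialized HTML into a tree you can query, here is a minimal sketch using Python's standard-library `html.parser`. The `Node`, `TreeBuilder`, and `parse_html` names are hypothetical helpers made up for this example. Note that `html.parser` is a lenient tokenizer and does not implement the error-recovery rules of the HTML specification's algorithm, so treat this as a toy model rather than a browser-grade parser.

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal element node: tag, attributes, children, text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return all descendants with the given tag name, in document order."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the start/end/data events of HTMLParser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Find the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


doc = parse_html('<ul><li><a href="/a">First</a></li>'
                 '<li><a href="/b">Second</a></li></ul>')
links = [(a.attrs.get("href"), a.text) for a in doc.find_all("a")]
print(links)  # [('/a', 'First'), ('/b', 'Second')]
```

For real documents, a library that follows the specification's algorithm (or a browser itself) will handle malformed markup far more faithfully than this sketch.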


5960 questions
26
votes
2 answers

DOMDocument in php

I have just started reading documentation and examples about DOM, in order to crawl and parse the document. For example I have part of document shown below:
Saikios
  • 3,623
  • 7
  • 37
  • 51
26
votes
1 answer

beautifulsoup: find_all on bs4.element.ResultSet object or list?

Hi, so I apply find_all on a beautifulsoup object and find something, which is a bs4.element.ResultSet object or a list. I want to further do find_all in there, but it's not allowed on a bs4.element.ResultSet object. I can loop through each…
YJZ
  • 3,934
  • 11
  • 43
  • 67
26
votes
2 answers

Why does a stray </p> end tag generate an empty paragraph?

Apparently, if you have a </p> end tag with no matching start tag within the body element, most if not all browsers will generate an empty paragraph in its place:

Even if any text exists around…
BoltClock
  • 700,868
  • 160
  • 1,392
  • 1,356
25
votes
2 answers

Remove attributes using HtmlAgilityPack

I'm trying to create a code snippet to remove all style attributes regardless of tag using HtmlAgilityPack. Here's my code: var elements = htmlDoc.DocumentNode.SelectNodes("//*"); if (elements!=null) { foreach (var element in elements) { …
Ted Nyberg
  • 7,001
  • 7
  • 41
  • 72
25
votes
1 answer

Get immediate parent tag with BeautifulSoup in Python

I've researched this question but haven't seen an actual solution to this. I'm using BeautifulSoup with Python and what I'm looking to do is get all image tags from a page, loop through each and check each to see if its immediate parent is…
stwhite
  • 3,156
  • 4
  • 37
  • 70
25
votes
1 answer

Selenium: Iterating through groups of elements

I've done this with BeautifulSoup but it's a bit cumbersome, and I'm trying to figure out if I can do it directly with Selenium. Let's say I have the following HTML, which repeats multiple times in the page source with identical elements but…
AutomaticStatic
  • 1,661
  • 3
  • 21
  • 42
25
votes
2 answers

HtmlAgilityPack : illegal characters in path

I'm getting an "illegal characters in path" error in this code. I've mentioned "Error Occuring Here" as a comment in the line where the error is occurring. var document = htmlWeb.Load(searchUrl); var hotels = document.DocumentNode.Descendants("div") …
Pranab
  • 382
  • 3
  • 10
25
votes
1 answer

Symfony DomCrawler: Find element with specific attribute value

I'm using the DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html I'd like to, using the CSS like syntax, get an element with a specific attribute value. Here's the code I'm using: $link = $crawler->filter('#product…
user429620
24
votes
2 answers

PowerShell - HTML parsing: get information from a website

Update, Script is working with PowerShell V3.0, Thanks @ Doug I want to use the following PowerShell script to get flight status information from Lufthansa. I can see flight status information in the browser, but I haven't found any way to access…
LaPhi
  • 5,675
  • 21
  • 56
  • 78
24
votes
5 answers

Android ImageGetter images overlapping text

I'm trying to load a block of HTML into a TextView, including images, using URLImageParser p = new URLImageParser(articleBody, this); Spanned htmlSpan = Html.fromHtml(parsedString, p, null); parsedString is the HTML, by the way. Anyway, it loads…
Nick
  • 6,900
  • 5
  • 45
  • 66
24
votes
5 answers

How to safely embed JSON with in HTML document?

In a Rails 3.1 app, how can I safely embed some JSON data into an HTML document? Suppose I have this in a controller action: @tags = [ {name:"tag1", color:"green"}, {name:"I can do something bad here", color:"red"} ] And…
nnc
  • 1,003
  • 2
  • 8
  • 9
24
votes
3 answers

Nokogiri vs Hpricot?

Which one would you choose? My important attributes are (not in order): Support and future enhancements. Community and general knowledge base (on the Internet). Comprehensive (I.E., proven to parse a wide range of *.*ml pages). Performance. Memory…
roshan
  • 1,323
  • 18
  • 31
24
votes
2 answers

Set lxml as default BeautifulSoup parser

I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this: soup = bs4.BeautifulSoup(html, 'lxml') but I don't want…
Adam Hammes
  • 820
  • 1
  • 8
  • 22
24
votes
5 answers

Parse the JavaScript returned from BeautifulSoup

I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to automatically print the menu each day.) I initially approached this using…
Wade
  • 741
  • 1
  • 5
  • 18
24
votes
4 answers

How can I use regular expression to grab an 'img' tag?

I want to grab an img tag from text returned from JSON data like that. I want to grab this from a string: What is the…
eng.ahmed
  • 905
  • 4
  • 16
  • 38