Questions tagged [html-parsing]

HTML parsing is the process of consuming a serialization of an HTML document and producing a representation that you can work with programmatically, for example to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which is implemented in all major browsers.

HTML parsing typically involves converting an HTML document into a tree-based Document Object Model (DOM).

The standard parsing algorithm is specified at https://html.spec.whatwg.org/multipage/parsing.html#parsing.
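
As a rough illustration of turning serialized HTML into a tree you can query, here is a minimal sketch using Python's standard-library `html.parser`. The `Node`, `TreeBuilder`, `parse`, and `text_of` names are hypothetical helpers written for this example. Note that `html.parser` is an event-based tokenizer and does not implement the specification's tree-construction algorithm, so this sketch makes no attempt at browser-grade error recovery.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or children in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal element node: tag name, attributes, and ordered children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simple tree from the parser's start/end/text events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Skip whitespace-only text to keep the example tree small.
        if data.strip():
            self._stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_of(node):
    """Concatenate all text below a node, in document order."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


doc = parse('<ul><li><a href="/a">First</a></li><li>Second<br>line</li></ul>')
ul = doc.children[0]
print(ul.tag, len(ul.children), text_of(ul))
```

For real-world, possibly malformed documents, a parser that follows the specification's algorithm is usually a better choice than a hand-rolled tree builder like this one.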

5960 questions
35 votes, 9 answers

Problem with HTML Parser in IE

I am trying to create a dialog box that will appear only if the browser selected is IE (any version); however, I get this error: Message: HTML Parsing Error: Unable to modify the parent container element before the child element is closed…
Tsundoku
  • 9,104
  • 29
  • 93
  • 127
32 votes, 3 answers

HtmlAgilityPack set node InnerText

I want to replace the inner text of HTML tags with other text. I am using HtmlAgilityPack. I use this code to extract all the text: HtmlDocument doc = new HtmlDocument(); doc.Load("some path"); foreach (HtmlNode node in…
Shahin
  • 12,543
  • 39
  • 127
  • 205
32 votes, 5 answers

Android HTML ImageGetter as AsyncTask

Okay, I'm losing my mind over this one. I have a method in my program which parses HTML. I want to include the inline images, and I am under the impression that using the Html.fromHtml(string, Html.ImageGetter, Html.TagHandler) will allow this to…
Nick
  • 6,900
  • 5
  • 45
  • 66
31 votes, 12 answers

jQuery-like interface for PHP?

I was curious as to whether or not there exists a jQuery-style interface/library for PHP for handling HTML/XML files -- specifically using jQuery style selectors. I'd like to do things like this (all hypothetical): foreach (j("div > p > a") as…
theotherlight
  • 783
  • 1
  • 8
  • 12
30 votes, 2 answers

HTML Agility Pack strip tags NOT IN whitelist

I'm trying to create a function that removes HTML tags and attributes that are not in a whitelist. I have the following HTML: first text second text here some text here some text here some twxt…
Dragos Durlut
  • 8,018
  • 10
  • 47
  • 62
29 votes, 1 answer

HtmlAgility - Save parsing to a string

Just tried using the HtmlAgility Pack for the first time and have a problem. First I load in from a string variable. string NewsText = dr["Message"].ToString(); HtmlAgilityPack.HtmlDocument htmlDoc = new…
larschanders
  • 1,951
  • 3
  • 16
  • 21
29 votes, 3 answers

Python BeautifulSoup scrape tables

I am trying to create a table scrape with BeautifulSoup. I wrote this Python code: import urllib2 from bs4 import BeautifulSoup url = "http://dofollow.netsons.org/table1.htm" # change to whatever your url is page =…
kingcope
  • 1,121
  • 4
  • 19
  • 36
29 votes, 4 answers

How can I add "current streak" of contributions from GitHub to my blog?

I have a personal blog I built using Rails. I want to add a section to my site that displays my current streak of GitHub contributions. What would be the best way to go about doing this? edit: for clarification, here is what I want: just the number of…
Ox Smith
  • 509
  • 1
  • 7
  • 20
29 votes, 1 answer

Get text content of an HTML element using XPath?

See this HTML: Monitor $300 Add to cart Keyboard $20 Add to cart Using XPath…
Genghis Khan
  • 970
  • 2
  • 11
  • 21
28 votes, 3 answers

How do I convert a document made in Jsoup (the Java HTML parser) into a string?

I have a document that was made in jsoup that looks like this: Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); How do I convert that doc into a string?
Hudson Hughes
  • 343
  • 1
  • 3
  • 9
28 votes, 9 answers

Is it possible to get data from HTML forms into Android while using WebView?

I'm making a very simple HTML form that is viewed on Android in a WebView. It takes your name in a text box and, when you click the button, displays it in a paragraph; it's made using both HTML and JavaScript. This is my…
Shariq Musharaf
  • 997
  • 2
  • 10
  • 25
28 votes, 1 answer

Differences between .text and .get_text()

In BeautifulSoup, is there any difference between .text and .get_text()? Which one should be preferred for getting an element's text? >>> from bs4 import BeautifulSoup >>> >>> html = " text1 text2 " >>> soup = BeautifulSoup(html,…
alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
27 votes, 6 answers

Parsing HTML in Python

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and it's now deprecated. I would prefer it if it could stomach a bit of malformed HTML, although I'm pretty sure…
Andy Baker
  • 21,158
  • 12
  • 58
  • 71
27 votes, 5 answers

JavaScript DOM childNodes.length also returning number of text nodes

In JavaScript DOM, childNodes.length returns the number of both element and text nodes. Is there any way to count only the number of element-only child nodes? For example, childNodes.length of div#posts will return 6, when I expected 2:
Samuel Liew
  • 76,741
  • 107
  • 159
  • 260
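
The behaviour described in the excerpt above, where `childNodes` counts whitespace text nodes as well as elements, can be reproduced outside a browser. The following is a minimal sketch using Python's standard-library `xml.dom.minidom`, which exposes the same `childNodes` and `nodeType` DOM properties. The markup is hypothetical (the question's actual HTML is not shown in the excerpt), and `minidom` is an XML parser, so it only accepts well-formed input.

```python
from xml.dom import minidom

# Hypothetical markup; the question's actual HTML is not shown in the excerpt.
markup = """<div id="posts">
  <div class="post">Post 1</div>
  <div class="post">Post 2</div>
</div>"""

posts = minidom.parseString(markup).documentElement

# childNodes includes the whitespace-only text nodes between the elements.
all_children = posts.childNodes

# Keep only element nodes by checking nodeType.
element_children = [n for n in all_children if n.nodeType == n.ELEMENT_NODE]

print(len(all_children))      # elements plus whitespace text nodes
print(len(element_children))  # elements only
```

In browser JavaScript, the `children` collection and the `childElementCount` property of an element give the element-only view directly.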
27 votes, 6 answers

Extract links from a web page using Go lang

I am learning Google's Go programming language. Does anyone know the best practice for extracting all URLs from an HTML web page? Coming from the Java world, I know there are libraries to do the job, for example jsoup, htmlparser, etc. But for Go, I guess…
Jifeng Zhang
  • 5,037
  • 4
  • 30
  • 43