Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

Also referred to as web scraping (web harvesting or web data extraction), this is a software technique for extracting information from websites. Such programs usually simulate human exploration of the World Wide Web either by implementing low-level Hypertext Transfer Protocol (HTTP) requests or by embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox.

211 questions
5
votes
2 answers

Html Agility Pack - Accessing Siblings?

The HTML Agility Pack is great for getting descendants, whole tables, etc., but how can you use it in the situation below? ...Html Code above...
Location:
City, London
Jay
  • 2,715
  • 8
  • 33
  • 33
5
votes
4 answers

Get the rendered text from HTML (Delphi)

I have some HTML and I need to extract the actual written text from the page. So far I have tried using a web browser and rendering the page, then going to the document property and grabbing the text. This works, but only where the browser is…
Daisetsu
  • 4,846
  • 11
  • 50
  • 70
5
votes
7 answers

What's the best way to write a maintainable web scraping app?

I wrote a Perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it just using Perl and curl…
Benj
  • 31,668
  • 17
  • 78
  • 127
5
votes
3 answers

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts which make this difficult. Also, I'm on a shared hosting environment, so I can install any Python lib, but I can't…
Johnny4000
  • 187
  • 2
  • 10
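
As a rough illustration of the kind of extraction this question asks about, here is a minimal sketch that uses only Python's standard-library `html.parser`, which does not reject unclosed or badly nested tags. The class name `TextExtractor`, the helper `extract_text`, and the sample markup are hypothetical and not taken from the question; skipping `<script>` and `<style>` content is an assumption about what counts as unwanted text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # non-empty text fragments, in document order
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text fragments of `html`, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical malformed input: unclosed <p> and <b>, plus an inline script.
sample = "<html><body><p>Hello <b>world<p>unclosed tags<script>var x = 1;</script></body>"
print(extract_text(sample))
```

Because fragments are joined with spaces, a word split by inline markup (for example `Hel<b>lo</b>`) would come out as two words; whether that matters depends on the pages being processed.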
5
votes
4 answers

How to integrate HTML pages into WordPress?

I have a page in HTML (index.html), and folders named images, css, and js that are used in it. Now I have to do this in WordPress. Is there any plugin to convert HTML to WordPress, or any other way to do this in WordPress? Please help me. I'm a beginner…
capri
  • 2,047
  • 5
  • 19
  • 11
4
votes
4 answers

Extracting the body text of an HTML document using PHP

I know it's better to use DOM for this purpose but let's try to extract the text in this way:

Some text

EOD; preg_match('//', $html, $matches,…
bobo
  • 8,439
  • 11
  • 57
  • 81
4
votes
5 answers

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following…
Max
  • 6,901
  • 7
  • 46
  • 61
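
One simple heuristic for this kind of problem, sketched below, is to credit each run of text to its nearest enclosing container element and then pick the container that accumulated the most characters. This is not the asker's algorithm. The set of container tags, the names `BlockScorer` and `largest_block`, the `tag#id` labels, and the sample page are all assumptions made for illustration, and the sketch assumes container tags are properly nested.

```python
from html.parser import HTMLParser

CONTAINERS = {"body", "div", "article", "section", "main", "td"}  # assumed set
IGNORED = {"script", "style"}


class BlockScorer(HTMLParser):
    """Credit each text run to its nearest enclosing container element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # one [label, text_length] entry per container
        self.stack = []         # indices into self.blocks for open containers
        self.ignored_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        elif tag in CONTAINERS:
            label = tag + "#" + (dict(attrs).get("id") or "?")
            self.blocks.append([label, 0])
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in IGNORED:
            self.ignored_depth = max(0, self.ignored_depth - 1)
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()  # simplification: assumes well-nested containers

    def handle_data(self, data):
        if self.stack and self.ignored_depth == 0:
            self.blocks[self.stack[-1]][1] += len(data.strip())


def largest_block(html):
    """Return the label of the container holding the most direct text."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return None
    return max(parser.blocks, key=lambda block: block[1])[0]


# Hypothetical page with navigation, content, and footer containers.
sample = """<body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="content"><p>First paragraph of the article, with a fair amount of text.</p>
<p>Second paragraph.</p></div>
<div id="footer">Copyright</div>
</body>"""
print(largest_block(sample))
```

Text inside non-container children such as `<p>` or `<a>` counts toward the enclosing container, which is why a `div` full of paragraphs scores higher than a short navigation bar. Real pages would likely need extra signals, such as link density, to avoid picking long link lists.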
4
votes
3 answers

Strip HTML from a web page and calculate word frequency?

In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter. Finally, let me mention again that I'd like to do this in…
anon
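
The question asks for Groovy; the sketch below shows the same idea in Python instead, purely as an illustration of the steps: strip the tags, tokenize the remaining text, and count. The names `word_frequencies` and `_TextOnly`, the tokenizing regular expression, and the sample page are assumptions, and fetching the page over HTTP is left out so the example stays self-contained.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Gather text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.hidden = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden > 0:
            self.hidden -= 1

    def handle_data(self, data):
        if self.hidden == 0:
            self.parts.append(data)


def word_frequencies(html):
    """Return a Counter mapping each lower-cased word to its occurrence count."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Assumed tokenization: runs of ASCII letters and apostrophes count as words.
    return Counter(re.findall(r"[a-z']+", text))


# Hypothetical page used only to exercise the function.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>The cat and the hat.</p><p>The end</p></body></html>"
)
freq = word_frequencies(sample)
print(freq.most_common(3))
```

A `Counter` is a dictionary subclass, so the result can be used directly as the collection the question mentions.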
4
votes
1 answer

Allowing basic html markup in django

I'm creating an app that will process user-submitted content. I would like to enable users to make their text-based content look pretty with basic HTML markup, i.e. <i>, <b>, <br>. However, I do want to prevent them from using things like script…
Niels
  • 1,513
  • 3
  • 20
  • 29
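
The whitelist idea behind this question can be sketched without any framework: parse the input, re-emit only the allowed tags with their attributes discarded, and escape everything else. The code below is a minimal sketch using Python's standard library, not Django's own mechanism. The names `Whitelist` and `sanitize` and the choice to drop `<script>` and `<style>` content entirely are assumptions. It does not balance unclosed tags, and a maintained sanitizing library is likely a safer choice for production use.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"i", "b", "br"}     # tags taken from the question
DROPPED = {"script", "style"}  # assumption: their content is removed entirely


class Whitelist(HTMLParser):
    """Re-emit allowed tags without attributes; escape all other text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.dropped_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.dropped_depth += 1
        elif tag in ALLOWED and self.dropped_depth == 0:
            self.out.append("<%s>" % tag)  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROPPED:
            self.dropped_depth = max(0, self.dropped_depth - 1)
        elif tag in ALLOWED and tag != "br" and self.dropped_depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.dropped_depth == 0:
            self.out.append(escape(data))


def sanitize(html):
    """Return `html` reduced to escaped text plus the allowed tags."""
    parser = Whitelist()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical user input mixing allowed and disallowed markup.
sample = 'Hello <b onclick="x()">bold</b><br/><script>alert(1)</script><a href="#">link</a>'
print(sanitize(sample))
```

Because the output is built only from escaped text and a fixed set of attribute-free tags, disallowed markup cannot pass through by construction; tags outside the whitelist are dropped while their text is kept.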
4
votes
3 answers

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract the News Title, News Abstract (First Paragraph), etc. I plugged into the webkit parser code to easily navigate a webpage as a tree. To eliminate navigation and other non-news content I take the text version…
3
votes
4 answers

Extracting tag content based on content value using BeautifulSoup

I have an HTML document of the following format.

   1. Content of the paragraph in italic but not strong ignore.

I want to extract the content of paragraph tag, including the content of…
Gopal
  • 1,372
  • 2
  • 16
  • 32
3
votes
10 answers

How do you grab a text from webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources. The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I…
ansgri
  • 2,126
  • 5
  • 25
  • 37
3
votes
2 answers

HTTPBuilder - How can I get the HTML content of a web page?

I need to extract the HTML of a web page. I'm using HTTPBuilder in Groovy, making the following GET: def http = new HTTPBuilder('http://www.google.com/search') http.request(Method.GET) { requestContentType = ContentType.HTML response.success = {…
3
votes
2 answers

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and selectorgadget to scrape a website? For example, I have a website (a newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean: Intel,…
3
votes
2 answers

How to extract text from HTML using htmlagilitypack for this sample?

I want to extract the text from an HTML source. I'm trying with C# and the htmlagilitypack DLL. The source is:
Here 2
1 2
3
14 15