Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document. Also referred to as web scraping (web harvesting or web data extraction), this is a software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.

211 questions

2 answers

Html Agility Pack - Accessing Siblings?

Using the HTML Agility Pack is great for getting descendants and whole tables etc... but how can you use it in the below situation ...Html Code above...

Location:: City, London

4 answers

Get the rendered text from HTML (Delphi)

I have some HTML and I need to extract the actual written text from the page. So far I have tried using a web browser and rendering the page, then going to the document property and grabbing the text. This works, but only where the browser is…

html delphi html-parsing html-content-extraction

asked Jun 08 '10 at 21:29

Daisetsu

7 answers

What's the best way to write a maintainable web scraping app?

I wrote a Perl script a while ago that logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it just using Perl and curl…

perl webforms screen-scraping html-content-extraction

asked Nov 09 '09 at 11:17

Benj

3 answers

Python strategy for extracting text from malformed HTML pages

I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts which make this difficult. Also, I'm on a shared hosting environment, so I can install any Python lib, but I can't…

python html text html-content-extraction

asked Oct 23 '09 at 18:11

Johnny4000

4 answers

How to integrate HTML pages into WordPress?

I have an HTML page (index.html) and folders named images, css, and js that are used in it. Now I have to do this in WordPress. Is there any plugin to convert HTML to WordPress, or any other way to do this in WordPress? Please help me. I'm a beginner…

php wordpress content-management-system html-content-extraction

asked Jun 14 '12 at 06:15

capri

4 answers

Extracting the body text of an HTML document using PHP

I know it's better to use DOM for this purpose but let's try to extract the text in this way:

Some text

EOD; preg_match('//', $html, $matches,…

php regex text text-processing html-content-extraction

asked Feb 06 '11 at 01:42

bobo

5 answers

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following…
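
One simple way to approach this kind of heuristic can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions, not the asker's algorithm: the set of candidate container tags (`CONTAINERS`), the class name `LargestBlockFinder`, and the function name `largest_text_block` are all hypothetical, and text is credited only to the nearest enclosing container rather than to every ancestor.

```python
from html.parser import HTMLParser

# Tags treated as candidate content containers (an assumption of this sketch).
CONTAINERS = {"div", "td", "article", "section", "main", "body"}
# Tags whose text is never counted as page content.
SKIP = {"script", "style"}


class LargestBlockFinder(HTMLParser):
    """Credits each text node to the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # one [tag, attrs, text_length] entry per container
        self.stack = []      # indices into self.blocks for currently open containers
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.blocks.append([tag, dict(attrs), 0])
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS:
            # Close the nearest open container with this tag, tolerating
            # unclosed containers nested inside it.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.blocks[self.stack[i]][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.stack and not self.skip_depth:
            self.blocks[self.stack[-1]][2] += len(data.strip())


def largest_text_block(html):
    """Return (tag, attrs, text_length) for the container holding the most text."""
    finder = LargestBlockFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return None
    tag, attrs, length = max(finder.blocks, key=lambda block: block[2])
    return tag, attrs, length
```

Calling `largest_text_block(page_source)` on a page with navigation, content, and footer divs would return the tag and attributes of whichever container directly holds the most characters of text. Real pages usually need extra signals, such as link density or punctuation counts, which this sketch leaves out.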

html screen-scraping text-extraction html-content-extraction

asked Nov 14 '08 at 08:04

Max

3 answers

Strip HTML from a web page and calculate word frequency?

In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter. Finally, let me mention again that I'd like to do this in…
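
The general shape of such a pipeline (strip tags, tokenize, count) can be sketched briefly. The question asks for Groovy; the sketch below is in Python instead, uses only the standard library, and works on an HTML string that has already been fetched. The names `TextExtractor` and `word_frequencies` are hypothetical, and the word pattern is a deliberately crude assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def word_frequencies(html):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts)
    # Crude tokenizer: runs of ASCII letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)
```

The resulting `Counter` behaves like a dictionary, and its `most_common(n)` method returns the `n` most frequent words with their counts. Fetching the page itself could be done with `urllib.request.urlopen`, which is left out here so the sketch runs without network access.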

java html groovy html-content-extraction text-extraction

asked Oct 16 '08 at 04:02

anon

1 answer

Allowing basic HTML markup in Django

I'm creating an app that will process user-submitted content. I would like to enable users to make their text-based content look pretty with basic HTML markup, i.e. <i>, <b>, <br>. However, I do want to prevent them from using things like script…

html django django-templates html-content-extraction

asked Aug 04 '13 at 23:11

Niels

3 answers

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate the webpage as a tree. To eliminate navigation and other non-news content I take the text version…

html artificial-intelligence nlp html-content-extraction text-extraction

asked Nov 08 '09 at 15:42

Ankur Gupta

4 answers

Extracting tag content based on content value using BeautifulSoup

I have an HTML document of the following format.

1. Content of the paragraph in italic but not strong ignore.

I want to extract the content of paragraph tag, including the content of…

python beautifulsoup html-content-extraction

asked Jan 18 '12 at 11:40

Gopal

10 answers

How do you grab text from a webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources. The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I…

java html html-content-extraction

asked Sep 16 '08 at 11:48

ansgri

2 answers

HTTPBuilder - How can I get the HTML content of a web page?

I need to extract the HTML of a web page. I'm using HTTPBuilder in Groovy, making the following GET request: def http = new HTTPBuilder('http://www.google.com/search') http.request(Method.GET) { requestContentType = ContentType.HTML response.success = {…

html-content-extraction httpbuilder

asked Jul 25 '11 at 13:35

NachoAsking

2 answers

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website (a Newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean - Intel,…

python css screen-scraping beautifulsoup html-content-extraction

asked Feb 26 '09 at 23:21

rawnd

2 answers

How to extract text from HTML using HtmlAgilityPack for this sample?

I want to extract the text from an HTML source. I'm trying with C# and the HtmlAgilityPack DLL. The source is:

Here 2

c# linq xpath html-agility-pack html-content-extraction

asked May 03 '11 at 13:56

bigbada
