Questions tagged [html-content-extraction]

Techniques for predicting or detecting the article text in a document and extracting it. Also referred to as web scraping (or web harvesting, or web data extraction): a software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web, either by implementing the low-level Hypertext Transfer Protocol (HTTP) or by embedding a full web browser such as Internet Explorer or Mozilla Firefox.
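As a rough illustration of the kind of technique this tag covers, here is a minimal sketch that uses only Python's standard-library `html.parser`. It assumes that the main content is simply the longest block of text outside navigation-like elements. That is a naive heuristic chosen for brevity, not the method used by readability.js or boilerpipe. The names `BlockTextParser`, `longest_block`, `SKIP_TAGS` and `BLOCK_TAGS` are made up for this example.

```python
from html.parser import HTMLParser

# Tags whose text is treated as boilerplate in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Tags that start or end a text block in this sketch (also an assumption).
BLOCK_TAGS = {"p", "div", "article", "section", "td", "li"}


class BlockTextParser(HTMLParser):
    """Collect the text of each block-level element, ignoring SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.blocks = []     # finished text blocks
        self.current = []    # text fragments of the block being read

    def flush(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join(" ".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def longest_block(html):
    """Return the longest text block found, or '' if there is none."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return max(parser.blocks, key=len, default="")
```

Calling `longest_block(page_source)` on a fetched page returns a single string. Real pages usually need a more careful scoring scheme (link density, class names, text-to-tag ratio) than raw length.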

211 questions

6 answers

Is there anything for Python that is like readability.js?

I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js http://lab.arc90.com/experiments/readability http://lab.arc90.com/experiments/readability/js/readability.js so that I can give…

asked May 27 '10 at 12:53

Emre Sevinç

3 answers

What HTML parsing libraries do you recommend in Java

I want to parse some HTML in order to find the values of some attributes/tags etc. What HTML parsers do you recommend? Any pros and cons?

java html parsing html-content-extraction

asked Aug 25 '08 at 18:54

pek

4 answers

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block…

algorithm html html-content-extraction

asked Jan 04 '10 at 12:22

VoY

3 answers

How do I save a web page, programmatically?

I would like to save a web page programmatically. I don't mean merely save the HTML. I would also like automatically to store all associated files (images, CSS files, maybe embedded SWF, etc), and hopefully rewrite the links for local browsing. The…

caching web-applications screen-scraping html-content-extraction

asked Nov 13 '09 at 22:32

Joseph Turian

5 answers

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc. I'm…

python html parsing semantics html-content-extraction

asked Apr 28 '09 at 06:40

JJ.

2 answers

Using MSXML2.XMLHTTP in Excel VBA to extract large amounts of text data from website

I am trying to download historical stock price data from finance.yahoo.com for 1000s of stocks. The website only displays 60 days of data on a single page so I have to loop through the time period that I am downloading for along with the loop for…

excel vba msxml html-content-extraction

asked Mar 02 '14 at 08:10

sinhars82

2 answers

HTML article content extraction - Alchemy API alternative

I've been doing a lot of research to figure out the best way to code an application to get the main article content from almost any HTML webpage. I have a C program that uses libxml2 to parse through the XML, but I came across Alchemy API, which…

html html-content-extraction alchemyapi

asked Nov 08 '10 at 14:03

Manoj Solanki

2 answers

How to parse HTML with C++/Qt?

How can I parse the following HTML: 12345 Hello. I would like to retrieve the data "12345" from a "span" with style="font-size:11px" from www.testtest.com, but I only want that very data,…

c++ qt qtwebkit html-content-extraction qtcore

asked Sep 07 '13 at 19:01

NPLS

2 answers

BeautifulSoup - easy way to obtain HTML-free contents

I'm using this code to find all interesting links in a page: soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+')) And it does its job pretty well. Unfortunately inside that a tag there are a lot of nested tags, like font, b and different…

python beautifulsoup html-parsing html-content-extraction

asked Nov 17 '09 at 23:38

Andrea Ambu

3 answers

Is there a way to use readability and python to extract just text, not HTML?

I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are a number of those: an early version by gfxmonk, based on BeautifulSoup; a version by minvolai, based on…

python readability text-extraction html-content-extraction

asked Jun 22 '12 at 06:15

Michael Kariv

4 answers

Is there a boilerpipe port for .net?

Does anybody know a .net port for the boilerpipe library?

c# .net text-extraction html-content-extraction boilerpipe

asked Jan 02 '12 at 20:42

aogan

4 answers

How to extract meaningful text from HTML

I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms to do this? I develop my applications on Rails, but I think Ruby is a bit slow at this, so I wonder if there is a good library in C for…

html c ruby html-parsing html-content-extraction

asked Oct 19 '10 at 14:30

Nisanio

3 answers

Getting BeautifulSoup to find a specific

I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my…

python beautifulsoup html-content-extraction

asked Mar 26 '10 at 06:32

Ryan

6 answers

best way to extract info from the web delphi

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting a movie rating from 'imdb.com'. I'm currently using the IndyHttp components to get the page and I'm using strUtils…

delphi parsing html-content-extraction information-extraction

asked Jan 13 '12 at 00:03

Gab

6 answers

How do you parse a poorly formatted HTML file?

I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is…

html parsing text html-content-extraction

asked Apr 02 '09 at 17:10

ivo
