Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

HTML content extraction is also referred to as web scraping (web harvesting or web data extraction), a computer software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) requests or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.
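
As a rough illustration of the idea, the sketch below pulls visible text out of an HTML string using only Python's standard-library `html.parser`. The `SKIP_TAGS` set, the `TextExtractor` class and the `extract_text` helper are hypothetical names chosen for this example, and treating `nav`, `header`, `footer` and `aside` as boilerplate is a simplifying assumption. Real extractors such as the readability or boilerpipe libraries mentioned in the questions below use far more elaborate heuristics.

```python
from html.parser import HTMLParser

# Tags whose text is treated as boilerplate in this sketch
# (an assumption for illustration, not a standard list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the kept text fragments of `html`, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<p>Main article text.</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_text(sample))
```

Fetching the page itself (for example with `urllib.request`) is left out on purpose, so the sketch stays independent of any particular site.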

211 questions
15
votes
6 answers

Is there anything for Python that is like readability.js?

I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js http://lab.arc90.com/experiments/readability http://lab.arc90.com/experiments/readability/js/readability.js so that I can give…
Emre Sevinç
  • 8,211
  • 14
  • 64
  • 105
13
votes
3 answers

What HTML parsing libraries do you recommend in Java

I want to parse some HTML in order to find the values of some attributes/tags etc. What HTML parsers do you recommend? Any pros and cons?
pek
  • 17,847
  • 28
  • 86
  • 99
11
votes
4 answers

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block…
VoY
  • 5,479
  • 2
  • 37
  • 45
9
votes
3 answers

How do I save a web page, programmatically?

I would like to save a web page programmatically. I don't mean merely save the HTML. I would also like automatically to store all associated files (images, CSS files, maybe embedded SWF, etc), and hopefully rewrite the links for local browsing. The…
8
votes
5 answers

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc. I'm…
JJ.
  • 4,974
  • 5
  • 39
  • 48
8
votes
2 answers

Using MSXML2.XMLHTTP in Excel VBA to extract large amounts of text data from website

I am trying to download historical stock price data from finance.yahoo.com for 1000s of stocks. The website only displays 60 days of data on a single page so I have to loop through the time period that I am downloading for along with the loop for…
sinhars82
  • 124
  • 1
  • 1
  • 8
7
votes
2 answers

HTML article content extraction - Alchemy API alternative

I've been doing a lot of research to figure out the best way to code an application to get the main article content from almost any HTML webpage. I have a C program that uses libxml2 to parse through the XML, but I came across Alchemy API, which…
Manoj Solanki
  • 96
  • 1
  • 6
7
votes
2 answers

BeautifulSoup - easy way to obtain HTML-free contents

I'm using this code to find all interesting links in a page: soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+')) And it does its job pretty well. Unfortunately inside that a tag there are a lot of nested tags, like font, b and different…
Andrea Ambu
  • 38,188
  • 14
  • 54
  • 77
7
votes
3 answers

Is there a way to use readability and python to extract just text, not HTML?

I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine, and Readability python port. There are a number of those. early version by gfxmonk, based on BeautifulSoup version by minvolai based on…
6
votes
4 answers

Is there a boilerpipe port for .net?

Does anybody know a .net port for the boilerpipe library?
aogan
  • 2,241
  • 1
  • 15
  • 24
6
votes
4 answers

How extract meaningful text from HTML

I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms to do this? I develop my applications on Rails, but I think Ruby is a bit slow at this, so I wonder if there is some good library in C for…
Nisanio
  • 4,056
  • 5
  • 34
  • 46
6
votes
3 answers

Getting BeautifulSoup to find a specific

I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my…
Ryan
  • 85
  • 1
  • 2
  • 6
5
votes
6 answers

best way to extract info from the web delphi

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what i'm searching. ie: Extracting movie rating from 'imdb.com' I'm currently using the IndyHttp components to get the page and i'm using strUtils…
Gab
  • 681
  • 4
  • 14
  • 27
5
votes
6 answers

How do you parse a poorly formatted HTML file?

I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is…
ivo
  • 4,101
  • 5
  • 33
  • 42