Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

Also referred to as web scraping, web harvesting, or web data extraction, this is a software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web either by implementing low-level Hypertext Transfer Protocol (HTTP) requests or by embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox.
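As a rough illustration of the extraction step described above, here is a minimal sketch in Python that pulls visible text and link targets out of an HTML string using only the standard library's `html.parser`. The class name `TextAndLinkExtractor` and the helper `extract` are hypothetical names chosen for this example, and the sketch assumes the page has already been downloaded (for instance over plain HTTP) into a string.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and href values, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []   # non-empty text chunks in document order
        self.links = []        # href values of <a> elements
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (visible_text, links) for an HTML string (hypothetical helper)."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

This only separates text from markup. Deciding which text is the "main article" needs additional heuristics, which is what many of the questions under this tag are about.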

211 questions
2
votes
3 answers

How to extract values from HTML using RegEx?

Given the following HTML:

OAK RIDGE, N.J., March 16, 2011 /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:

Paul Fryer
  • 9,268
  • 14
  • 61
  • 93
2
votes
1 answer

How to get the links from all the embedded videos on a webpage?

Let me explain. What I'm trying to do is, given a certain webpage, get the number of embedded videos and their links. I'm not asking for the code itself, but for some pointers on how to achieve that.
Gustavo
  • 21
  • 1
  • 2
2
votes
1 answer

Generic Article Extraction from web pages

I am going to begin my work in article extraction. The task that I will be doing is to extract the hotel reviews that are posted on different web pages (e.g. 1.…
LGAP
  • 2,365
  • 17
  • 50
  • 71
2
votes
2 answers

XQuery extract between two tags

I am currently working on extracting data from HTML. I would like to extract the text between two tags.

XYZ:

asdfghjk

sdsdsd

Technocrat
  • 211
  • 2
  • 4
  • 11
2
votes
1 answer

Extraction of main content of an article (JavaScript)

I'm writing a program that reads a general HTML "article" page (Wikipedia, NY Times, Yahoo News, etc.). From that page I want to strip away all of the "noise" (ads, header bars, anything that isn't part of the article content). To think about it…
2
votes
1 answer

parsing HTML in swift

Can anyone help me out with this one: I have a HTTP page formatted this way:
2
votes
0 answers

Loop for Extracting Detailed HTML tables from multiple webpages into Excel

I would like to extract info from each page on http://www.adac.de/infotestrat/autodatenbank/suchergebnis.aspx when I go into the details for each car (after clicking "Suchen", i.e. "Search"). E.g. first row…
2
votes
3 answers

php : parse html : extract script tags from body and inject before </body>?

I don't care what the library is, but I need a way to extract <script> elements from the <body> of a page (as string). I then want to insert the extracted <script>s just before </body>. Ideally, I'd like to extract the <script>s into 2…
theclueless1
  • 123
  • 1
  • 1
  • 11
2
votes
2 answers

Any ideas about the jQuery equivalent of the READABILITY code? (Or: building the best heuristic to find the main text using jQuery)

http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a very readable manner. It does this by using some heuristics to find the relevant main text of a web page. Its source…
Emre Sevinç
  • 8,211
  • 14
  • 64
  • 105
2
votes
3 answers

Reading source code from a webpage in java

I am trying to read source code from a webpage. My Java code is: import java.net.*; import java.io.*; import java.util.*; import javax.swing.JOptionPane; class Testing{ public static void Connect() throws Exception{ URL url = new…
Ahmad Ali
  • 85
  • 1
  • 1
  • 5
2
votes
1 answer

extract information from a website using Qt?

I'd like to extract the information in the b tag (=> 123456789). This is the HTML source:
NPLS
  • 521
  • 2
  • 8
  • 20
2
votes
4 answers

How to extract data from a raw HTML file?

Is there a way to extract desired data from raw HTML that has been written unsemantically, with no IDs or classes? I mean, suppose there is a saved HTML file of a webpage (profile) and I want to extract data like (say) 'hobbies'. Is it…
apnerve
  • 4,740
  • 5
  • 29
  • 45
2
votes
5 answers

PHP - how to get main HTML content like Reader Mode in Firefox

In the Firefox app for Android and in Safari on iPad, we can read only the main content using "Reader Mode". How can I recognize only the main content in HTML with PHP? I need to detect the main news like Firefox or Safari does, using PHP. For example, I get news from…
Milad Ghiravani
  • 1,625
  • 23
  • 43
2
votes
3 answers

Scraping from wsj.com or finance.yahoo.com

I want to display on a WordPress page the total volume of shares traded on the NYSE over the last 2 weeks that it's been open. What is the best way to go about doing this?
pg.
  • 2,503
  • 4
  • 42
  • 67
2
votes
3 answers

How to programmatically extract information from a web page, using Linux command line?

I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates. The www.xe.com website provides a historical lookup tool, and using a detailed URL, one can get the rate table for a specific date, w/o…
ysap
  • 7,723
  • 7
  • 59
  • 122