Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

Also referred to as web scraping (web harvesting or web data extraction), this is a computer software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web either by implementing low-level Hypertext Transfer Protocol (HTTP) or by embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox.

211 questions
3
votes
2 answers

Extracting the introduction part of a Wikipedia article with Python

I want to extract the introduction part of a Wikipedia article (ignoring everything else, including tables, images and other parts). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part. Can anyone…
green-i
  • 315
  • 2
  • 16
3
votes
1 answer

How to extract only main content from any web page? (without menu bars, navigation bar, footer, side bar, breadcrumb)

I have extracted the whole body content by using this code, but I don't know how to remove the navigation bar, footer, side bar and breadcrumb. Can anyone suggest how to get this done? foreach($dom->getElementsByTagName("body")->item(0)->childNodes as…
Manoj Kumar K
  • 73
  • 1
  • 2
  • 10
3
votes
1 answer

Extracting data using screenscrapers

I am looking for recommendations for a screenscraper. I need to extract "Contact Us" information from certain web sites. Any ideas where I can get a good (preferably free) screenscraper?
LearningCSharp
  • 1,292
  • 2
  • 13
  • 26
3
votes
2 answers

Using Beautiful Soup Python module to replace tags with plain text

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it. I was able to successfully get most of the content…
Ecognium
  • 2,046
  • 1
  • 19
  • 35
3
votes
1 answer

Looking for a free alternative to Webzinc .NET, screen scraping, web automation libraries for .NET

I came across this .NET library: http://www.webzinc.com/online/faq.aspx However, I was wondering if there was a free alternative out there?
gpow
  • 711
  • 3
  • 8
  • 18
3
votes
5 answers

Screen scraping HTTPS using C#

How to screen scrape HTTPS using C#?
Jignesh
  • 165
  • 2
  • 5
  • 13
3
votes
7 answers

Python HTML scraping

It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example: I want to get the href value. Any ideas on how to do this?…
pns
  • 413
  • 1
  • 8
  • 19
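
The task described in this excerpt (collecting the href of links whose class has a specific value) can be sketched with Python's standard-library `html.parser`. This is only a minimal illustration under stated assumptions: the names `ClassLinkCollector` and `links_with_class` and the sample markup are hypothetical and do not come from the question, whose own example markup is not shown here.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class list contains a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.target_class in classes and href:
            self.hrefs.append(href)


def links_with_class(html, target_class):
    """Return the href of every <a> element carrying target_class."""
    parser = ClassLinkCollector(target_class)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical sample markup, not taken from the question.
sample = (
    '<p><a class="external big" href="https://example.com/a">A</a>'
    '<a class="internal" href="/b">B</a>'
    '<a class="external" href="https://example.com/c">C</a></p>'
)
print(links_with_class(sample, "external"))
```

Matching against the split class list, rather than comparing the raw attribute string, means a link with several classes is still found.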
3
votes
1 answer

What ruby gem provides the function to extract the content from web pages?

I'm searching for a Ruby gem for my Ruby on Rails project for extracting content from web pages. I found the ruby-readability gem, but it does not support articles that span multiple pages. Can you recommend a gem that also supports multiple page article…
sn3ek
  • 1,929
  • 3
  • 22
  • 32
3
votes
2 answers

HTML comment scraping in PHP

I've been looking around but have yet to find a solution. I'm trying to scrape an HTML document and get the text between two comments; however, I have been unable to do this successfully so far. I'm using PHP and have tried the PHP Simple DOM parser…
Pep
  • 145
  • 1
  • 2
  • 4
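
The question above concerns PHP, but the general idea (capture text only while the parser is between two marker comments) can be shown as a short Python sketch using the standard-library `html.parser`. The class name, function name, and the marker strings `start` and `end` are assumptions made for illustration, not details from the question.

```python
from html.parser import HTMLParser


class BetweenComments(HTMLParser):
    """Collect text that appears between two marker comments."""

    def __init__(self, start_marker, end_marker):
        super().__init__()
        self.start_marker = start_marker
        self.end_marker = end_marker
        self.capturing = False
        self.chunks = []

    def handle_comment(self, data):
        # data is the comment body without the <!-- and --> delimiters.
        marker = data.strip()
        if marker == self.start_marker:
            self.capturing = True
        elif marker == self.end_marker:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def text_between_comments(html, start_marker, end_marker):
    """Return the whitespace-normalized text between the two marker comments."""
    parser = BetweenComments(start_marker, end_marker)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample markup with hypothetical marker comments.
sample = "<div>skip<!-- start --><p>Hello <b>world</b></p><!-- end -->tail</div>"
print(text_between_comments(sample, "start", "end"))
```

Because the comments are handled by a parser rather than by string matching, nested tags between the markers do not need special treatment; only their text is kept.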
2
votes
1 answer

Extract all inline CSS of the concerned HTML

I want to extract all inline styles of the concerned HTML. For example, below is the concerned HTML for which inline CSS is to be extracted:
Hello…
S Singh
  • 1,403
  • 9
  • 31
  • 47
2
votes
1 answer

how to use boilerpipe with a local html file?

I have an html file on my local disk and would like to extract text from it using BoilerPipe. The "getText" method from the class ExtractorBase accepts a reader, so I wrote: FileReader fr = new…
seinecle
  • 10,118
  • 14
  • 61
  • 120
2
votes
4 answers

How do I read HTML Document in C# given that I have the webpage source stored in a string variable?

I have tried to do this on my own but couldn't. I have an HTML document, and I'm trying to extract the addresses of all the pictures in it into a C# collection, and I'm not sure of the syntax. I'm using HtmlAgilityPack... Here is what I have so far.…
2
votes
5 answers

How to write a regular expression for html parsing?

I'm trying to write a regular expression for my HTML parser. I want to match an HTML tag with a given attribute (e.g.
with class="tab news selected") that contains one or more tags. The regexp should match the entire tag (from
to…
zajcev
2
votes
1 answer

php, get between function improvement - add array support

I have a function which extracts the content between two strings. I use it to extract specific information between HTML tags. However, it currently extracts only the first match, so I would like to know if it would be possible to improve it in…
Michael
  • 6,377
  • 14
  • 59
  • 91
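
The improvement asked about here (returning every match between two delimiter strings instead of only the first) can be sketched as follows. The question is about PHP and its original function is not shown, so this is a Python illustration of the technique only; `get_all_between` and the sample string are hypothetical.

```python
def get_all_between(text, start, end):
    """Return every substring of text found between start and end, in order."""
    if not start or not end:
        raise ValueError("start and end delimiters must be non-empty")

    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            # An opening delimiter without a closing one ends the scan.
            break
        results.append(text[begin:finish])
        # Continue searching after the closing delimiter just consumed.
        pos = finish + len(end)
    return results


# Hypothetical sample input.
sample = "<li>one</li><li>two</li><li>three"
print(get_all_between(sample, "<li>", "</li>"))
```

Plain substring scanning like this is fragile on real-world markup (attributes, nesting, varying whitespace), so a proper HTML parser is usually the safer choice for anything beyond simple, predictable input.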
2
votes
0 answers

Extracting Headings/Chapters and related paragraphs separately from PDF file in Python 3.7

My task is to fetch chapter-wise content from a PDF file separately so that I can store it in a database. So far, I tried regex and tried to split, but that only gives me the chapter number and didn't help me in splitting the chapters. Next I tried…