Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

Also referred to as web scraping (web harvesting or web data extraction), this is a software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.
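As a rough illustration of the extraction step described above, here is a minimal sketch that pulls visible text out of an HTML string using only Python's standard-library `html.parser`. The class name `TextExtractor`, the helper `extract_text`, and the sample markup are hypothetical, chosen for this example; real pages usually need a more forgiving parser and some heuristics to separate article text from navigation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, fragments joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample document, for illustration only.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"
)
print(extract_text(sample))
```

Fetching the page over HTTP, or driving an embedded browser, would be a separate step in front of this one; the sketch only covers turning markup that is already in memory into text.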

211 questions

votes

2 answers

Extracting the introduction part of a Wikipedia article, by python

I want to extract the introduction part of a Wikipedia article (ignoring all other stuff, including tables, images and other parts). I looked at the html source of the articles, but I don't see any special tag which this part is wrapped in. Can anyone…

python html-content-extraction

asked Nov 28 '10 at 02:37

green-i

votes

1 answer

How to extract only main content from any web page? (without footer, menu bars, navigation bar, side bar, breadcrumb)

I have extracted the whole body content by using this code. But I don't know how to remove the navigation bar, footer, side bar, breadcrumb. Can anyone suggest how to get this done? foreach($dom->getElementsByTagName("body")->item(0)->childNodes as…

php data-extraction html-content-extraction

asked Jan 13 '17 at 11:20

Manoj Kumar K

votes

1 answer

Extracting data using screenscrapers

I am looking for recommendations for a screenscraper. I need to extract "Contact Us" information from certain web sites. Any ideas where I can get a good (pref free) screenscraper?

screen-scraping html-content-extraction

asked Jan 17 '10 at 15:44

LearningCSharp


votes

2 answers

Using Beautiful Soup Python module to replace tags with plain text

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it. I was able to successfully get most of the content…

python html-content-extraction

asked Jan 14 '10 at 01:58

Ecognium


votes

1 answer

Looking for a free alternative to Webzinc .NET, screen scraping, web automation libraries for .NET

I came across this .NET library: http://www.webzinc.com/online/faq.aspx However, I was wondering if there was a free alternative out there?

.net screen-scraping screen html-content-extraction

asked Dec 23 '09 at 09:52

gpow

votes

5 answers

Screen scraping HTTPS using C#

How to screen scrape HTTPS using C#?

c# https screen-scraping html-content-extraction

asked Dec 04 '09 at 15:30

Jignesh

votes

7 answers

Python HTML scraping

It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example: I want to get the href value. Any ideas on how to do this?…

python html regex screen-scraping html-content-extraction

asked Nov 24 '09 at 23:23

pns

votes

1 answer

What ruby gem provides the function to extract the content from web pages?

I'm searching for a ruby gem for my ruby on rails project for extracting content from web pages. I found the ruby-readability gem, but it does not support multiple pages on articles. Can you recommend a gem that also supports multiple page article…

ruby-on-rails rubygems html-content-extraction

asked Jan 11 '13 at 17:58

sn3ek


votes

2 answers

HTML comment scraping in PHP

I've been looking around but have yet to find a solution. I'm trying to scrape an HTML document and get the text between two comments; however, I have been unable to do this successfully so far. I'm using PHP and have tried the PHP Simple DOM parser…

php html parsing screen-scraping html-content-extraction

asked Aug 26 '09 at 05:55

Pep

votes

1 answer

Extract all inline css of concerned html

I want to extract all inline styles of the concerned html. For example, below is the concerned html for which inline css is to be extracted:

Hello…

javascript jquery css html-parsing html-content-extraction

asked Mar 13 '12 at 09:08

S Singh


votes

1 answer

how to use boilerpipe with a local html file?

I have an html file on my local disk and would like to extract text from it using BoilerPipe. The "getText" method from the class ExtractorBase accepts a reader, so I wrote: FileReader fr = new…

java html-content-extraction boilerpipe

asked Nov 28 '11 at 11:57

seinecle


votes

4 answers

How do I read HTML Document in C# given that I have the webpage source stored in a string variable?

I have tried to do this on my own but couldn't. I have an html document, and I'm trying to extract the addresses for all the pictures in it into a c# collection and I'm not sure of the syntax. I'm using HTMLAgilityPack... Here is what I have so far.…

c# html html-agility-pack html-content-extraction

asked Nov 25 '11 at 08:51

cSharpDotNetGuy

votes

5 answers

How to write a regular expression for html parsing?

I'm trying to write a regular expression for my html parser. I want to match a html tag with given attribute (eg. with class="tab news selected" ) that contains one or more tags. The regexp should match the entire tag (from to…

c++ html regex boost html-content-extraction

asked Apr 27 '09 at 08:41

zajcev

votes

1 answer

php, get between function improvement - add array support

I have a function which extracts the content between 2 strings. I use it to extract specific information between html tags. However, it currently extracts only the first match, so I would like to know if it would be possible to improve it in…

php regex html-content-extraction

asked Jun 20 '11 at 12:14

Michael


votes

0 answers

Extracting Headings/Chapters and related paragraphs separately from PDF file in Python 3.7

My task is to fetch chapter-wise content from a pdf file separately so that I can store it in a database. So far, I tried regex and tried to split, but that only gives me the chapter number and didn't help me in splitting the chapters. Next I tried…

python-3.x text-extraction data-extraction html-content-extraction

asked May 21 '20 at 13:03

Ashish Jadon
