I'm developing an application in Java that takes textual information from different web pages and summarizes it into one page. For example, suppose the same news story appears on several web pages, such as The Hindu, The Times of India, The Statesman, etc. My application is supposed to extract the important points from each of these pages and put them together as a single news item. The application is based on the concepts of web content mining. As a beginner to this field, I can't figure out where to start. I have gone through research papers which explain noise removal as the first step in building this application.

So, given a news web page, the very first step is to extract the main news from the page, excluding hyperlinks, advertisements, useless images, etc. My question is: how can I do this? Please point me to some good tutorials that explain how to implement this kind of application using web content mining, or at least give me a hint on how to accomplish it.

dark_shadow

1 Answer

You can use readability or boilerpipe, two open-source tools for this task. For a tutorial, read the code and documentation of those two projects.
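For a concrete starting point, here is a minimal sketch of the boilerpipe approach, following its documented quick-start pattern (the URL is a placeholder, and you will need the boilerpipe jar and its dependencies, such as NekoHTML, on the classpath):

```java
import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class MainTextExtractor {
    public static void main(String[] args) throws Exception {
        // Placeholder URL -- replace with the news page you want to clean.
        URL url = new URL("http://example.com/news/some-article.html");

        // ArticleExtractor is tuned for news articles: it uses shallow
        // text features such as block length and link density to drop
        // navigation, ads, and link lists, keeping the main article text.
        String mainText = ArticleExtractor.INSTANCE.getText(url);

        System.out.println(mainText);
    }
}
```

boilerpipe ships several extractors (DefaultExtractor, KeepEverythingExtractor, and others); ArticleExtractor usually works best on news pages, which matches your use case.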

Spike Gronim
  • I have heard about boilerpipe and it's pretty good, but I want to do it on my own so that I can learn from it. Can you please tell me how I should approach it? What steps should be followed? – dark_shadow Feb 09 '12 at 18:02
  • Search Google Scholar for papers on the subject. Read the code of the existing implementations. Build an evaluation corpus of websites paired with their correct text extractions. Calculate how accurately each extractor works (see the sketch below). Look at the errors, think about how to fix them, and improve the extractor. – Spike Gronim Feb 09 '12 at 18:28
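To make the evaluation step from that comment concrete, here is a minimal sketch of token-level scoring. The class, the method names, and the bag-of-words comparison are my own illustrative choices, not part of boilerpipe or the answer; more sophisticated evaluations align the extracted and gold texts instead of comparing token bags:

```java
import java.util.HashMap;
import java.util.Map;

public class ExtractionScorer {

    // Counts each whitespace-delimited token, lower-cased.
    private static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Token-level F1 between the extractor's output and the gold text:
    // precision penalizes kept noise, recall penalizes dropped content.
    public static double f1(String extracted, String gold) {
        Map<String, Integer> e = tokenCounts(extracted);
        Map<String, Integer> g = tokenCounts(gold);

        int overlap = 0;
        for (Map.Entry<String, Integer> entry : e.entrySet()) {
            overlap += Math.min(entry.getValue(), g.getOrDefault(entry.getKey(), 0));
        }
        int extractedTotal = e.values().stream().mapToInt(Integer::intValue).sum();
        int goldTotal = g.values().stream().mapToInt(Integer::intValue).sum();
        if (extractedTotal == 0 || goldTotal == 0 || overlap == 0) {
            return 0.0;
        }

        double precision = overlap / (double) extractedTotal;
        double recall = overlap / (double) goldTotal;
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Toy example: an extraction that kept the article but also some noise.
        String gold = "the main news story text";
        String extracted = "home login the main news story text advertisement";
        System.out.printf("token F1 = %.2f%n", f1(extracted, gold));
    }
}
```

Averaging this score over your labelled corpus gives one number per extractor, so you can compare your own implementation against boilerpipe and readability and see where yours falls short.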