Article Extraction - Ruby

Question

Is there a way to extract only the main content of a webpage using Ruby, leaving out links and other clutter?

score 2 · Answer 1 · answered Jan 23 '14 at 04:50

For web scraping, use the `mechanize` gem together with `nokogiri` for DOM parsing.

MrPizzaFace


I have used mechanize for scraping, but the boilerpipe library works better for extracting only the article content of a web page in Python. I want to know whether there is a gem similar to boilerpipe. – Mothirajha Jan 23 '14 at 05:00
`Mechanize` is the fastest library for the job, and `nokogiri` lets you scrape just the parts of the page that you want (the article). – MrPizzaFace Jan 23 '14 at 05:09
Is it possible to scrape content from different websites using mechanize and nokogiri without passing CSS selectors or HTML tags? – Mothirajha Jan 23 '14 at 05:56
You can't scrape with any tool without parsing the DOM. You need to tell it what you want to scrape. – MrPizzaFace Jan 23 '14 at 05:59
No, I don't share my personal contact information. If you want help, post your questions here and someone will reply. I need to go soon, so if you want to share the URL I can get you started with the XPath. – MrPizzaFace Jan 23 '14 at 06:18
This is the one: http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/. I need only the content from the overview to the conclusion, nothing else. – Mothirajha Jan 23 '14 at 06:21
OK, everything you want can be scraped using this XPath: `//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))] | //*[contains(concat( " ", @class, " " ), concat( " ", "entry-content", " " ))]`. I'm going to bed now. Good luck. – MrPizzaFace Jan 23 '14 at 06:25
I've ported Boilerpipe to a pure Ruby implementation: https://rubygems.org/gems/boilerpipe-ruby – Gregory Ostermayr Sep 08 '17 at 21:06

score 0 · Answer 2 · answered Jan 23 '14 at 04:55

I would recommend Scrapy. It's Python, not Ruby, but it's impressive how much you can do with very little effort.

S. A.


Thank you, I will go through Scrapy. – Mothirajha Jan 23 '14 at 05:41
