0

Is there a way to extract only the main content from a web page using Ruby, leaving out links and other clutter?

Mothirajha

2 Answers

2

For web scraping in Ruby, use the `mechanize` gem to fetch pages together with `nokogiri` for DOM parsing.

MrPizzaFace
  • I have used Mechanize for scraping, but in Python the boilerpipe library works better for extracting only the article content of a web page. Is there a gem similar to boilerpipe? – Mothirajha Jan 23 '14 at 05:00
  • `Mechanize` is the fastest library for the job, and `nokogiri` lets you scrape just the parts of the page that you want (the article). – MrPizzaFace Jan 23 '14 at 05:09
  • Is it possible to scrape the content from different websites with Mechanize and Nokogiri without passing CSS selectors or HTML tags? – Mothirajha Jan 23 '14 at 05:56
  • You can't scrape with any tool without parsing the DOM. You need to tell it what you want to scrape. – MrPizzaFace Jan 23 '14 at 05:59
  • No, I don't share my personal contact information. If you want help, post your questions here and someone will reply. I need to go soon, so if you want to share the URL I can get you started with the XPath. – MrPizzaFace Jan 23 '14 at 06:18
  • The page is http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/ and I need only the content from the overview to the conclusion, nothing else. – Mothirajha Jan 23 '14 at 06:21
  • OK, everything that you want can be scraped using this XPath: `//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))] | //*[contains(concat( " ", @class, " " ), concat( " ", "entry-content", " " ))]`. I'm going to bed now. Good luck. – MrPizzaFace Jan 23 '14 at 06:25
  • I've ported Boilerpipe to a pure Ruby implementation: https://rubygems.org/gems/boilerpipe-ruby – Gregory Ostermayr Sep 08 '17 at 21:06
0

I would recommend Scrapy. It's Python, not Ruby, but it's awesome what you can do with very little effort.

S. A.