Is there any way to extract only the content from a webpage using Ruby (leaving out links and other stuff)?
Viewed 482 times
2 Answers
2
To do web scraping, you should use the mechanize gem with nokogiri for DOM parsing.

MrPizzaFace
-
I have used mechanize to scrape, but in Python the boilerpipe library works better for extracting only the article content of a web page. I want to know whether there is a gem similar to boilerpipe. – Mothirajha Jan 23 '14 at 05:00
-
`Mechanize` is the fastest library for the job, and `nokogiri` will allow you to scrape just the parts of the page that you want (the article). – MrPizzaFace Jan 23 '14 at 05:09
-
Is it possible to scrape the content from different websites using mechanize and nokogiri without passing CSS selectors or HTML tags? – Mothirajha Jan 23 '14 at 05:56
-
You can't scrape with any tool without parsing the DOM. You need to tell it what you want to scrape. – MrPizzaFace Jan 23 '14 at 05:59
-
No, I don't share my personal contact information. If you want help, post your questions here and someone will reply. I need to go soon, so if you want to share the URL I can get you started with the XPath. – MrPizzaFace Jan 23 '14 at 06:18
-
This is the one: http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/ I need only the content from the overview to the conclusion, nothing else. – Mothirajha Jan 23 '14 at 06:21
-
OK, everything that you want can be scraped using this XPath: `//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))] | //*[contains(concat( " ", @class, " " ), concat( " ", "entry-content", " " ))]` I'm going to bed now. Good luck. – MrPizzaFace Jan 23 '14 at 06:25
-
I've ported Boilerpipe to a pure Ruby implementation: https://rubygems.org/gems/boilerpipe-ruby – Gregory Ostermayr Sep 08 '17 at 21:06