I'm looking for a Ruby gem for my Ruby on Rails project that extracts content from web pages. I found the ruby-readability gem, but it does not support articles that are split across multiple pages. Can you recommend a gem that also supports multi-page article extraction?

Or how can I write code that recognises when an article spans multiple pages?

Thanks

sn3ek

1 Answer

You can use a high-level gem like Pismo in combination with Mechanize to iteratively go through each page and concatenate the body of the article. For that you need to know which link takes you to the next page. Google is pushing for the adoption of a convention based on the rel attribute:

<a href="blog-post?page=2" rel='next'>next</a>

Here's a very rough draft of the Ruby code:

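require 'mechanize'
require 'pismo'

agent = Mechanize.new
agent.get("http://www.awesomeblog.com/amazing-article")

# MyScraper is a placeholder for your own model with a text attribute
scraper = MyScraper.new(:text => Pismo::Document.new(agent.page.uri.to_s).body.to_s)

# Follow rel="next" links until the current page no longer has one
while (next_link = agent.page.links.find { |link| link.rel?("next") })
  next_link.click
  scraper.text << Pismo::Document.new(agent.page.uri.to_s).body.to_s
end

scraper.save!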

This is pseudocode and a wild guess (I don't know the Mechanize API), but you get the general idea.
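One thing the draft above does not guard against is a site whose "next" links loop back on themselves or never end. Here is a minimal sketch of the same page-by-page loop with those guards added. `PageResult`, `collect_article`, and the `max_pages` limit are made-up names for illustration, and the block you pass in stands in for whatever fetches and extracts a page (for example Mechanize plus Pismo); the fake site in the usage part just avoids real HTTP requests.

```ruby
require 'set'
require 'uri'

# Hypothetical page record: the extracted text plus the href of the
# "next page" link (nil on the last page).
PageResult = Struct.new(:text, :next_href)

# Walks a paginated article starting at start_url. The block receives a URL
# and must return a PageResult. Stops when there is no next link, when a URL
# repeats, or when max_pages pages have been collected.
def collect_article(start_url, max_pages: 20)
  visited = Set.new
  parts = []
  url = start_url

  while url && !visited.include?(url) && parts.size < max_pages
    visited << url
    result = yield(url)
    parts << result.text
    # Resolve relative hrefs such as "blog-post?page=2" against the current URL
    url = result.next_href && URI.join(url, result.next_href).to_s
  end

  parts.join("\n\n")
end

# Usage with a fake two-page "site"; page 2 links back to page 1 on purpose
fake_site = {
  "http://example.com/post"        => PageResult.new("Part one.", "post?page=2"),
  "http://example.com/post?page=2" => PageResult.new("Part two.", "post")
}
article = collect_article("http://example.com/post") { |url| fake_site.fetch(url) }
puts article
```

Keeping the fetching in a block means the loop itself can be tested without touching the network, and the scraping library can be swapped later.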

charlysisto
  • pismo is great, better than ruby-readability. But it does not support multi-page articles like [this one](http://arstechnica.com/gadgets/2013/01/work-it-the-ultimate-smartphone-guide-part-v/). It only extracts the current page and nothing more. How does [pocket](http://getpocket.com) manage to extract more than one page of such a multi-page article? – sn3ek Jan 11 '13 at 18:21
  • Yup, Instapaper knows how to do it as well. I'm afraid you'll have to do low-level scraping for that kind of functionality. It depends on the HTML structure of each site. Some conventions are being pushed by Google to better index article content, like adding rel='next'. But it's a jungle out there :-) – charlysisto Jan 11 '13 at 18:31
  • @charlysisto yes I know, but your answer does not answer my question. I know gems like nokogiri or pismo etc. But my question was *how to do it*. – sn3ek Jan 11 '13 at 18:34
  • @sn3ek you're right, I read the question too fast. I'll give it a try, but it'll be a direction to follow, not the whole thing... – charlysisto Jan 11 '13 at 18:38
  • Thanks for your example code. That helped a lot. But it's problematic if there is no rel='next' link. I tried parsing the link structure. It works for some pages, but there are a number of other problems. So for now it does not solve the entire problem. – sn3ek Jan 17 '13 at 13:58
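
For pages without a rel='next' link, one possible direction is a chain of fallbacks: try rel first, then link text that looks like a "next" button, then a link to the same path whose page number is one higher. This is only a sketch of that idea, not a complete solution. `Link`, `find_next_link`, and the assumption that the page number lives in a `page` or `p` query parameter are all made up for illustration; with Mechanize the href, text, and rel values could be read from the entries of `page.links`.

```ruby
require 'uri'

# Hypothetical minimal link record (href, visible text, rel attribute)
Link = Struct.new(:href, :text, :rel)

# "Next", "Next page", optionally followed by an arrow (», ›, → or >),
# or just the arrow on its own
NEXT_TEXT = /\A\s*(next(\s+page)?\s*[\u00BB\u203A\u2192>]*|[\u00BB\u203A\u2192>]+)\s*\z/i

# Resolves href against base; returns nil for hrefs that cannot be parsed
def absolute_url(base, href)
  URI.join(base, href.to_s).to_s
rescue URI::InvalidURIError
  nil
end

# Returns the page number from a "page" or "p" query parameter, or nil
def page_number(url)
  query = URI.parse(url).query
  return nil unless query
  params = URI.decode_www_form(query).to_h
  value = (params["page"] || params["p"]).to_s
  value.match?(/\A\d+\z/) ? value.to_i : nil
rescue URI::InvalidURIError, ArgumentError
  nil
end

def find_next_link(links, current_url)
  # 1. Explicit markup: rel="next"
  by_rel = links.find { |l| l.rel.to_s.downcase.split.include?("next") }
  return by_rel if by_rel

  # 2. Link text that looks like a "next" button
  by_text = links.find { |l| l.text.to_s.match?(NEXT_TEXT) }
  return by_text if by_text

  # 3. Same path, page number exactly one higher (no parameter counts as page 1)
  current_page = page_number(current_url) || 1
  current_path = URI.parse(current_url).path
  links.find do |l|
    target = absolute_url(current_url, l.href)
    next false unless target
    URI.parse(target).path == current_path && page_number(target) == current_page + 1
  end
end

# Usage: numbered pagination without any rel attribute
links = [Link.new("/about", "About", nil),
         Link.new("post?page=2", "2", nil),
         Link.new("post?page=3", "3", nil)]
puts find_next_link(links, "http://example.com/post").href
```

Sites that encode the page in the path (such as /article/2/) or load pages with JavaScript would need additional rules, so expect to extend the fallbacks per site.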