
I'm using RoR. I will specify a link to a web page in my application, and here is what I want to do:

(1) Extract all the links in the web page.

(2) Check whether they are links to PDF files (basically a pattern match).

(3) Download the file behind each matching link (a PDF, for example) and store it on my system.

I tried using Anemone, but it crawls the entire website, which overshoots my needs. Also, how do I download the files behind the matching links?
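For reference, a minimal sketch of the kind of Anemone call I mean (the URL is a placeholder, and I'm assuming the gem's usual Anemone.crawl / on_every_page block API):

require 'anemone'

# This walks every page reachable from the start URL, which is far more than I need
Anemone.crawl('http://www.example.com/downloads') do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end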

Cheers

theReverseFlick

2 Answers


Have a look at Nokogiri as well.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.thatwebsite.com/downloads'))

doc.css('a').each do |link|
  # Only follow hrefs that look like PDF links (note the escaped dot)
  if link['href'] =~ /\.pdf\z/i
    begin
      # Every match is written to the same placeholder filename here
      File.open('filename_to_save_to.pdf', 'wb') do |file|
        downloaded_file = open(link['href'])
        file.write(downloaded_file.read)
      end
    rescue => ex
      puts "Something went wrong: #{ex.message}"
    end
  end
end

You might want to do some better exception catching, but I think you get the idea :)
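For instance, a slightly more defensive variant might resolve relative hrefs against the page URL, name the local file after the remote one, and rescue the open-uri specific error. A sketch along those lines (same placeholder URL as above; URI.join and OpenURI::HTTPError are standard library):

require 'nokogiri'
require 'open-uri'
require 'uri'

base = 'http://www.thatwebsite.com/downloads'
doc  = Nokogiri::HTML(open(base))

doc.css('a[href]').each do |link|
  next unless link['href'] =~ /\.pdf\z/i

  begin
    url  = URI.join(base, link['href']).to_s    # handles relative links like "files/report.pdf"
    name = File.basename(URI.parse(url).path)   # e.g. "report.pdf"
    File.open(name, 'wb') { |file| file.write(open(url).read) }
  rescue OpenURI::HTTPError => ex
    puts "Could not download #{link['href']}: #{ex.message}"
  end
end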

simonwh

Have you tried scrapi? You can scrape the page with CSS selectors.

Ryan Bates also made a screencast about it.

To download the files, you can use open-uri:

require 'open-uri'

url = "http://example.com/document.pdf"
file = open(url)      # open-uri returns an IO-like object
contents = file.read  # the raw bytes of the PDF
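If you want to keep the file on disk rather than just holding the bytes in memory, write them out in binary mode (a small sketch; the local filename is only an example):

require 'open-uri'

url = "http://example.com/document.pdf"
File.open("document.pdf", "wb") do |file|  # "wb" so the PDF bytes aren't mangled
  file.write(open(url).read)
end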
Tarscher
  • But I have trouble using 'scrapi', I'm using ruby 1.8.7. It says Scraper::Reader::HTMLParseError: Unable to load /Library/Ruby/Gems/1.8/gems/scrapi-1.2.0/lib/tidy/libtidy.dylib – theReverseFlick Feb 04 '11 at 12:41