I'm trying to scrape concert data from a bunch of different websites. I've written several scripts, each of which scrapes data from a particular website and organizes it into a hash with a predictable structure. Each of these scripts has the function scrape to accomplish this. I have all of these scraper scripts in a directory. I want to then write a master script which, for each script in the directory, calls that script's scrape and adds the data into the database.

I plan on making this master script a Resque worker so that the app scrapes each site in the background once a day.

How do I accomplish this in the master script? Right now I go through the directory like so:

# The glob pattern must be a quoted string; the path here is relative to the app root
Dir.glob("app/workers/scraped_venues/*.rb") do |venue_scraper|
  # call that script's `scrape` function
  # use data from that `scrape` call
end
Eric

1 Answer

I'd make a rake task that does the scraping:

Rake tutorial: http://jasonseifer.com/2010/04/06/rake-tutorial

and use the whenever gem to run the scraping each day. Should be painfully easy to figure out from docs:

https://github.com/javan/whenever

How to run things in the background in rails: Ruby on Rails: How to run things in the background?
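As a rough illustration of the rake-task suggestion, here is a minimal sketch. The task name `scrape:venues` and the helper `scrape_all_venues` are made up for this example, and the helper only lists the scraper files rather than doing any real scraping. In a Rails app the file would typically live in lib/tasks (where `require "rake"` is not needed) and the task would depend on `:environment`.

```ruby
require "rake"

# Hypothetical placeholder for the real work: load the scrapers, call
# `scrape` on each one, and write the results to the database.
# Here it only returns the list of scraper files it finds.
def scrape_all_venues
  Dir.glob("app/workers/scraped_venues/*.rb").sort
end

namespace :scrape do
  desc "Run every venue scraper"
  # In a Rails app this would usually be `task venues: :environment`
  # so that models are loaded before the task body runs.
  task :venues do
    files = scrape_all_venues
    puts "Found #{files.size} scraper file(s)"
  end
end
```

With something like this in place, `rake scrape:venues` runs the task from the command line. With whenever, a `config/schedule.rb` entry along the lines of `every 1.day do rake "scrape:venues" end` should schedule it daily; on Heroku the scheduler add-on can run the same rake command instead.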

AJcodez
  • Thanks for your response. Would I put the rake task in Rakefile? And I'm assuming the task should still call a Resque worker so that the task is done in the background? Also, just FYI, Whenever is a great gem but doesn't play so nice with Heroku, the scheduler add-on is much easier. – Eric Jan 03 '13 at 21:31
  • Also! Forgot the main issue for me. As I don't know in advance what venues I'm scraping, I won't know the names of the classes held in each scraper script. So how do I handle this? E.g. a script might be CBGB_scraper.rb that holds the scrape function inside a CBGBScraper class, but my rake task won't know that. – Eric Jan 03 '13 at 21:33
  • ^ i feel like there might be an easier way to organize that? and you put rake tasks in lib/tasks – AJcodez Jan 03 '13 at 23:47
  • Thanks for bearing with me. What if each of the scripts doesn't actually hold a class, but is just a script (no functions)? – Eric Jan 04 '13 at 04:07
  • are you asking how to run ruby code? You could put it in a module and call it, reopening the module for each site's specific parser; you could run the files from within ruby using backticks, the equivalent of `$ ruby path/ruby_file.rb`; you could require each script and use some metaprogramming; you could put it all in a gem and update the gem when needed... it depends – AJcodez Jan 04 '13 at 07:12
  • thanks AJcodez, still having trouble but I'll accept your answer for now at least. much appreciated – Eric Jan 05 '13 at 22:09
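
One possible way to handle the "I don't know the class names in advance" problem raised in the comments is to have each scraper register itself when its file is loaded. The sketch below is only one interpretation of the "require each script and use some metaprogramming" idea. The base class `VenueScraper` and the method `run_all_scrapers` are hypothetical names, and it assumes every scraper file defines a subclass with an instance method `scrape`.

```ruby
# Hypothetical base class: every scraper subclasses it, and Ruby's
# `inherited` hook records each subclass as soon as its file is loaded.
class VenueScraper
  def self.registry
    @registry ||= []
  end

  def self.inherited(subclass)
    super
    VenueScraper.registry << subclass
  end
end

# Example scraper file, e.g. app/workers/scraped_venues/cbgb_scraper.rb:
#
#   class CBGBScraper < VenueScraper
#     def scrape
#       { venue: "CBGB" } # the real hash of concert data goes here
#     end
#   end

# Require every scraper file in `dir`, then call `scrape` on an instance
# of each registered class. Returns a hash that maps each class name to
# whatever that scraper's `scrape` returned.
def run_all_scrapers(dir)
  Dir.glob(File.join(dir, "*.rb")).sort.each do |path|
    require File.expand_path(path)
  end

  VenueScraper.registry.each_with_object({}) do |klass, results|
    results[klass.name] = klass.new.scrape
  end
end
```

The master script (or rake task, or Resque worker) then only needs `run_all_scrapers("app/workers/scraped_venues")` and can write each returned hash to the database, without ever naming an individual scraper class. Adding a venue becomes a matter of dropping a new file into the directory.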