
I am preparing to write my first web crawler, and it looks like Anemone makes the most sense. There is built-in support for MongoDB storage, and I am already using MongoDB via Mongoid in my Rails application. My goal is to store the crawled results and then access them later via Rails. I have a couple of concerns:

1) At the end of this page, it says: "Note: Every storage engine will clear out existing Anemone data before beginning a new crawl." I would expect this to happen at the end of the crawl if I were using the default memory storage, but shouldn't the records be persisted to MongoDB indefinitely so that duplicate pages are not crawled the next time the task is run? If they are wiped "before beginning a new crawl", should I just run my Rails logic before the next crawl? If so, I would end up having to check for duplicate records from the previous crawl.

2) This is the first time I have really thought about using MongoDB outside the context of Rails models. It looks like the records are created using the Page class, so can I later just query these as I normally would using Mongoid? I guess it is just considered a "model" once it has an ORM providing the fancy methods?

Micah Alcorn

1 Answer


Great questions.

1) It depends on what your goal is.

In most cases this default makes sense: you do a crawl with Anemone and then examine the data.

When you do a new crawl, the old data should be erased so that the data from the new crawl can replace it.

You could point the storage engine at a new collection before starting the new crawl if you don't want that to happen.
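For example, a rough sketch using a per-crawl collection (the database and collection names here are made up, and the exact arguments Anemone::Storage.MongoDB accepts may vary by Anemone version):

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Placeholder names: store each crawl in its own collection so that
  # clearing "existing Anemone data" doesn't touch earlier crawls.
  anemone.storage = Anemone::Storage.MongoDB('crawler', "pages_#{Time.now.strftime('%Y%m%d')}")

  anemone.on_every_page do |page|
    puts page.url
  end
end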

2) Mongoid won't create the model classes for you.

You need to define the models yourself so that Mongoid knows which class maps to the collection, and optionally declare the fields each document has so that you get the dot accessor methods out of the box.

Something like:

class Page
  include Mongoid::Document
  # Types are a guess -- check what kind of documents Anemone actually produces
  field :url, type: String
  field :aliases, type: Array
  # ... plus fields such as :headers, :code, :body, :links
end

It will probably need to include the following fields:

  • url - The URL of the page
  • aliases - Other URLs that redirected to this page, or the Page that this one redirects to
  • headers - The full HTTP response headers
  • code - The HTTP response code (e.g. 200, 301, 404)
  • body - The raw HTTP response body
  • doc - A Nokogiri::HTML::Document of the page body (if applicable)
  • links - An Array of all the URLs found on the page that point to the same domain

But please just take a look at what type (string, array, whatever) the storage engine is storing them as and don't make assumptions.
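Once a model like the Page class above is defined, you query it like any other Mongoid model. A hypothetical example, assuming the fields sketched above and that Mongoid's default collection name ("pages" for a Page class) lines up with the collection Anemone wrote to:

# Hypothetical queries -- adjust to the field names/types actually stored
ok_pages = Page.where(code: 200)
ok_pages.each { |page| puts page.url }

# Count crawled pages whose body mentions "mongodb"
Page.where(body: /mongodb/i).count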

Good luck!

Tyler Brock
  • "You could point the storage engine at a new collection before starting the new crawl if you don't want that to happen." How is this done? – sunnyrjuneja Feb 24 '12 at 19:56
  • 2
    You can pass a database and collection name into the storage when you initialize it: Anemone::Storage.MongoDB('db_name', 'collection_name') – Tyler Brock Feb 24 '12 at 23:35