I'm using Anemone to store crawled pages into MongoDB. It mostly works, except for accessing the page headers when I retrieve a page from MongoDB.

When I call collection.find_one("http://stackoverflow.com") I get the correct object from the data store, but I can't access the headers.

Anemone stores the headers as a hash, so theoretically, after retrieving the document, I should be able to do something like

document["headers"]["content-type"]

but that doesn't work because document["headers"] is a BSON::Binary.

puts document["headers"]

displays a mixture of text and binary characters.

How can I create a usable Ruby hash from the binary data that comes back from MongoDB?

EDIT: I haven't solved the original problem, but was able to modify Anemone so that I can have it load the data for me, which seems to work:

class NewMongo < Anemone::Storage::MongoDB
    def initialize(mongo_db, collection_name)
        @db = mongo_db
        @collection = @db[collection_name]
        #Do not delete the collection! I need it!
        #@collection.remove
        @collection.create_index 'url'
    end
end

And then later on...

repo = NewMongo.new(db, "pages")
repo.each do |url, page|
    puts page.content_type
end
  • Have you dug through the Anemone source to find where it puts the headers into MongoDB? Does the Anemone documentation have anything to say? That might tell you what format it is using at least. – mu is too short May 23 '13 at 21:22
  • Yes, but that hasn't helped any. Looking at https://github.com/chriskite/anemone/blob/next/lib/anemone/page.rb it seems to do `'headers' => Marshal.dump(@headers)` in to_hash. The MongoDB adapter (https://github.com/chriskite/anemone/blob/next/lib/anemone/storage/mongodb.rb) then creates the BSON object: `hash[field] = BSON::Binary.new(hash[field]) unless hash[field].nil?` – Cole Fichter May 23 '13 at 21:30
  • The [`Marshal.dump`](http://ruby-doc.org/core-2.0/Marshal.html#method-c-dump) call is a bit of a give away, no? How do you unpack something that has been packed up with [`Marshal.dump`](http://ruby-doc.org/core-2.0/Marshal.html#method-c-dump)? – mu is too short May 23 '13 at 22:32
  • I suppose that if I weren't a beginner, I would agree with your sarcasm. However, I did try `Marshal.load(...)` which told me "instance of IO needed". And, as with most things ruby, googling for help on that error produced nothing in the way of helpful hints. – Cole Fichter May 24 '13 at 14:06
  • You could try the `data` method on the [`BSON::Binary`](http://rubydoc.info/github/mongodb/bson-ruby/BSON/Binary) object to get the raw data and then feed that to one of the [`Marshal`](http://ruby-doc.org/core-2.0/Marshal.html) methods. I don't have Anemone set up so I can't tell you exactly what to do, all I can do is offer some possibilities based on a bit of research. I must admit that I find Anemone's behavior here very odd when using MongoDB for storage, I don't know why they don't just throw the whole Hash in, MongoDB would be happy with that. – mu is too short May 25 '13 at 00:44
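The suggestion in the comments can be sketched as follows. This is a minimal illustration, not Anemone's or the driver's actual code: `FakeBinary` is a hypothetical stand-in for `BSON::Binary` so the snippet runs without the bson gem, and the header values are made up. It assumes the wrapper exposes the raw bytes through a `data` method, as the comment above describes; depending on the driver version, a different accessor may be needed.

```ruby
# Hypothetical stand-in for BSON::Binary: it only wraps a raw byte string
# and exposes it via `data`. It is neither a String nor an IO.
FakeBinary = Struct.new(:data)

# Illustrative headers hash, serialized the way the comments say
# Anemone does it: Marshal.dump, then wrapped in a binary object.
headers = { "content-type" => ["text/html; charset=utf-8"] }
stored  = FakeBinary.new(Marshal.dump(headers))

# Marshal.load accepts a String or an IO-like object. Handing it the
# wrapper itself raises a TypeError, which matches the error described
# in the comments.
load_error = nil
begin
  Marshal.load(stored)
rescue TypeError => e
  load_error = e
end

# Unwrapping the raw bytes first lets Marshal.load rebuild the hash.
restored = Marshal.load(stored.data)
puts restored["content-type"].first
```

With a real document, the equivalent call would be along the lines of `Marshal.load(document["headers"].data)`. Since `Marshal.load` can instantiate arbitrary objects, it is best used only on data your own crawler wrote.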

1 Answer

If the Anemone storage backend stored the data in a binary format, there isn't much you can do unless you know the format or the library provides a deserializer. That seems like a poor choice for storing the headers, since a hash would be the more natural form for them.

Tyler Brock