I'm using Anemone to store crawled pages into MongoDB. It mostly works, except for accessing the page headers when I retrieve a page from MongoDB.
When I call collection.find_one("http://stackoverflow.com")
I get the correct object from the data store, but I can't access the headers.
Anemone stores the headers as a hash, so theoretically, after retrieving the document, I should be able to do something like
document["headers"]["content-type"]
but that doesn't work because document["headers"]
is a BSON::Binary.
puts document["headers"]
displays a mixture of text and binary characters.
How can I create a usable Ruby hash object from the binary data that comes back from MongoDB?
EDIT: I haven't solved the original problem, but I was able to modify Anemone so that it loads the data for me, which seems to work:
class NewMongo < Anemone::Storage::MongoDB
  def initialize(mongo_db, collection_name)
    @db = mongo_db
    @collection = @db[collection_name]
    # Do not delete the collection! I need it!
    # @collection.remove
    @collection.create_index 'url'
  end
end
And then later on...
repo = NewMongo.new(db, "pages")
repo.each do |url, page|
  puts page.content_type
end
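For the original question, here is a minimal sketch of one direction that might be worth testing. It assumes (I have not confirmed this) that Anemone serializes the headers hash with Marshal.dump before wrapping it in a BSON::Binary. If that holds, Marshal.load on the raw bytes should give the hash back. The helper name headers_from_binary is hypothetical, and the stored bytes below are a stand-in rather than a real document from MongoDB.

```ruby
# Sketch only: assumes the stored bytes were produced by Marshal.dump,
# which is an assumption about Anemone's storage format, not a confirmed fact.

# Hypothetical helper: turn the raw bytes of a stored field back into a Hash.
def headers_from_binary(raw)
  # Marshal.load is unsafe on untrusted input; only use it on data you wrote yourself.
  Marshal.load(raw.to_s)
end

# Stand-in for what would come back from MongoDB: a marshalled headers hash.
original = { "content-type" => ["text/html; charset=utf-8"] }
stored_bytes = Marshal.dump(original)

headers = headers_from_binary(stored_bytes)
puts headers["content-type"].inspect
```

With a real document, the call would presumably look like headers_from_binary(document["headers"]). Depending on the driver version, the raw bytes of a BSON::Binary may need to be read with to_s or with data, so that part is also an assumption to check.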