I created a collection in MongoDB consisting of 11,446,615 documents.
Each document has the following form:
{
    "_id" : ObjectId("4e03dec7c3c365f574820835"),
    "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
    "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
    "howMany" : 3
}
httpReferer: just a URL
words: words parsed from the URL above. The size of the list is between 15 and 90.
I am planning to use this database to obtain a list of webpages with similar content.
I'll be querying this collection using the words field, so I created (or rather started creating) an index on this field:
db.my_coll.ensureIndex({words: 1})
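With that index in place, a similarity lookup might look like this (just a sketch; I'm assuming here that "similar" means sharing a given set of words, and the word list is made up):

// find pages whose word list contains all of the given words;
// $all can use the multikey index on the words field
db.my_coll.find({words: {$all: ["SEX", "DRUGS"]}})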
Creating this collection takes a very long time. I tried two approaches (the tests below were done on my laptop):
- Inserting and then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of the data; indexing took 30 hours (see the variant sketched after this list).
- Indexing before inserting: it would take a few days to insert all the data into the collection.
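For the first approach, one variant I could try (a sketch; I'm assuming background index builds are available, which the docs say they are since MongoDB 1.4) is to insert everything first and then build the index in the background, so the database stays responsive while it builds:

// build the index without blocking other operations;
// a background build is slower overall but doesn't lock the database
db.my_coll.ensureIndex({words: 1}, {background: true})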
My main focus is to decrease the time needed to generate the collection. I don't need replication (at least for now), and querying doesn't have to be lightning-fast.
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?