9

I'm looking for tips on how to improve the database performance in the following situation.

As a sample application, I wrote a fairly simple app today that uses the Twitter streaming API to search for certain keywords, then I am storing the results in MongoDB. The app is written with Node.js.

I'm storing 2 collections. One stores the keyword and an array of tweet IDs referencing each tweet found mentioning that keyword. These are added to the database using .update() with {upsert:true} so that new IDs are appended to the 'ids' array.

A sample document from this collection looks like this:

{ "_id": ObjectId("4e00645ef58a7ad3fc9fd9f9"), "ids": ["id1","id2","id3"], "keyword": "#chocolate" }

Update code:

 keywords.update({keyword: key_word},{$push:{ids: id}},{upsert:true}, function(err){})

The documents in the 2nd collection look like this and are added simply by using .save():

 {
     "twt_id": "id1",
     "tweet": { //big chunk of json that doesn't need to be shown }
 }
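A rough sketch of that save step (the `tweets` collection variable, the function name, and the callback shape are assumptions for illustration; the call mirrors the node-mongodb-native driver's .save()):

```javascript
// Sketch: store each tweet as its own document in the second
// collection, keyed by the tweet's id. `tweets` is assumed to be a
// MongoDB Collection object from the Node.js driver.
function saveTweet(tweets, twtId, tweetJson, callback) {
  tweets.save({ twt_id: twtId, tweet: tweetJson }, callback);
}
```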

I've got this running on my MacBook right now and it's been going for about 2 hours. I'm storing a lot of data, probably several hundred documents per minute. Right now the number of objects in MongoDB is 120k+.

What I'm noticing is that CPU usage for the database process is hitting as high as 84% and has been climbing steadily since I started the latest test run.

I was reading up on setting indexes, but since I'm adding documents rather than running queries against them, I'm not sure whether indexes will help. A side thought that occurred to me: update() might be doing a lookup since I'm using $push, and an index might help with that.

What should I be looking at to keep MongoDB from eating up ever increasing amounts of CPU?

Geuis
  • A MacBook with a slow disk is unlikely the right choice for running benchmarks and talking about performance problems, even with the newest MacBook hardware – Blackmoon Jun 21 '11 at 12:19
  • @Blackmoon The accepted answer suggests otherwise. –  Aug 13 '11 at 22:39

3 Answers

12

It is very likely that you are hitting a very common bottleneck in MongoDB. Since you are updating documents very frequently by appending strings, there is a good chance that you are running out of the space allocated for each document, forcing the database to constantly move that document to a different location in memory/disk by rewriting it at the tail end of the data file.

Adding indexes can only hurt write performance, so they will not help improve things unless your workload is read heavy.

I would consider changing your application logic to do this:

  1. Index on the keyword field
  2. Each time you detect a tweet, before inserting anything into the database, query for the document which contains the keyword. If it does not exist, insert a new document, but pad the ids property with a whole bunch of fake strings. Then immediately after inserting it, remove all of the IDs from that array. This will cause MongoDB to allocate additional room for the entire document, so that when you start adding IDs to the ids field, it will have plenty of room to grow.
  3. Insert the id of the tweet into the ids field
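The steps above could be sketched roughly like this (the padding size of 1000 entries and the helper names are made-up illustrations, not tuned values; the driver calls mirror the update() from the question):

```javascript
// Build an array of throwaway strings whose only purpose is to make
// MongoDB allocate extra space for the document up front.
function makePadding(count) {
  var pad = [];
  for (var i = 0; i < count; i++) {
    pad.push('xxxxxxxxxxxxxxxxxx'); // roughly the size of one tweet id
  }
  return pad;
}

// Sketch: insert a padded document for a brand-new keyword, then empty
// the ids array so later $push updates grow into pre-allocated space.
function insertPaddedKeyword(keywords, keyWord, callback) {
  keywords.insert({ keyword: keyWord, ids: makePadding(1000) }, function (err) {
    if (err) { return callback(err); }
    keywords.update({ keyword: keyWord }, { $set: { ids: [] } }, callback);
  });
}
```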
Bryan Migliorisi
    I would follow the recommendation on the keyword field, and also heed the warnings of document relocation. In modern versions, you can specify the padding factor in the collection, to automatically make more room for each entry. – pestilence669 Apr 10 '13 at 06:33
  • @pestilence669 I thought padding factor is calculated dynamically - you can't set it. Do you know a way? – Michael Spector Nov 18 '14 at 13:49
  • @spektom right, which i was suggesting simulating a custom padding factor with whitespace or some such junk data – pestilence669 Nov 20 '14 at 00:14
9

You're on the right path. The query portion of your update needs an index, or else it is running a table scan. Add an index on keyword and you'll see update performance increase significantly.
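Creating that index is a one-line call; in the Node.js driver of that era it was ensureIndex (the `keywords` collection variable is an assumption carried over from the question):

```javascript
// Sketch: index the keyword field so the upsert's query portion does a
// B-tree lookup instead of scanning every document in the collection.
function addKeywordIndex(keywords, callback) {
  keywords.ensureIndex({ keyword: 1 }, callback);
}
```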

mbarthelemy
Brendan W. McAdams
  • If you plan anything beyond offline full-table-scan processing, you should add all the indexes your searches will need now, so that you measure the practical performance of inserts. – Karoly Horvath Jun 21 '11 at 12:10
  • Thanks Brendan. Took me a couple days to get back to this. I ran the app for a good hour to get the cpu use back up. I stopped the app, added the index, and restarted. Now mongo is using 1.0-1.4% cpu. I have to let it run for a while to see what the long term performance is, but this was a huge benefit. Thanks. – Geuis Jun 25 '11 at 00:19