
I have a large MongoDB collection containing more than 2GB of raw data, and I use a very simple query to fetch a specific document from the collection by its id. Document sizes currently range from 10KB to 4MB, and the id field is indexed.

This is the query I'm using (with the mongojs module):

db.collection('categories').find({ id: category_id },
    function(err, docs) {
        callback(err, docs.length ? docs[0] : false);
    }).limit(1);

When I execute this query in the MongoDB shell or in a GUI such as Robomongo it takes approximately 1ms to fetch the document, no matter what its physical size, but when I execute the exact same query from NodeJS the response time ranges from 2ms to more than 2s depending on the amount of data. I only measure the time it takes to receive the response, and even in cases where NodeJS waits for more than 500ms, the MongoDB profiler (.explain()) shows it took only a single millisecond to execute the query.
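For reference, this is roughly how I time the round trip (simplified; the timing wrapper exists only for measurement):

var start = Date.now();
db.collection('categories').find({ id: category_id },
    function(err, docs) {
        // Wall-clock time from issuing the query to receiving the
        // fully deserialized document(s) on the NodeJS side.
        console.log('elapsed:', Date.now() - start, 'ms');
        callback(err, docs.length ? docs[0] : false);
    });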

Now, I'm probably doing something wrong but I can't figure out what it is. I'm rather new to NodeJS, but I have past experience with MongoDB and PHP and I never encountered such performance issues, so I tend to think I'm probably abusing NodeJS in some way.

I also tried profiling with SpyJS in WebStorm. I saw there are a lot of bson.deserialize calls which quickly add up into a large stack, but I couldn't investigate further because SpyJS always crashes at that point. It's probably related, but I still have no idea how to deal with it.

Please advise, any leads will be appreciated.

Edit: This is the result of db.categories.getIndexes():

[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "my_db.categories"
    },
    {
        "v" : 1,
        "key" : {
            "id" : 1
        },
        "name" : "id_1",
        "ns" : "my_db.categories"
    }
]

I also tried using findOne, which made no difference:

db.collection('categories').findOne({ id: category_id },
    function(err, doc) {
        callback(err, doc || false);
    });
iMoses
  • Any indexes created on the `id` field? – chridam Apr 14 '15 at 14:56
  • Only the default index on `_id` and an index I added on `id`, which is a numerical representation of the category id. – iMoses Apr 14 '15 at 15:01
  • Something to try: if you filter out specific fields in your data, does it suddenly load fast? Maybe there's some strange conversion node is choking on (see the projection sketch after these comments). – Will Shaver Apr 14 '15 at 16:00
  • Is your document super-nested? – Will Shaver Apr 14 '15 at 16:01
  • It's quite nested, several depths. I like your idea, I'll give it a go. – iMoses Apr 14 '15 at 16:56
  • Heavily nested fields are usually a bad idea. They don't scale well and most of the time can easily be flattened. They tend to be a sign of overnormalization, too. – Markus W Mahlberg Apr 14 '15 at 18:35
  • The idea behind the structure was to get all related data in a single call, so the document is parsed exactly as the frontend expects it to be. I think the problem here is not necessarily the structure but the size. Even if I do split the document into smaller chunks, the final result can still weigh 4MB or more. I'm starting to consider the possibility that NodeJS has problems dealing with such large data without creating a bottleneck. Maybe a different approach is required. – iMoses Apr 14 '15 at 19:28
  • @WillShaver There's an array in my document which contains the category products. A product is a simple object, but a category can have more than 2000 products and it needs to support even larger amounts of data. This easily adds up to most of the document's size, up to 98% of the entire document. Even if I do separate it into another collection and merge the data at run-time, it's still too much data and the downsides stay the same. – iMoses Apr 14 '15 at 19:32
  • Can you reproduce the problem by adding one document to another collection and retrieving by `_id`? If so, this sounds like an issue worth [reporting to the driver team](https://jira.mongodb.org/browse/NODE/). Include the reproducible case. – wdberkeley Apr 17 '15 at 14:36
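A quick way to test the field-filtering idea from the comments is to exclude the heavy part of the document with a projection (a sketch; the products field name is an assumption based on the comments):

db.collection('categories').findOne(
    { id: category_id },
    { products: 0 },   // projection: skip the large products array (assumed field name)
    function(err, doc) {
        // If this returns quickly, BSON deserialization of the large array
        // on the NodeJS side is the likely bottleneck.
        callback(err, doc || false);
    });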

2 Answers


My guess is that the .limit(1) is ignored because the callback is provided early. Once find sees a callback it's going to execute the query, and only after the query has been sent to mongo will the .limit modifier try to adjust the query, but by then it's too late. Recode as such and see if that solves it:

// Chain .limit(1) before requesting results so it is applied to the query.
db.collection('categories').find({ id: category_id }).limit(1).toArray(
    function(err, docs) {
        callback(err, docs.length ? docs[0] : false);
    });
Peter Lyons
  • It doesn't seem to make any difference at all. I also tried `findOne` which should have the same effect. The query is very fast, as is, outside of NodeJS, so I'd guess the problem is somewhere inside NodeJS. – iMoses Apr 14 '15 at 15:41

Most likely you'll need a combination of normalized and denormalized data in your object. Sending 4MB across the wire at a time seems pretty heavy, and it will likely cause problems for any browser that has to parse that much data.

Most likely you should store the top 100 products, the first page of products, or some smaller subset that makes sense for your application in the category document. This may be the top alphabetically, the most popular, the newest, or some other app-specific metric you determine.

When you go about editing a category, use $push with $slice to ensure you avoid unbounded array growth.
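A capped-array update along these lines (a sketch; the products field name and the cap of 100 are assumptions):

db.collection('categories').update(
    { id: category_id },
    { $push: { products: {
        $each: [ newProduct ],   // the product being added
        $slice: -100             // keep only the 100 most recently pushed entries
    } } },
    function(err) {
        // handle err
    });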

Then, when you actually page through the results, you'll do a separate query against the individual products collection, filtered by category. (Index that field.)
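Something along these lines (a sketch; the separate products collection and its category_id field are assumptions):

// Index the category reference so paging queries stay fast.
db.collection('products').ensureIndex({ category_id: 1 });

// Fetch one page of products for the category.
db.collection('products')
    .find({ category_id: category_id })
    .skip(page * page_size)
    .limit(page_size)
    .toArray(function(err, products) {
        callback(err, products);
    });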

I've written about this before here: https://stackoverflow.com/a/27286612/68567

Will Shaver
  • Truth be told, I already took care of the browser. 4MB gzipped is only ~300KB and I have a way of compressing it even more. The browser is not the problem here. TTFB is longer than it takes to download the response, by far! After profiling a bit I found that a 1ms query can sometimes take up to 2 seconds from the moment I initiate the query to the moment I get a response from the MongoDB module. Before I consider different solutions to my problem I wish to understand why it's happening, and what causes this sort of delay specifically on NodeJS. – iMoses Apr 15 '15 at 04:54