
I have a shell script that creates a cursor on a collection and then updates each document with data from another collection.
When I run it on a local db it finishes in about 15 sec, but on a hosted database, it runs for over 45 min.

db.col1.find().forEach(function(doc) {
    db.col2.findAndModify(
        {
            query: {attr1: doc.attr1},
            update: { $push: {attr2: doc.attr2} },
            upsert: true
        });
});

So there is obviously network overhead between the client and the server when processing this script. Is there a way to keep all the processing server-side? I have looked at server-side JavaScript, but from what I read here, it is not a recommended practice.

SteveO7
  • Are you sure it's network? Have you looked at the traffic? Are both databases identical (especially, is the indexing the same)? – womp Oct 26 '15 at 16:35
  • The DBs are the same; my testing was done after a dump from the hosted db and a restore to local. Indexes are identical. I have watched the traffic in OSX Activity Monitor and there is a significant amount, enough that I assume the cursor is bringing all the data across the network. – SteveO7 Oct 26 '15 at 17:12

1 Answer


Locally, you have almost no network overhead. No interference, no routers, no switches, no limited bandwidth. Plus, in most situations your mass storage, be it SSD or HDD, more or less idles around (unless you tend to play games while developing), so when an operation requiring a lot of IO kicks in, that capacity is available.

When you run your script from your local shell against a server, here is what happens:

  1. db.col1.find().forEach The whole collection is going to be read from an unknown medium (most likely HDDs whose available IO may well be shared among many instances). The documents are then transferred to your local shell. Compared to a connection to localhost, each document retrieval is routed over dozens of hops, each adding a tiny amount of latency. Over presumably quite a few documents, this adds up. Don't forget that the complete document is sent over the network, since you did not use a projection to limit the fields returned to attr1 and attr2. External bandwidth is, of course, slower than a connection to localhost.
  2. db.col2.findAndModify For each document read, a separate query (and write) is done against col2. Again, the shared IO might well kill performance.
  3. { query: {attr1: doc.attr1}, update: { $push: {attr2: doc.attr2} }, upsert: true } Are you sure attr1 is indexed, by the way? (A quick way to check is sketched right after this list.) And even when it is, it is not certain whether the index is currently in RAM. We are talking about a shared instance, right? It might also well be that your write operations have to wait before they are even processed by mongod: per the default write concern, the data has to be successfully applied to the in-memory data set before it is acknowledged, and if gazillions of operations are sent to the shared instance, yours might well be number one bazillion and one in the queue. And the network latency is added a second time, since the values transferred to your local shell need to be sent back.
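
To check the index question from point 3 rather than assume, something like the following should work in a MongoDB 3.0+ shell (collection and field names taken from the question; the query value is only a placeholder):

    // List the indexes that actually exist on col2
    db.col2.getIndexes()

    // See whether a lookup on attr1 uses an index (IXSCAN)
    // or falls back to a full collection scan (COLLSCAN)
    db.col2.find({ "attr1": "someValue" }).explain("executionStats")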

What you can do

First of all, make sure that you:

  • limit the returned values to those you need using projection:

    db.col1.find({},{ "_id":0, "attr1":1, "attr2":1 })
    
  • have attr1 indexed on col2:

    db.col2.ensureIndex( { "attr1":1 } )
    
  • use bulk operations. They are executed much faster, at the expense of reduced feedback in case of problems.

    // We can use unordered here, because the operations
    // each apply to only a single document
    var bulk = db.col2.initializeUnorderedBulkOp()
    
    // A counter we will use for intermediate commits
    // We do the intermediate commits in order to keep RAM usage low
    var counter = 0
    
    // We limit the result to the values we need
    db.col1.find({}, {"_id":0, "attr1":1, "attr2":1 }).forEach(
      function(doc){
    
        // Find the matching document
        // Update exactly that
        // and if it does not exist, create it
        bulk
          .find({"attr1": doc.attr1})
          .upsert()
          .updateOne({ $push: {"attr2": doc.attr2} })
    
        counter++
    
        // We have queued 1k operations and can commit them
        // MongoDB would split the bulk ops in batches of 1k operations anyway
        if( counter%1000 == 0 ){
          bulk.execute()
          print("Operations committed: "+counter)
          // Initialize a new batch of operations
          bulk = db.col2.initializeUnorderedBulkOp()
        }
    
      }
    )
    // Execute the remaining operations not committed yet.
    // Guard against calling execute() on an empty bulk, which happens
    // when the last iteration committed exactly at the 1k mark.
    if( counter%1000 != 0 ){
      bulk.execute()
    }
    print("Operations committed: "+counter)
    
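As a side note on the write-concern point above: Bulk.execute() optionally accepts a write concern document, so each intermediate commit can wait for stronger acknowledgement when you need it. A minimal sketch, assuming the hosted instance is a replica set (the values are examples only):

    // Wait for a majority of the replica set members to acknowledge
    // each committed batch, with a 5 second timeout
    bulk.execute({ w: "majority", wtimeout: 5000 })
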
Markus W Mahlberg
  • Thanks for the thorough answer Markus, especially for introducing me to bulk. I was not aware of the Bulk methods, but I am reading about them now. I guess the short answer to "can you force server side processing" is "no", but there are mitigations. – SteveO7 Oct 27 '15 at 01:30
  • @SteveO7 Sort of. There are things you can do server-side using [the aggregation framework](https://docs.mongodb.org/manual/aggregation/), but merging values of two different collections isn't one of them. – Markus W Mahlberg Oct 27 '15 at 05:30
  • In case it will help someone else, the Bulk methods are not available in Robomongo 0.8.5, you need to use a standard shell. It is supposed to be added in the upcoming v9. Also for comparison, my bulk process runs in 48 seconds compared to over 45 minutes without. – SteveO7 Oct 27 '15 at 17:35
  • @SteveO7 Keep in mind that this comes with a drastically reduced fault tolerance and much less granular error reporting. – Markus W Mahlberg Oct 27 '15 at 18:23
  • Understood, in this case, my process is idempotent and the affected data is discardable. Would not be good for financial transactions! – SteveO7 Oct 27 '15 at 19:24