Locally, you have almost no network overhead: no interference, no routers, no switches, no limited bandwidth. Plus, in most situations your mass storage, be it SSD or HDD, more or less idles around (unless you tend to play games while developing). So when an operation requiring a lot of IO capacity kicks in, that capacity is available.
When you run your script from your local shell against a server, here is what happens.
db.col1.find().forEach
The whole collection is going to be read from an unknown medium (most likely HDDs, whose available IO may well be shared among many instances). The documents will then be transferred to your local shell. Compared to a connection to localhost, each document retrieval is routed over a number of network hops, each adding a small amount of latency. Over presumably quite a few documents, this adds up. Don't forget that each complete document is sent over the network, since you did not use a projection to limit the fields returned to attr1 and attr2. External bandwidth is, of course, slower than a connection to localhost.
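If you want a rough feeling for how much data travels per document, you can compare the BSON size of a full document against the projected version in the shell (a small sketch, using the collection and field names from above):
// Size of a complete document in bytes
var full = db.col1.findOne()
print("Full document: " + Object.bsonsize(full) + " bytes")
// Size of only the fields you actually need
var projected = db.col1.findOne({}, {"_id": 0, "attr1": 1, "attr2": 1})
print("Projected document: " + Object.bsonsize(projected) + " bytes")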
db.col2.findAndModify
For each document returned by the find, a separate query is sent to the server. Again, the shared IO might well kill performance.
{ query: {attr1: doc.attr1}, update: { $push: {attr2: doc.attr2} }, upsert: true }
Are you sure attr1 is indexed, by the way? And even if it is, it is not certain whether the index is currently in RAM. We are talking about a shared instance, right? And it might well be that your write operations have to wait before they are even processed by mongod: per the default write concern, the data has to be successfully applied to the in-memory data set before it is acknowledged, and if gazillions of operations are sent to the shared instance, yours might well be number one bazillion and one in the queue. Furthermore, the network latency is added a second time, since the values transferred to your local shell need to be sent back.
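By the way, you can check whether the query part of your update is able to use an index at all by running it through explain() once the index exists (a sketch; the sample value is just a placeholder):
// "IXSCAN" in the winning plan means the index is used,
// "COLLSCAN" means the whole collection is scanned
db.col2.find({"attr1": "someSampleValue"}).explain("executionStats")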
What you can do
First of all, make sure that you limit the returned values to those you need using a projection:
db.col1.find({},{ "_id":0, "attr1":1, "attr2":1 })
Make sure you have attr1 indexed:
db.col2.ensureIndex( { "attr1":1 } )
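You can verify which indices exist on the collection with:
db.col2.getIndexes()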
Use bulk operations. They are executed much faster, at the expense of reduced feedback in case of problems.
// We can use unordered here, because the operations
// each apply to only a single document
var bulk = db.col2.initializeUnorderedBulkOp()
// A counter we will use for intermediate commits
// We do the intermediate commits in order to keep RAM usage low
var counter = 0
// We limit the result to the values we need
db.col1.find({}, {"_id": 0, "attr1": 1, "attr2": 1}).forEach(
  function(doc) {
    // Find the matching document, update exactly that,
    // and if it does not exist, create it
    bulk
      .find({"attr1": doc.attr1})
      .upsert()
      .updateOne({ $push: {"attr2": doc.attr2} })
    counter++
    // We have queued 1k operations and can commit them
    // MongoDB would split the bulk ops into batches of 1k operations anyway
    if (counter % 1000 == 0) {
      bulk.execute()
      print("Operations committed: " + counter)
      // Initialize a new batch of operations
      bulk = db.col2.initializeUnorderedBulkOp()
    }
  }
)
// Execute the remaining operations not committed yet
// (skip if the last batch was already committed inside the loop)
if (counter % 1000 != 0) {
  bulk.execute()
  print("Operations committed: " + counter)
}