
I have two mongo databases on different machines, host1:27017/db1 and host2:27017/db2, with an identical collection `item` in both databases. How do I copy selected data, say

db1.item.find({"date": { $gte : "2016-03-15" }})

from db1.item to db2.item using the mongo shell? I do not want to clone the collection (because it is huge), but copy just the selected data.

Vikrant Goel
  • Note: I want to move from one host to another, so this related answer might help, but won't work as it is. http://stackoverflow.com/questions/9711529/save-subset-of-mongodb-collection-to-another-collection – Vikrant Goel Mar 25 '16 at 20:50
  • 1
    Just use `mongodump` and `mongorestore`. There is a [`--query`](https://docs.mongodb.org/manual/reference/program/mongodump/#cmdoption--query) option and you can target to collection. Also the process makes it simple to authenticate with target and source. And in modern releases you can use compression for communicating between hosts. – Blakes Seven Mar 25 '16 at 20:57
  • Also your query to `"date"` is using a "string". Is that property actually a "string" or is it really a BSON `Date`? You need the correct type to query. And if they are strings you really should convert them. Less space, more useful. – Blakes Seven Mar 25 '16 at 20:59
  • It's a string, and yes, I agree that if I had the choice I would change it to `Date`. Also, `mongodump` and `mongorestore` is a good idea and will work, although I am trying to find something that wouldn't need so much work, as I might have to do it often. Of course I can automate it, but then I'll be worried about its portability. – Vikrant Goel Mar 25 '16 at 21:02
  • 1
    That's why I say use the external utilities. In the real world you will want to authenticate, and thus it's probably the sanest option. The mongo shell is not meant to be a "coding" environment. So your real choices come down to 1. Script to work with `mongodump` and `mongorestore`. The actual command should be a "one liner" with a "pipe" `|` between the two. 2. Write an actual program with a real language, but even that won't be able to use the data compression without "real low level programming". It's really a no brainer when you consider it. – Blakes Seven Mar 25 '16 at 21:10
  • I could actually find a way to do it through just the shell. Authentication wasn't an issue for me right now, but I don't think it would have affected the solution. – Vikrant Goel Mar 25 '16 at 22:11
  • I didn't say it wasn't possible. Just that it is not the "best" way to do it. Neither is your solution really, since you are basically sending and waiting for confirmation *"one document at a time"*. There are far better approaches in code. But as I said, using the utilities "designed" to do this already is the best option without writing a lot of code. – Blakes Seven Mar 26 '16 at 01:33
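As the comment thread notes, the `"date"` field here is a string rather than a BSON `Date`; the `$gte` query still selects the right documents only because ISO-8601 date strings sort lexicographically in the same order as the dates they name. A minimal illustration (plain JavaScript, no MongoDB required):

```javascript
// ISO-8601 date strings compare lexicographically in the same order
// as the dates they name, so a string $gte selects the right documents.
const dates = ["2016-03-20", "2016-02-28", "2016-03-15"];
const selected = dates.filter(d => d >= "2016-03-15");
console.log(selected); // [ '2016-03-20', '2016-03-15' ]
```

This breaks down for other string formats (e.g. `"03/15/2016"`), which is one more reason to convert the field to a real `Date`.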

2 Answers


So while it is "possible" to use the shell (and no one said it wasn't), it's just not the "best" way.

Dump and Restore with query

The "best" approach is using mongodump and mongorestore. You don't need "temporary dump files" either. It's just a matter of "piping" output from one into the other:

Where you put the -h option depends on which host you actually run this from. To copy from db1 on host1 into db2 on host2, running from host1:

mongodump -d db1 -c item \
   --query '{ "date": { "$gte": "2016-03-15" } }' \
   --out - \
   | mongorestore -h host2 -d db2 -c item -

From the MongoDB 3.2 releases these commands can compress the data in transit as well, using the --gzip and --archive options:

mongodump -d db1 -c item \
   --query '{ "date": { "$gte": "2016-03-15" } }' \
   --gzip --archive \
   | mongorestore -h host2 -d db2 -c item --gzip --archive

That's always the fastest way to move things between databases and especially between hosts.
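Since the comments mention having to do this often and worrying about portability, one option is to generate the pipeline command from a small script. A sketch only: `dumpRestoreCmd` is a hypothetical helper (not part of any MongoDB tooling), the host and database names are the ones from the question, and the quoting is naive (it assumes the query contains no single quotes):

```javascript
// Build the "mongodump | mongorestore" pipeline as one shell string.
function dumpRestoreCmd(srcHost, srcDb, dstHost, dstDb, coll, since) {
  const query = `{ "date": { "$gte": "${since}" } }`;
  return [
    `mongodump -h ${srcHost} -d ${srcDb} -c ${coll}`,
    `--query '${query}' --gzip --archive`,
    `| mongorestore -h ${dstHost} -d ${dstDb} -c ${coll} --gzip --archive`
  ].join(" ");
}

const cmd = dumpRestoreCmd("host1", "db1", "host2", "db2", "item", "2016-03-15");
```

The resulting string can go straight into a cron entry; authentication flags (`-u`, `-p`, `--authenticationDatabase`) would be appended the same way for deployments that need them.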

Using the shell

If you are insistent on writing this in the shell, then you should at least get it right.

Of course you can use the connect() or Mongo() methods to reference the remote connection, but that is really only part of the story: once connected, you still need to handle the writes efficiently.

The best way to do this is to use "Bulk Operations", which removes the overhead of a request-and-acknowledgement round trip for every single .insert() against the target server and collection. It cuts a lot of time, though it is still not as efficient as the utilities above:

Modern MongoDB 3.2 has bulkWrite():

var db2 = connect('host2:27017/db2');
var operations = [];

// Read from the local db1 and write to the remote db2
db.item.find({ "date": { "$gte": "2016-03-15" } }).forEach(function(doc) {
    operations.push({ "insertOne": { "document": doc } });

    // Actually only write every 1000 entries at once
    if ( operations.length === 1000 ) {
        db2.item.bulkWrite(operations, { "ordered": false });
        operations = [];
    }
});

// Write any remaining operations
if ( operations.length > 0 ) {
    db2.item.bulkWrite(operations, { "ordered": false });
}

For MongoDB 2.6 releases there is another "bulk" constructor:

var db2 = connect('host2:27017/db2');
var bulk = db2.item.initializeUnorderedBulkOp();
var count = 0;

// Read from the local db1 and write to the remote db2
db.item.find({ "date": { "$gte": "2016-03-15" } }).forEach(function(doc) {
    bulk.insert(doc);
    count++;

    // Send the batch to the server every 1000 documents
    if ( count % 1000 === 0 ) {
        bulk.execute();
        bulk = db2.item.initializeUnorderedBulkOp();
    }
});

// Commit whatever remains in the last partial batch
if ( count % 1000 !== 0 ) {
    bulk.execute();
}

Of course the newer method just calls the same "older" methods underneath. The main point is consistency with the other APIs: when working with a server older than MongoDB 2.6 that has no "Bulk Operations" wire protocol, the driver "downgrades" the operations and handles the loop and commit of each operation in the batch for you.

In either case the "unordered" approach is best, since the operations are in fact committed on the server in "parallel" instead of "serially", which means multiple things are actually writing at the same time.

Conclusion

So really, all of this is how the code is implemented in the external utilities anyway, just in a more organized and "low level" form. Naturally the "shell" does not compress data "over the wire" when communicating between hosts, nor does it have access to the "low level" write functions that a BSON library and low-level code provide, both of which are much faster.

The "dump and restore" actually can work directly with a compressed BSON form of the data and commits the writes in a very efficient way. By that token, it is your best option for doing this rather than coding the implementation yourself.

Blakes Seven
  • Damn, this was so helpful! +1 – Vikrant Goel Mar 28 '16 at 17:27
  • I went through the mongodump and restore route, and the collections I have are huge, so it's taking a while. The continuous stream of output it's showing is like "2016-03-28T18:22:03.198-0700 db.item 0.0 B". What does the 0.0 B mean here? – Vikrant Goel Mar 29 '16 at 01:24
  • Also, for piping mongodump into mongorestore, I found that if the dump is huge, as in several gigabytes huge, it consumes a lot of memory on the machine (all of it, in my case). – Vikrant Goel Mar 29 '16 at 01:48
  • @VikrantGoel Well it "is" on `STDIN` and `STDOUT` so that should only be expected. For better results you would "ideally" use a remote socket linked as a "file", but it still does not unfortunately read as a "stream". It's really just an example point for the general case. I'll get around to updating since actually the `--gzip` and `--archive` options don't fully work as documented. But that's another "tuit" which will probably take me a few hours to get to doing. Bulk examples work as advertised though. So there's still that :) – Blakes Seven Mar 29 '16 at 02:52

I do feel that mongodump & mongorestore is the more pervasive way to do it. However, I was able to find a way to do it all through the mongo shell (avoiding any temporary dump files), which is what I was looking for.

[user@host1 ~]$ mongo
use db1;

var host2db2 = connect("host2:27017/db2");

db.item.find({
    "date" : { "$gte" : "2016-03-15" }
}).forEach(function(doc) {
    host2db2.item.insert(doc);
});

Credits to : Save Subset of MongoDB Collection to Another Collection

Vikrant Goel