
I have a large dump file created by the mongodump utility, for example "test.dump". I want to extract one specific collection from this dump and manually read it into memory for further processing as valid BSON documents. I cannot load the full dump into memory due to its size.

I do not need to physically restore anything to a mongo instance! I basically don't even have one up and running. So the mongorestore utility could be a solution only if it can help me read my collection from a dump file into memory.

I'm using Python 3 and pymongo, but I can import other third-party libs if necessary, or launch any CLI utility and read its stdout.

Nikolay Prokopyev

3 Answers


The mongodump files are just BSON documents concatenated back to back, representing the documents of a collection.

import gzip, bson # bson package is from the pymongo library

with gzip.open('dump/test/hello.bson.gz', mode='rb') as f:
    for doc in bson.decode_file_iter(f):
        print(doc)

Documentation of bson.decode_file_iter(): https://pymongo.readthedocs.io/en/stable/api/bson/index.html#bson.decode_file_iter

Messa

I am not aware of any off-the-shelf tool that would extract a single collection out of a dump file. That said:

  • AWS offers the x1e.32xlarge instance type with almost 4 TB of memory. How big is your dump exactly?
  • Surely the easiest solution is to just load the dump into a MongoDB deployment (which doesn't need much memory or other resources if you are only going to dump one collection back out). Hardware is very cheap these days.
  • The BSON format is not that complicated. I expect you'd need to write the tooling for this yourself, but if the dump is in fact valid BSON you can traverse it manually using the BSON reading code that is part of every MongoDB driver.
D. SM
  • Yeah, I have been trying to traverse it, but with no success yet. There is something going on with the --archive and --gzip arguments of mongodump, so my binary data is not valid BSON. I tried to find a specification for the output of mongodump, but found nothing. About hard-drive space - you're right, that's not an issue. But I need a repeatable solution for any collection, given just the dumps and a Python script (or other tools). I could do it once by hand with mongo + mongorestore and copy-paste, literally. – Nikolay Prokopyev Apr 28 '20 at 11:48
    I don't know of a specification either; your best bet is probably to read the source. – D. SM Apr 28 '20 at 18:41

Use the --nsInclude flag on mongorestore to restore only the one collection you are interested in, e.g.

mongorestore --nsInclude=<DatabaseName>.<CollectionName>
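Since the question's dump was made with --archive and --gzip, mongorestore would need matching flags. A sketch of the invocation, with placeholder database and collection names, against a running mongod:

```shell
# Sketch only: restore a single collection from a gzipped,
# archive-format dump ("mydb.mycoll" is a placeholder namespace).
mongorestore --archive=test.dump --gzip --nsInclude='mydb.mycoll'
```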
Belly Buster
  • I'm familiar with --nsInclude, but I need to get the data into memory or at least to stdout. mongorestore with --nsInclude will apply the data to an actual mongo instance, which I do not need. I have a workaround that launches a mongo instance, but it's very inconvenient for me. – Nikolay Prokopyev Apr 28 '20 at 11:39
  • To avoid insanity just launch a mongo instance. It's one command with docker. – Belly Buster Apr 28 '20 at 12:09