
We want to query an entire live production MongoDB collection (v2.6, around 500 GB of data across roughly 70M documents).

We're wondering what's the best approach for this:

  1. A single query with no filtering to open a cursor and get documents in batches of 5/6k
  2. Iterate with pagination, using logic along the lines of find().limit(5000).skip(currentIteration * 5000)

We're unsure which approach is best practice and will yield the best results with the minimum impact on performance.

Ron
  • Do you want to do a transform inside MongoDB? – Markus W Mahlberg Jul 17 '16 at 11:05
  • No, we're migrating data from a source DB to a new target DB through a 3rd server for the ETL. We are looking for the most efficient way to load the entire collection without hurting the source's performance – Ron Jul 17 '16 at 11:07
  • Ron, please be more specific. Are you migrating to a new DBMS, or to another instance of MongoDB? In case of the former, please state the DBMS you are migrating to and maybe even why (maybe we can save you a lot of trouble and pita). – Markus W Mahlberg Jul 17 '16 at 11:17
  • @MarkusWMahlberg - we are moving from MongoDB to MongoDB, the writing side is less relevant - we are reading the entire collection from a remote server and we're looking for the method that has as little effect on performance as possible, as this is a live production environment – Ron Jul 17 '16 at 11:48

3 Answers


I would go with a mix of 1. and 2. if possible: iterate over your huge dataset in pages, but access those pages by querying on a range instead of skipping over them, since skipping can be costly, as the docs also point out.

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

So, if possible, build your pages on an indexed field and process those batches of data with a corresponding query range.
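For illustration, a minimal mongo shell sketch of such range-based paging, assuming an ascending index on _id (the collection name and batch size are placeholders, not from the question):

var batchSize = 5000;
var lastId = null;

while (true) {
  // Fetch the next page by range on the indexed field instead of skipping
  var query = (lastId === null) ? {} : { _id: { $gt: lastId } };
  var page = db.yourcoll.find(query).sort({ _id: 1 }).limit(batchSize).toArray();

  if (page.length === 0) break;          // no more documents

  page.forEach(function (doc) {
    // process / ship the document here
  });

  lastId = page[page.length - 1]._id;    // the next page starts after this _id
}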

DAXaholic
    how would I go about adding pagination without using limit & skip? and yes, we are running on an indexed field – Ron Jul 17 '16 at 11:49
  • Well e.g. if your indexed field is the creation timestamp of the documents you could process day by day or hour by hour depending on how big those chunks get. You may also use limit / skip in those dedicated chunks (e.g. when one day is extremely large) because then you will only skip some docs from the beginning of that chunk, not from the whole collection. – DAXaholic Jul 17 '16 at 11:55
  • It seems to really be bad that MongoDB doesn't know where its records end and can't just skip by offset directly. – Anatoly Alekseev May 05 '18 at 09:28

The brutal way

Generally speaking, most drivers load batches of documents anyway. So your language's equivalent of

var docs = db.yourcoll.find()
docs.forEach(
  function(doc){
    //whatever
  }
)

will actually just create a cursor initially, and will then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, but will incur the overhead of multiple queries.
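If you only want to influence how many documents each of those transparent round trips fetches, the cursor's batchSize() method should be enough; a small sketch (collection name and size are placeholders):

// Let the shell/driver handle the paging; just tune how many docs come per batch.
var cursor = db.yourcoll.find().batchSize(5000);
cursor.forEach(function (doc) {
  // process each document; new batches are fetched transparently
});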

As for ETL, manually iterating over the documents to modify them and then storing them in a new instance does not, under most circumstances, seem reasonable to me, as you are basically reinventing the wheel.

Alternate approach

Generally speaking, there is no one-size-fits-all "best" way. The best way is the one that best fits your functional and non-functional requirements.

When doing ETL from MongoDB to MongoDB, I usually proceed as follows:

ET…

Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given that each document is generally processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you have done your transformation, simply use the $out stage to save the results in a new collection.

Even collection-spanning transformations can be achieved using the $lookup stage.
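As a rough illustration only (stage contents, field names and collection names are made up, not your actual transformation), such an in-MongoDB ET step could look like this:

// Transform documents with the aggregation framework and write the result
// to a new collection via $out. All names below are placeholders.
db.yourcoll.aggregate([
  { $match: { status: "active" } },                 // E: select the documents to move
  { $project: { _id: 1, name: 1, createdAt: 1 } },  // T: reshape them as needed
  { $out: "yourcoll_transformed" }                  // write the result to a new collection
])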

…L

After you have done the extract and transform on the old instance, you have several possibilities for loading the data into the new MongoDB instance:

  1. Create a temporary replica set, consisting of the old instance, the new instance and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so your new instance becomes primary, and remove the old instance and the arbiter from the replica set. The advantage is that you leverage MongoDB's replication mechanics to get the data from your old instance to your new instance, without the need to worry about partially executed transfers and such. And you can use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, and then perform your transformations and remove the "old" data.
  2. Use db.cloneCollection(). The advantage here is that you only transfer the collections you need, at the expense of more manual work (a shell sketch for options 2 and 3 follows this list).
  3. Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
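A rough shell sketch for options 2 and 3, run on the new instance (host, port and collection name are placeholders):

// Option 2: pull a single collection from the old server
db.cloneCollection("old-mongo.example.com:27017", "yourcoll")

// Option 3: copy an entire database from the old server into the current database
db.cloneDatabase("old-mongo.example.com:27017")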

As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.

Markus W Mahlberg

MongoDB 3.4 supports parallel collection scan. I have never tried this myself yet, but it looks interesting to me.

This will not work on sharded clusters. If you have a parallel processing setup, this will definitely speed up the scanning.

Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/
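For reference, the command is invoked through runCommand and hands back a set of cursors that workers can then iterate via a driver; a minimal invocation (collection name and cursor count are placeholders) looks roughly like this:

// Ask the server for up to 4 cursors over the collection; each returned cursor
// can then be consumed by a separate worker through the driver.
db.runCommand({
  parallelCollectionScan: "yourcoll",
  numCursors: 4
})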

Vijay Mohan