
I have a large collection of data that I want to write a script against to read and process; in my case, grabbing some fields and sending them to a RESTful API.

To reduce the load, I wanted to use limit and skip to paginate the data I retrieve and run that in a while loop; however, since it's Node.js, I have to use callbacks.
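
Roughly, this is the pattern I have in mind (heavily simplified; the connection string, collection, and field names are just placeholders):

```js
// Page through the collection with skip/limit. The while loop doesn't
// translate cleanly to callbacks, so it ends up as a recursive function.
const { MongoClient } = require('mongodb');

MongoClient.connect('mongodb://localhost:27017', function (err, client) {
  if (err) throw err;
  const coll = client.db('mydb').collection('items');
  const pageSize = 500;

  function processPage(page) {
    coll.find({}).skip(page * pageSize).limit(pageSize).toArray(function (err, docs) {
      if (err) throw err;
      if (docs.length === 0) return client.close(); // no more pages
      docs.forEach(function (doc) {
        // grab the fields I need and POST them to the REST API here
      });
      processPage(page + 1); // move on to the next page
    });
  }

  processPage(0);
});
```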

What is the best way to handle reading large amounts of data in Node.js/MongoDB without crashing or timing out?

Cole
  • If you can deal with only "moving forward" in the results then a basic approach is explained [here](http://stackoverflow.com/a/31243398/5031275). If you need to "go to" a page then you are stuck with `.skip()` and `.limit()`, or you can build a cache. But the latter has its own cost as well. – Blakes Seven Jul 07 '15 at 02:52

1 Answer


(I am assuming your documents don't need to be processed in any specific order.)

Forget about `skip`, as that is an expensive operation. From the official documentation:

> The `cursor.skip()` method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, `cursor.skip()` will become slower and more CPU intensive. With larger collections, `cursor.skip()` may become IO bound.

Forward paging, as suggested in the answer shared by Blakes Seven, is a good choice. However, the experience may not be very pleasant, since you need to track pagination state across asynchronous calls, and unless your code is short and neat it's easy to lose hours to frustrating debugging.
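
For reference, that forward-paging pattern looks roughly like this (a sketch only; names are placeholders, and it uses the driver's promise API rather than plain callbacks):

```js
// Forward paging: sort by _id and use the last _id seen as the starting
// point of the next query, instead of skip(). Names are placeholders.
const { MongoClient } = require('mongodb');

async function forwardPage() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('mydb').collection('items');
  const batchSize = 500;
  let lastId = null;

  while (true) {
    const query = lastId ? { _id: { $gt: lastId } } : {};
    const docs = await coll.find(query).sort({ _id: 1 }).limit(batchSize).toArray();
    if (docs.length === 0) break;

    for (const doc of docs) {
      // pick out the fields you need and send them to the API here
    }
    lastId = docs[docs.length - 1]._id; // remember where this batch stopped
  }

  await client.close();
}

forwardPage().catch(console.error);
```

This stays fast because the query rides the default `_id` index, so each batch starts where the previous one ended instead of walking past all the skipped documents.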

To keep things flexible and avoid sorting unnecessarily, just pull chunks of a configurable size from the main collection, process them, and dump them into a secondary collection. If your processing time per chunk is high, then instead of writing directly to the secondary collection, store the documents in a temporary collection, process that, and then move the entire temporary collection into the secondary collection (or simply delete the documents if you don't need them; that is what I'd do, after keeping a backup of the primary collection). A sketch of this chunked approach follows the list below.

This approach has further benefits:

  1. More error-resistant, because you don't have to handle page/chunk numbers.
  2. Robust, because even if something goes wrong during an iteration, you don't lose the work done for the prior chunks. You only need to restart the current iteration.
  3. Flexible/scalable, since you can change the chunk size between any two iterations, increasing or decreasing it based on how fast the processing is going. Additionally, you can spread the processing over a long timespan: save the results up to a certain point, then take a break or a vacation, and resume when you return! You can also distribute the load across a number of worker processes to speed things up.
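
Here is a rough sketch of the chunked approach described above (all names are placeholders, and you'd want that backup of the primary collection before running anything like this):

```js
// Pull a configurable chunk from the source collection, process it, copy
// the processed documents into a secondary collection, and remove them
// from the source so the next chunk starts fresh.
const { MongoClient } = require('mongodb');

async function processInChunks(chunkSize) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('mydb');
  const source = db.collection('items');          // primary collection (back it up first!)
  const done = db.collection('items_processed');  // secondary collection

  while (true) {
    const chunk = await source.find({}).limit(chunkSize).toArray();
    if (chunk.length === 0) break;

    for (const doc of chunk) {
      // grab the fields you need and send them to the REST API here
    }

    await done.insertMany(chunk);                                       // keep what was processed
    await source.deleteMany({ _id: { $in: chunk.map(d => d._id) } });   // shrink the source
  }

  await client.close();
}

processInChunks(500).catch(console.error);
```

If a run dies partway through, the documents still left in the source collection are exactly the ones not yet processed, so recovering is just a matter of running the same script again.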

Good luck!

galactocalypse
  • I like both answers, though I'm partial to @BlakesSeven's answer. I would, however, optimize it with a processing function and pass in the last document id that was retrieved, making the call recursive. – Cole Jul 08 '15 at 14:23
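
For reference, the recursive variant described in that comment might look roughly like this (a sketch only; collection and field names are placeholders):

```js
// Each call receives the last _id processed and fetches the next batch
// after it, recursing until the collection is exhausted.
const { MongoClient } = require('mongodb');

MongoClient.connect('mongodb://localhost:27017', function (err, client) {
  if (err) throw err;
  const coll = client.db('mydb').collection('items');

  function processFrom(lastId) {
    const query = lastId ? { _id: { $gt: lastId } } : {};
    coll.find(query).sort({ _id: 1 }).limit(500).toArray(function (err, docs) {
      if (err) throw err;
      if (docs.length === 0) return client.close(); // nothing left to process
      docs.forEach(function (doc) {
        // send the fields you need to the REST API here
      });
      processFrom(docs[docs.length - 1]._id); // recurse with the new last id
    });
  }

  processFrom(null);
});
```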