(I am assuming your documents don't need to be processed in any specific order.)
Forget about skip, as it is an expensive operation. From the official documentation:
> The cursor.skip() method is often expensive because it requires the
> server to walk from the beginning of the collection or index to get
> the offset or skip position before beginning to return results. As the
> offset (e.g. pageNumber above) increases, cursor.skip() will become
> slower and more CPU intensive. With larger collections, cursor.skip()
> may become IO bound.
Forward paging, as suggested in the answer shared by Blakes Seven, is a good choice. However, it can be unpleasant in practice: you need to track the pagination state across asynchronous calls, and unless your code is short and neat, it is easy to lose hours to frustrating debugging.
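For reference, forward paging usually means remembering the last _id you saw and asking for documents after it, rather than skipping. Below is a minimal synchronous sketch, assuming a PyMongo-style collection object; the function name forward_page and the handle callback are hypothetical names chosen for illustration, not something from the linked answer.

```python
from typing import Any, Callable, List


def forward_page(collection: Any, page_size: int,
                 handle: Callable[[List[dict]], None]) -> int:
    """Walk a collection in _id order without skip(); return documents seen.

    `collection` is assumed to behave like a pymongo Collection
    (find(...).sort(...).limit(...)). `handle` is a hypothetical callback.
    """
    if page_size <= 0:
        # limit(0) means "no limit" in MongoDB, so reject it explicitly.
        raise ValueError("page_size must be positive")

    last_id = None
    seen = 0
    while True:
        # First page is unfiltered; later pages resume after the last _id.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        page = list(collection.find(query).sort("_id", 1).limit(page_size))
        if not page:
            return seen
        handle(page)
        seen += len(page)
        # This is the pagination state you have to carry between requests.
        last_id = page[-1]["_id"]
```

Note that this does require sorting on _id, and the last_id bookkeeping is exactly the state that gets awkward to manage once the calls become asynchronous.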
To keep things flexible and avoid unnecessary sorting, take chunks of a configurable size out of the main collection, process them, and dump them into a secondary collection. If your processing time per chunk is high, then instead of writing directly to the secondary collection, first move the documents into a temporary collection, process them there, and then dump the entire temporary collection into the secondary collection (or just delete the documents if you don't need them; this is what I'd do, though only after making a backup of the primary collection).
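The simpler variant of this (process each chunk and write it straight to the secondary collection) might look roughly like the sketch below. It assumes PyMongo-style collection objects; drain_in_chunks, process, and the parameter names are hypothetical. It also assumes process returns a document with the same _id, and that you have backed up the source collection first, since documents are deleted from it.

```python
from typing import Any, Callable


def drain_in_chunks(source: Any, target: Any, chunk_size: int,
                    process: Callable[[dict], dict]) -> int:
    """Move documents from source to target one chunk at a time.

    `source` and `target` are assumed to behave like pymongo Collections
    (find/limit, insert_many, delete_many). Returns the number moved.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    moved = 0
    while True:
        # No skip and no sort: whatever is still in source is unprocessed.
        chunk = list(source.find({}).limit(chunk_size))
        if not chunk:
            return moved

        results = [process(doc) for doc in chunk]

        # Write results first, then remove the originals. If the run dies
        # in between, the chunk is still in source and will be retried
        # (the retry may hit duplicate _id errors in target; handling that
        # is left out of this sketch).
        target.insert_many(results)
        source.delete_many({"_id": {"$in": [doc["_id"] for doc in chunk]}})
        moved += len(chunk)
```

Because progress is recorded by what remains in the source collection, there is no page number to keep track of, and chunk_size can differ from one run to the next.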
This approach has several benefits:
- More error-resistant, because you don't have to handle page/chunk numbers.
- Robust, because even if something goes wrong during an iteration, you don't lose the work done for the prior chunks. You only need to restart the current iteration.
- Flexible/scalable, since you can change the chunk size between any two iterations, increasing or decreasing it depending on how slow or fast the processing is. Additionally, you can spread the processing over a long timespan: save the results up to a certain point, take a break or a vacation, and resume when you return! You can also distribute the load across a number of worker processes to speed things up.
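As one possible way to tune the chunk size between iterations, you could scale it by how long the previous chunk took relative to a target duration. This is only an illustrative heuristic; next_chunk_size, the target of a few seconds per chunk, and the default bounds are all assumptions, not anything MongoDB prescribes.

```python
def next_chunk_size(current: int, elapsed_seconds: float,
                    target_seconds: float,
                    minimum: int = 10, maximum: int = 10_000) -> int:
    """Suggest the next chunk size from the last chunk's processing time."""
    if elapsed_seconds <= 0:
        # Too fast to measure: just double, within the upper bound.
        return min(maximum, current * 2)
    # Slow chunk (elapsed > target) shrinks the size; fast chunk grows it.
    scaled = int(current * target_seconds / elapsed_seconds)
    return max(minimum, min(maximum, scaled))
```

You would time each chunk (for example with time.monotonic()) and pass the result in before fetching the next one.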
Good luck!