
I want to read a full MongoDB collection into Spark using the Mongo Spark connector (Scala API) as efficiently as possible in terms of disk I/O.

After reading the connector docs and code, I understand that the partitioners all work by computing minimum and maximum boundaries over an indexed field. My understanding (borne out by my tests with explain) is that each cursor then scans the index for document keys within its partition's boundaries and fetches the corresponding documents.
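
For concreteness, here is a minimal sketch of the kind of read I'm doing. It assumes mongo-spark-connector 2.x for Scala; the URI, database, collection, and partitioner options are placeholders (roughly the documented defaults), not my exact setup.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

object FullCollectionRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-full-read")
      // Placeholder URI, database, and collection.
      .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
      .getOrCreate()

    // Partitioner settings along the lines of the connector docs: partition
    // boundaries are computed over an indexed field (_id here).
    val readConfig = ReadConfig(
      Map(
        "partitioner" -> "MongoSamplePartitioner",
        "partitionerOptions.partitionKey" -> "_id",
        "partitionerOptions.partitionSizeMB" -> "64"
      ),
      Some(ReadConfig(spark)))

    // Each RDD partition opens a cursor bounded by the computed min/max of
    // the partition key, which is where the index scan comes from.
    val rdd = MongoSpark.load(spark.sparkContext, readConfig)
    println(s"documents: ${rdd.count()}")

    spark.stop()
  }
}
```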

My concern is that this index-scan approach will result in random disk reads and, ultimately, more IOPS than necessary. In my case the problem is accentuated because the collection is larger than available RAM (I know that's not recommended). Wouldn't it be orders of magnitude faster to use a natural-order cursor to read the documents as they are stored on disk? How can I accomplish this?
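
To illustrate what I mean by a natural-order read (outside Spark): with the plain MongoDB Scala driver one can hint `$natural` so the server walks the collection in storage order instead of driving the read through an index. This is only a sketch with placeholder connection details; as far as I can tell the connector doesn't expose an equivalent option.

```scala
import org.mongodb.scala._

import scala.concurrent.Await
import scala.concurrent.duration.Duration

object NaturalOrderScan {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details.
    val client = MongoClient("mongodb://host:27017")
    val collection = client.getDatabase("mydb").getCollection("mycollection")

    // hint({$natural: 1}) asks the server to return documents in storage
    // order (a collection scan) rather than via an index.
    val scan = collection.find().hint(Document("$natural" -> 1))

    // Blocking and materialising everything is only for the demo; a real job
    // would subscribe and stream the results.
    val docs = Await.result(scan.toFuture(), Duration.Inf)
    println(s"read ${docs.size} documents")

    client.close()
  }
}
```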

Mike Trotta
  • What do indexed fields and boundaries have to do with collection scans? – D. SM Dec 01 '20 at 00:29
  • @D.SM I'm describing the implementation of the [Mongo Spark Connector](https://docs.mongodb.com/spark-connector/master/configuration#partitioner-configuration). Their recommended approach is to use a partition key on an indexed field to compute RDD partition boundaries. I believe that approach will result in an index scan and non-sequential disk reads. – Mike Trotta Dec 02 '20 at 00:06
  • Define one partition then? – D. SM Dec 02 '20 at 00:12
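
In case it helps the discussion, a single partition can apparently be forced through the partitioner configuration. The sketch below assumes connector 2.x and that the `MongoSinglePartitioner` shipped with the connector is selectable by name; the URI and names are placeholders again. The whole collection then goes through one cursor, at the cost of all read parallelism.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

object SinglePartitionRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-single-partition")
      // Placeholder URI, database, and collection.
      .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
      .getOrCreate()

    // One partition for the whole collection: a single unbounded cursor reads
    // it end to end, but only one Spark task does the work.
    val readConfig = ReadConfig(
      Map("partitioner" -> "MongoSinglePartitioner"),
      Some(ReadConfig(spark)))

    println(MongoSpark.load(spark.sparkContext, readConfig).count())

    spark.stop()
  }
}
```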

0 Answers