
I want to read a full MongoDB collection into Spark using the Mongo Spark connector (Scala API) as efficiently as possible in terms of disk I/O.

After reading the connector docs and code, I understand that the partitioners all work by computing minimum and maximum boundaries over an indexed field. My understanding (borne out by my tests with explain) is that each cursor then scans the index for document keys within its partition's boundaries and fetches the corresponding documents.
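
For concreteness, here is a minimal sketch of the kind of read I'm doing. It assumes mongo-spark-connector 2.x for Scala; the URI, database, collection, and partitioner options are placeholders (roughly the documented defaults), not my exact setup.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

object FullCollectionRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-full-read")
      // Placeholder URI, database, and collection.
      .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
      .getOrCreate()

    // Partitioner settings along the lines of the connector docs: partition
    // boundaries are computed over an indexed field (_id here).
    val readConfig = ReadConfig(
      Map(
        "partitioner" -> "MongoSamplePartitioner",
        "partitionerOptions.partitionKey" -> "_id",
        "partitionerOptions.partitionSizeMB" -> "64"
      ),
      Some(ReadConfig(spark)))

    // Each RDD partition opens a cursor bounded by the computed min/max of
    // the partition key, which is where the index scan comes from.
    val rdd = MongoSpark.load(spark.sparkContext, readConfig)
    println(s"documents: ${rdd.count()}")

    spark.stop()
  }
}
```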

My concern is that this index-scan approach will result in random disk reads and, ultimately, more IOPS than necessary. In my case the problem is accentuated because the collection is larger than available RAM (I know that's not recommended). Wouldn't it be orders of magnitude faster to use a natural-order cursor to read the documents as they are stored on disk? How can I accomplish this?
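
To illustrate what I mean by a natural-order read (outside Spark): with the plain MongoDB Scala driver one can hint `$natural` so the server walks the collection in storage order instead of driving the read through an index. This is only a sketch with placeholder connection details; as far as I can tell the connector doesn't expose an equivalent option.

```scala
import org.mongodb.scala._

import scala.concurrent.Await
import scala.concurrent.duration.Duration

object NaturalOrderScan {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details.
    val client = MongoClient("mongodb://host:27017")
    val collection = client.getDatabase("mydb").getCollection("mycollection")

    // hint({$natural: 1}) asks the server to return documents in storage
    // order (a collection scan) rather than via an index.
    val scan = collection.find().hint(Document("$natural" -> 1))

    // Blocking and materialising everything is only for the demo; a real job
    // would subscribe and stream the results.
    val docs = Await.result(scan.toFuture(), Duration.Inf)
    println(s"read ${docs.size} documents")

    client.close()
  }
}
```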

Mike Trotta
  • What do indexed fields and boundaries have to do with collection scans? – D. SM Dec 01 '20 at 00:29
  • @D.SM I'm describing the implementation of the [Mongo Spark Connector](https://docs.mongodb.com/spark-connector/master/configuration#partitioner-configuration). Their recommended approach is to use a partition key on an indexed field to compute RDD partition boundaries. I believe that approach will result in an index scan and non-sequential disk reads. – Mike Trotta Dec 02 '20 at 00:06
  • Define one partition then? – D. SM Dec 02 '20 at 00:12
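
In case it helps the discussion, a single partition can apparently be forced through the partitioner configuration. The sketch below assumes connector 2.x and that the `MongoSinglePartitioner` shipped with the connector is selectable by name; the URI and names are placeholders again. The whole collection then goes through one cursor, at the cost of all read parallelism.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

object SinglePartitionRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-single-partition")
      // Placeholder URI, database, and collection.
      .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
      .getOrCreate()

    // One partition for the whole collection: a single unbounded cursor reads
    // it end to end, but only one Spark task does the work.
    val readConfig = ReadConfig(
      Map("partitioner" -> "MongoSinglePartitioner"),
      Some(ReadConfig(spark)))

    println(MongoSpark.load(spark.sparkContext, readConfig).count())

    spark.stop()
  }
}
```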

0 Answers