
I have several MongoDB instances on the same machine, each pointing to an in-memory partition created on Linux with something like:

mount -t ramfs -o size=8000M ramfs /mongo/ramdata<n>/

with this configuration for each MongoDB instance <n> (e.g. 1):

dbpath=/mongo/ramdata1/
nojournal = true
smallFiles = true
noprealloc = true

These instances hold exactly the same data, and I am using just the plain MongoDB Java driver to geo-query them; the data are meant to be read-only (no MongoDB-Hadoop, Stratio, or anything of the sort).

So at some point I would like my Spark process to finish with something like:

...foreach(query_a_specific_mongo_instance_for_a_specific_port)

as the MongoDB instances will run at the same address but different ports.

Given that I don't want to create a MongoDB replica set with one or more Mongo config instances, is it possible to "partition" the process flow in Spark so that, for example, every single "core/partition" points to a specific MongoDB port?

For example, if I have 100 cores, could the first "core" point to mongo-address:30001, and the 100th core to mongo-address:30100?

Randomize
  • If these run on the same machine and use the same data, why run multiple instances at all? – zero323 Feb 20 '16 at 12:05
  • It seems Mongo is locking the queries over a specific limit. However, this is a temporary configuration; once it is fixed locally, it will be "distributed". – Randomize Feb 20 '16 at 12:41
  • Maybe I am simplifying things a bit, but why not simply use round robin over partitions? – zero323 Feb 20 '16 at 12:52
  • How? I was thinking of doing some sort of partitioning like "if one parameter is in this range of values, use this instance; otherwise use another". – Randomize Feb 20 '16 at 13:18

1 Answer


I would simply use mapPartitionsWithIndex with a small helper:

// assuming the official mongo-scala-driver here; the original answer does not
// pin a specific driver, so adjust the import to whichever client you use
import org.mongodb.scala.MongoClient

val mongos: Vector[String] = ??? // Vector("mongodb://mongo-address:30001", ...)

// Map a partition index to one of the available mongod instances,
// wrapping with modulo so any number of partitions works.
def getClient(mongos: Seq[String])(i: Int): MongoClient =
  MongoClient(mongos(i % mongos.size))

rdd.mapPartitionsWithIndex((i, iter) => {
  val client = getClient(mongos)(i)
  iter.map(someFunctionWhichIsUsingTheClient)
})
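
One caveat worth noting (a hedged refinement, not part of the original answer): iter.map is lazy, so the client cannot simply be closed right after the map call or it may be shut down before the iterator is consumed. A minimal sketch of one way to manage the client's lifecycle, reusing the getClient helper above with a hypothetical queryMongo(client, record) function:

rdd.mapPartitionsWithIndex { (i, iter) =>
  val client = getClient(mongos)(i)
  // Materialize the results eagerly so the client can be closed before
  // this task returns; assumes one partition's results fit in memory.
  val results = iter.map(record => queryMongo(client, record)).toVector
  client.close()
  results.iterator
}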

For other actions or transformations you can get partition / host info using TaskContext. See How to get ID of a map task in Spark?
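
For instance, a hedged sketch of the TaskContext variant with foreachPartition, which matches the ...foreach(...) shape from the question (queryMongo is again a hypothetical helper, and getClient is the helper from above):

import org.apache.spark.TaskContext

rdd.foreachPartition { iter =>
  // TaskContext.get.partitionId yields the same index that
  // mapPartitionsWithIndex passes in, so the modulo routing still works.
  val client = getClient(mongos)(TaskContext.get.partitionId)
  try iter.foreach(record => queryMongo(client, record))
  finally client.close()
}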

zero323