
I have several MongoDB instances on the same machine, each pointing to an in-memory partition created on Linux with something like:

mount -t ramfs -o size=8000M ramfs /mongo/ramdata<n>/

with this configuration for each MongoDB instance <n> (e.g. 1):

dbpath=/mongo/ramdata1/
nojournal = true
smallFiles = true
noprealloc = true

These instances hold exactly the same data, and I am using just the plain MongoDB Java driver to geo-query them; the data are meant to be read-only (no MongoDB-Hadoop, Stratio, or anything of the sort).

So at some point I would like my Spark process to finish with something like:

...foreach(query_a_specific_mongo_instance_for_a_specific_port)

as the MongoDB instances will run at the same address but different ports.

Given that I don't want to create a MongoDB replica set with one or more Mongo config instances, is it possible to "partition" the process flow in Spark so that, for example, every single "core/partition" points to a specific MongoDB port?

For example, if I have 100 cores, could the first "core" point to mongo-address:30001, and the 100th core to mongo-address:30100?

Randomize
  • If these run on the same machine and use the same data, why run multiple instances at all? – zero323 Feb 20 '16 at 12:05
  • It seems Mongo is locking the queries over a specific limit. However, this is a temporary configuration; once it is fixed locally, it will be "distributed". – Randomize Feb 20 '16 at 12:41
  • Maybe I am simplifying things a bit, but why not simply use round robin over partitions? – zero323 Feb 20 '16 at 12:52
  • How? I was thinking of doing some sort of partitioning like "if one parameter is in this range of values, use this instance; otherwise use another". – Randomize Feb 20 '16 at 13:18

1 Answer


I would simply use mapPartitionsWithIndex with a small helper:

// assuming the official mongo-scala-driver here; the original answer does not
// pin a specific driver, so adjust the import to whichever client you use
import org.mongodb.scala.MongoClient

val mongos: Vector[String] = ??? // Vector("mongodb://mongo-address:30001", ...)

// Map a partition index to one of the available mongod instances,
// wrapping with modulo so any number of partitions works.
def getClient(mongos: Seq[String])(i: Int): MongoClient =
  MongoClient(mongos(i % mongos.size))

rdd.mapPartitionsWithIndex((i, iter) => {
  val client = getClient(mongos)(i)
  iter.map(someFunctionWhichIsUsingTheClient)
})
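
One caveat worth noting (a hedged refinement, not part of the original answer): iter.map is lazy, so the client cannot simply be closed right after the map call or it may be shut down before the iterator is consumed. A minimal sketch of one way to manage the client's lifecycle, reusing the getClient helper above with a hypothetical queryMongo(client, record) function:

rdd.mapPartitionsWithIndex { (i, iter) =>
  val client = getClient(mongos)(i)
  // Materialize the results eagerly so the client can be closed before
  // this task returns; assumes one partition's results fit in memory.
  val results = iter.map(record => queryMongo(client, record)).toVector
  client.close()
  results.iterator
}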

For other actions or transformations you can get partition / host info using TaskContext. See How to get ID of a map task in Spark?
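
For instance, a hedged sketch of the TaskContext variant with foreachPartition, which matches the ...foreach(...) shape from the question (queryMongo is again a hypothetical helper, and getClient is the helper from above):

import org.apache.spark.TaskContext

rdd.foreachPartition { iter =>
  // TaskContext.get.partitionId yields the same index that
  // mapPartitionsWithIndex passes in, so the modulo routing still works.
  val client = getClient(mongos)(TaskContext.get.partitionId)
  try iter.foreach(record => queryMongo(client, record))
  finally client.close()
}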

zero323