
I want to read a huge MongoDB collection from Spark, create a persistent RDD, and do further data analysis on it.

Is there any way I can read the data from MongoDB faster? I have tried the approach of the MongoDB Java driver + Casbah.

Can I use the workers to read the data in parallel from MongoDB, then save it as persistent data and use it?

Ajay Gupta
  • Using MongoDB with Hadoop Spark: [Part 1](https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup), [Part 2](https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example), [Part 3](https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways). – zero323 Sep 09 '15 at 02:59
  • @zero323 I have tried it, but it is still slow: only the master is pulling data from MongoDB and none of the workers are doing anything. – Ajay Gupta Sep 09 '15 at 04:26
  • @imaGin - can you share your code where you tried this approach and it didn't work? – Holden Sep 09 '15 at 08:27
  • @Holden: My bad, I was running the code locally instead of on a cluster. And thanks for the book "Learning Spark" :) – Ajay Gupta Sep 09 '15 at 18:53

1 Answer


There are two ways of getting the data from MongoDB to Apache Spark.

Method 1: Using Casbah (a layer on the MongoDB Java driver)

import com.mongodb.casbah.Imports._

// Connect to the remote MongoDB instance and pull the whole collection
// onto the driver, then parallelize it into an RDD and write it to HDFS.
val uriRemote = MongoClientURI("mongodb://RemoteURL:27017/")
val mongoClientRemote = MongoClient(uriRemote)
val dbRemote = mongoClientRemote("dbName")
val collectionRemote = dbRemote("collectionName")
val ipMongo = collectionRemote.find
val ipRDD = sc.makeRDD(ipMongo.toList) // .toList materializes every document in driver memory
ipRDD.saveAsTextFile("hdfs://path/to/hdfs")

Here we use Scala and Casbah to fetch the entire collection on the driver first, then save it to HDFS.
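
Since the question asks for a persistent RDD, here is a minimal sketch of persisting and reusing the saved data, assuming the HDFS path above and a Spark shell where sc is already defined:

import org.apache.spark.storage.StorageLevel

// Load the dump back from HDFS, cache it across the cluster, and run a
// simple count as a stand-in for real analysis.
val persisted = sc.textFile("hdfs://path/to/hdfs").persist(StorageLevel.MEMORY_AND_DISK)
println(persisted.count())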

Method 2: Using Spark workers (MongoDB Hadoop Connector)

This is the better version of the code: it uses the Spark workers and multiple cores to fetch the data in far less time.

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Configure the MongoDB Hadoop connector; each Spark worker reads its own
// split of the collection in parallel instead of funnelling everything
// through the driver.
val config = new Configuration()
config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat")
config.set("mongo.input.uri", "mongodb://RemoteURL:27017/dbName.collectionName")
val keyClassName = classOf[Object]
val valueClassName = classOf[BSONObject]
val inputFormatClassName = classOf[MongoInputFormat]
val ipRDD = sc.newAPIHadoopRDD(config, inputFormatClassName, keyClassName, valueClassName)
ipRDD.saveAsTextFile("hdfs://path/to/hdfs")
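
To use the fetched documents directly instead of only dumping them to HDFS, a minimal sketch (the field name "source" is an assumed example, not something the collection is known to contain):

// Pull one field out of each BSON document and cache the result for
// repeated analysis; "source" is a hypothetical field name.
val sources = ipRDD.map { case (_, doc) => doc.get("source") }.cache()
println(sources.distinct().count())
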
Ajay Gupta
  • To add to this: to have the Spark MongoDB Hadoop Connector run a MongoDB query, so that only the queried data is fetched instead of the whole collection, we can use config.set("mongo.input.query","{'source':'xyz'}") (see the sketch after these comments). – Ajay Gupta Sep 10 '15 at 04:21
  • Hi, I'm also having trouble reading a lot of data from MongoDB. Does Casbah perform better than the mongo-java-driver? – Kent Sep 13 '15 at 17:04
  • Casbah reads on a single process, so read and write times are slow, but the Mongo Hadoop Connector fetches the data with multiple processes across the cluster, so the same data can be read and stored in far less time. – Ajay Gupta Sep 14 '15 at 00:45
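
Following up on the query comment above, a minimal sketch of pushing a filter down to MongoDB, assuming the same config, inputFormatClassName, keyClassName, and valueClassName as in Method 2 (the 'source' query is just the example from the comment):

// Only documents matching the query are shipped to the workers,
// instead of the whole collection.
config.set("mongo.input.query", "{'source':'xyz'}")
val filteredRDD = sc.newAPIHadoopRDD(config, inputFormatClassName, keyClassName, valueClassName)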