I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.
I'm facing two problems:
- [main problem] I can't connect to Mongo if I define the host according to the documentation (using all instances in the mongo replica set)
- [secondary/related problem] If I connect to the primary only, I can write... but I typically crash the primary writing the first collection
Environment:
- mongo-spark-connector 1.1
- spark 1.6
- scala 2.10.5
First I'll set up a dummy example to demonstrate...
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
/**
* fake json data
*/
val recs: List[String] = List(
"""{"a": 123, "b": 456, "c": "apple"}""",
"""{"a": 345, "b": 72, "c": "banana"}""",
"""{"a": 456, "b": 754, "c": "cat"}""",
"""{"a": 876, "b": 43, "c": "donut"}""",
"""{"a": 432, "b": 234, "c": "existential"}"""
)
val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))
Some values that won't change...
// credentials
val user = ???
val pwd = ???
// fixed values
val db = "db_name"
val replset = "replset_name"
val collection_name = "collection_name"
Here's what does NOT work. In the snippets below, "url" stands for a hostname such as machine.unix.domain.org, and "ip" stands for an IP address.
This is how the documentation says to define the host: list every machine in the replica set.
val host = "url1:27017,url2:27017,url3:27017"
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"
I can't get either of these to work. Using every permutation I can think of for the uri...
val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}@${host}" // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}@${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
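For reference, here is a minimal sketch of assembling a replica-set URI in the standard MongoDB connection-string layout (`mongodb://user:pwd@host1,host2/db?replicaSet=name`), which matches the first permutation above. `MongoUriSketch`, `encode`, and `buildUri` are hypothetical names, not part of the connector. Two things in it are assumptions rather than known fixes: that the credentials might contain characters such as `@` or `:` that need percent-encoding, and that an explicit `authSource` option might be needed if the user is defined in a different database than the one in the URI path.

```scala
import java.net.URLEncoder

object MongoUriSketch {
  // Percent-encode a credential so characters such as '@', ':' or '/' cannot
  // be mistaken for URI delimiters. URLEncoder emits '+' for a space, so
  // that is rewritten to "%20".
  def encode(s: String): String =
    URLEncoder.encode(s, "UTF-8").replace("+", "%20")

  // Hypothetical helper: joins the seed list with commas and appends the
  // replicaSet option, plus authSource when one is supplied.
  def buildUri(
      user: String,
      pwd: String,
      hosts: Seq[String],
      db: String,
      replset: String,
      authSource: Option[String] = None): String = {
    val seedList = hosts.mkString(",")
    val options = Seq(s"replicaSet=$replset") ++ authSource.toSeq.map(a => s"authSource=$a")
    s"mongodb://${encode(user)}:${encode(pwd)}@$seedList/$db?${options.mkString("&")}"
  }
}
```

With the values from this post, `MongoUriSketch.buildUri(user, pwd, Seq("url1:27017", "url2:27017", "url3:27017"), db, replset)` would produce a URI of the same shape as the first permutation, with the credentials encoded.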
EDIT: more detail on the error messages. The errors take two forms...
form 1
This typically includes java.net.UnknownHostException: machine.unix.domain.org.
It also comes back with the server addresses as hostnames, even when I define them as IP addresses.
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}]
form 2
(authentication error... though connecting with same credentials to primary only works fine)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,
type=UNKNOWN, state=CONNECTING, exception=
{com.mongodb.MongoSecurityException: Exception authenticating
MongoCredential{mechanism=null, userName='xx', source='admin', password=
<hidden>, mechanismProperties={}}}, caused by
{com.mongodb.MongoCommandException: Command failed with error 18:
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...
end EDIT
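Since form 1 ends in `java.net.UnknownHostException`, one thing worth checking is whether the hostnames in that error resolve from the machines running Spark. The sketch below uses only the JDK's `InetAddress`; `HostCheckSketch`, `resolve`, and `checkSeedList` are hypothetical names. It rests on an assumption: that the client, once in replica-set mode, switches to the member names advertised by the replica set itself, which would explain hostnames showing up even when IPs were supplied.

```scala
import java.net.InetAddress
import scala.util.Try

object HostCheckSketch {
  // Returns the resolved address for a host, or None if resolution fails
  // (for example with java.net.UnknownHostException).
  def resolve(host: String): Option[String] =
    Try(InetAddress.getByName(host).getHostAddress).toOption

  // Splits a "host:port,host:port" seed list, drops the ports, and maps
  // each distinct host to its resolution result.
  def checkSeedList(seedList: String): Map[String, Option[String]] =
    seedList.split(",")
      .map(_.trim.takeWhile(_ != ':'))
      .map(h => h -> resolve(h))
      .toMap
}
```

Calling `HostCheckSketch.checkSeedList("url1:27017,url2:27017,url3:27017")` in the Spark shell runs the check on the driver only. Any host mapped to `None` cannot be resolved from that machine; the executors would need the same check run on them.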
Here's what DOES work, on the dummy data only (more on that below)...
val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}@${host}/${db}"
val writeConfig: WriteConfig =
  WriteConfig(Map(
    "uri" -> uri,
    "database" -> db,
    "collection" -> collection_name,
    "replicaSet" -> replset))
// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)
This (connecting to the primary only) works great for dummy data, but crashes the primary for real data (50-100 GB from an RDD with 2700 partitions). My guess is that it opens too many connections at once: it looks like it opens ~900 connections to write (this jibes with a default parallelism of 2700, based on 900 virtual cores and a parallelism factor of 3x).
I'm guessing if I repartition so it opens fewer connections, I'll have better luck... but I'm guessing this also ties in to writing to the primary only instead of spreading it over all instances.
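The repartitioning idea could be sketched as follows, assuming (as guessed above) that each concurrently running write task holds its own connection, so fewer partitions means fewer simultaneous writers. `ThrottledWriteSketch`, `writerPartitions`, `saveThrottled`, and `maxWriters` are hypothetical names, and any particular cap (64 in the usage line below) is an arbitrary illustration rather than a tuned value.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
import org.bson.Document

object ThrottledWriteSketch {
  // Number of partitions to write with: never more than maxWriters, never
  // more than the RDD already has, and at least 1.
  def writerPartitions(current: Int, maxWriters: Int): Int =
    math.max(1, math.min(current, maxWriters))

  // Hypothetical wrapper: narrows the RDD with coalesce (no shuffle) before
  // saving, so that at most maxWriters write tasks can run at once.
  def saveThrottled(rdd: RDD[Document], writeConfig: WriteConfig, maxWriters: Int): Unit = {
    val target = writerPartitions(rdd.partitions.length, maxWriters)
    MongoSpark.save(rdd.coalesce(target), writeConfig)
  }
}
```

Usage would look like `ThrottledWriteSketch.saveThrottled(rdd_hex_bson, writeConfig, 64)`. Because `coalesce` without a shuffle also narrows the stage that produces the data, `repartition` may be the better choice if the upstream computation should keep its full parallelism.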
I've read everything I can find here... but most examples are for single instance connections... https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration