I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:

spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!@docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" --conf "spark.mongodb.output.collection=ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3

I'm a beginner in Spark & AWS. Can anyone please help?

1 Answer

DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch, so you first need to install the CA certs on each instance. AWS has a guide for this under its Java section, with two bash scripts that make things easier.

Once the certs are installed, your Spark command needs to reference the truststore and its password through the configuration parameters you pass to Spark. Here is an example that I ran, and it worked fine:

spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
  pytest.py

You can pass the same configuration options to spark-shell as well.
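For spark-shell, the same options might be wired up roughly as below. This is only a sketch: the truststore path and the `<yourpassword>` placeholder are taken from the spark-submit example above, the function name `mongo_spark_shell` is made up for illustration, and the `--driver-java-options` line is my own assumption (in spark-shell the driver may open connections too), not something the command above included.

```shell
# Assumed values: adjust to your own truststore location and password.
TRUSTSTORE=/tmp/certs/rds-truststore.jks
TRUSTSTORE_PASSWORD='<yourpassword>'

# JVM options pointing at the truststore that holds the CA certs.
SSL_OPTS="-Djavax.net.ssl.trustStore=${TRUSTSTORE} -Djavax.net.ssl.trustStorePassword=${TRUSTSTORE_PASSWORD}"

# Hypothetical wrapper: starts spark-shell with the truststore options.
# Extra arguments (for example more --conf flags) are passed through.
mongo_spark_shell() {
  spark-shell \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
    --conf "spark.executor.extraJavaOptions=${SSL_OPTS}" \
    --driver-java-options "${SSL_OPTS}" \
    "$@"
}
```

Calling `mongo_spark_shell` then starts the shell; any extra `--conf` flags, such as the output URI, can be appended as arguments.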

One thing I did find tricky was that the Mongo Spark connector doesn't appear to recognise the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark; the Spark executors reference the truststore from the configuration anyway.
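If the connection string comes from somewhere else and already carries that parameter, it could be stripped before handing the URI to Spark. A small sketch, assuming a made-up URI (the user, password, endpoint, and bundle file name are all placeholders, not values from the question):

```shell
# Hypothetical URI for illustration only; every value here is a placeholder.
URI='mongodb://<user>:<password>@<cluster-endpoint>:27017/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0'

# Drop the ssl_ca_certs parameter, keeping the preceding "?" or "&",
# then remove a dangling "?" or "&" if the parameter was the last one.
CLEAN_URI=$(printf '%s' "$URI" | sed -E 's/([?&])ssl_ca_certs=[^&]*&?/\1/; s/[?&]$//')

echo "$CLEAN_URI"
```

The resulting value can then be used in `spark.mongodb.output.uri`.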
