
The code works fine in the pyspark shell, but when I try to write a program in Java or Scala, I get exceptions.

What is the best way to store a Spark DataFrame in MongoDB using Python?

  • pyspark version: 2.2.0
  • MongoDB version: 3.4
  • Python 2.7
  • Java: JDK 9

Here is my code:

from pyspark.sql import SparkSession

# Build a SparkSession configured with MongoDB input/output URIs.
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

# Read the CSV file, then append it to the auto.autod collection
# through the MongoDB Spark connector.
dataframe = my_spark.read.csv('auto-data.csv', header=True)
dataframe.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append").option("database", "auto").option("collection", "autod").save()

Snapshots of my CSV data and of the errors were attached as images (not reproduced here).

I tried again after installing the mongo-spark library from GitHub, yet I got the same result.

  • You need to provide the required jar packages using the `--jars` option while submitting the script. The error clearly indicates that it is not able to find the required class. – pauli Sep 29 '17 at 02:48
  • I pretty much tried that. I also added mongo-spark, which contains the jar files, but I still couldn't solve this issue. – Rushi Pandya Sep 29 '17 at 03:18
  • Post the full command you are using to run the script. It might be helpful in figuring out what you are missing. – pauli Sep 29 '17 at 03:38
  • Just a wild thought: can you fall back to JDK 8? I don't think Spark is compatible with JDK 9 yet. Then try again and see if you get the same errors. – geo Sep 29 '17 at 04:11
  • Use Java 8 or below. – philantrovert Sep 29 '17 at 06:32
  • @ashwinids I used `python filename.py` to run the script. I tried the same code in the pyspark shell and it worked, but not as a Python script. – Rushi Pandya Sep 29 '17 at 12:03
  • You have to use the `spark-submit` shell script to run the Python script; a minimal example follows below. See this [**question**](https://stackoverflow.com/questions/38120011/using-spark-submit-with-python-main) to understand how to run Python scripts and attach dependency jars. – pauli Sep 29 '17 at 12:20
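
For example, a minimal invocation might look like the sketch below (an assumption for illustration: `filename.py` is the script name from the comments, and the connector coordinates match pyspark 2.2.0 built against Scala 2.11, as in the answer):

spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 filename.py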

1 Answer


You need to download all the dependencies and store them at one location, "/opt/jars" in the following example. Jars required:

  1. mongo-spark-connector_2.12-2.4.0.jar
  2. mongodb-driver-3.10.1.jar
  3. mongo-hadoop-core-1.3.0.jar (in case you are running Spark on YARN)

sudo wget https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/2.4.0/mongo-spark-connector_2.12-2.4.0.jar
sudo wget https://repo1.maven.org/maven2/org/mongodb/mongodb-driver/3.10.1/mongodb-driver-3.10.1.jar
sudo wget https://repo1.maven.org/maven2/org/mongodb/mongo-hadoop-core/1.3.0/mongo-hadoop-core-1.3.0.jar

Then execute with the following command:

spark-submit --jars "/opt/jars/*.jar" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 <your file>.py arg1 arg2
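
To confirm the write succeeded, here is a minimal sketch that reads the collection back (an assumption for illustration: it reuses the `my_spark` session and the local MongoDB URIs from the question):

# Read the auto.autod collection back to verify the write; the
# database/collection options override the configured input URI.
result = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("database", "auto") \
    .option("collection", "autod") \
    .load()
result.show(5)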
– Neha Jirafe