
The code works fine in the pyspark shell, but when I try to write a program in Java or Scala, I get exceptions.

What is the best way to store a Spark DataFrame in MongoDB using Python?

  • pyspark version: 2.2.0
  • MongoDB version: 3.4
  • Python 2.7
  • Java: JDK 9

Here is my code:

from pyspark.sql import SparkSession

# Build a SparkSession configured with MongoDB input/output URIs.
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

# Read the CSV file, then append it to the auto.autod collection
# through the MongoDB Spark connector.
dataframe = my_spark.read.csv('auto-data.csv', header=True)
dataframe.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append").option("database", "auto").option("collection", "autod").save()

Snapshots of my CSV data and of the errors were attached as images (not reproduced here).

I tried again after installing the mongo-spark library from GitHub, yet I got the same result.

  • You need to provide the required jar packages using the `--jars` option while submitting the script. The error clearly indicates that it is not able to find the required class. – pauli Sep 29 '17 at 02:48
  • I pretty much tried that. I also added mongo-spark, which contains the jar files, but I still couldn't solve this issue. – Rushi Pandya Sep 29 '17 at 03:18
  • Post the full command you are using to run the script. It might be helpful in figuring out what you are missing. – pauli Sep 29 '17 at 03:38
  • Just a wild thought: can you fall back to JDK 8? I don't think Spark is compatible with JDK 9 yet. Then try again and see if you get the same errors. – geo Sep 29 '17 at 04:11
  • Use Java 8 or below. – philantrovert Sep 29 '17 at 06:32
  • @ashwinids I used `python filename.py` to run the script. I tried the same code in the pyspark shell and it worked, but not as a Python script. – Rushi Pandya Sep 29 '17 at 12:03
  • You have to use the `spark-submit` shell script to run the Python script; a minimal example follows below. See this [**question**](https://stackoverflow.com/questions/38120011/using-spark-submit-with-python-main) to understand how to run Python scripts and attach dependency jars. – pauli Sep 29 '17 at 12:20
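
For example, a minimal invocation might look like the sketch below (an assumption for illustration: `filename.py` is the script name from the comments, and the connector coordinates match pyspark 2.2.0 built against Scala 2.11, as in the answer):

spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 filename.py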

1 Answer


You need to download all the dependencies and store them at one location, "/opt/jars" in the following example. Jars required:

  1. mongo-spark-connector_2.12-2.4.0.jar
  2. mongodb-driver-3.10.1.jar
  3. mongo-hadoop-core-1.3.0.jar (in case you are running Spark on YARN)

sudo wget https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/2.4.0/mongo-spark-connector_2.12-2.4.0.jar
sudo wget https://repo1.maven.org/maven2/org/mongodb/mongodb-driver/3.10.1/mongodb-driver-3.10.1.jar
sudo wget https://repo1.maven.org/maven2/org/mongodb/mongo-hadoop-core/1.3.0/mongo-hadoop-core-1.3.0.jar

Then execute with the following command:

spark-submit --jars "/opt/jars/*.jar" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 <your file>.py arg1 arg2
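
To confirm the write succeeded, here is a minimal sketch that reads the collection back (an assumption for illustration: it reuses the `my_spark` session and the local MongoDB URIs from the question):

# Read the auto.autod collection back to verify the write; the
# database/collection options override the configured input URI.
result = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("database", "auto") \
    .option("collection", "autod") \
    .load()
result.show(5)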
– Neha Jirafe