
I have a program that takes a dataframe and should save it into Elasticsearch. Here's what it looks like when I save the dataframe:

    model_df.write.format("org.elasticsearch.spark.sql") \
        .option("pushdown", True) \
        .option("es.nodes", "example.server:9200") \
        .option("es.index.auto.create", True) \
        .mode("append") \
        .save("EPTestIndex/")

When I run my program, I get this error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o96.save.
    : java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql.
    Please find packages at http://spark.apache.org/third-party-projects.html

I did some research and thought I needed a jar, so I added these configurations to my SparkSession:

    spark = SparkSession.builder.config("jars", "/Users/public/ProjectDirectory/lib/elasticsearch-spark-20_2.11-6.0.1.jar")\
        .getOrCreate()
    sqlContext = SQLContext(spark)

I initialize the SparkSession in main and write to ES in another package. The package takes the dataframe and runs the write command above. However, even with this I am still getting the same ClassNotFoundException. What might be the issue?

I am running this program in PyCharm. How can I make it so that PyCharm is able to run it?
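For what it's worth, Spark reads jar paths from the config key `spark.jars` (or Maven coordinates from `spark.jars.packages`), not from a bare `jars` key. A minimal sketch of how the session might be built, reusing the jar path from the question:

```python
from pyspark.sql import SparkSession

# The config key is "spark.jars", not "jars"; an unrecognized "jars" key is
# silently ignored, so the connector never lands on the driver/executor
# classpath and the data source lookup fails.
spark = SparkSession.builder \
    .config("spark.jars",
            "/Users/public/ProjectDirectory/lib/elasticsearch-spark-20_2.11-6.0.1.jar") \
    .getOrCreate()

# Alternatively, let Spark resolve the connector from Maven at startup:
# .config("spark.jars.packages",
#         "org.elasticsearch:elasticsearch-spark-20_2.11:6.0.1")
```

Because this is session configuration, it also takes effect when the program is launched from PyCharm, as long as the builder runs before any other code grabs the session.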

user2896120

1 Answer


Elasticsearch exposes a JSON API, and a pandas DataFrame is not a JSON-serializable type.

If you had to insert it, you could serialize the dataframe using `dataframe.to_json()`.
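A sketch of that suggestion, assuming a pandas DataFrame (the frame and column names here are hypothetical stand-ins):

```python
import json

import pandas as pd

# Hypothetical small frame standing in for model_df
df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.9]})

# orient="records" yields one JSON object per row, which is the shape
# the Elasticsearch document APIs expect
payload = df.to_json(orient="records")
docs = json.loads(payload)
# docs == [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.9}]
```

Note that the question is about a Spark DataFrame written through the es-hadoop connector, so per-row JSON serialization via pandas is a workaround rather than a fix for the `ClassNotFoundException` itself.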

Viseshini Reddy