Try the package org.elasticsearch:elasticsearch-spark-30_2.12:7.13.1 instead; the -30_2.12 suffix indicates a build for Spark 3.x and Scala 2.12.
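If you launch from Python rather than via spark-submit, here's a minimal sketch of pulling that package in through the standard spark.jars.packages setting (the app name is just a placeholder, and this assumes your machine can reach Maven Central to download the jar):
from pyspark.sql import SparkSession

# Ask Spark to resolve and download the connector from Maven Central
# at session start-up; the coordinates match the suggestion above.
spark = (
    SparkSession.builder
    .appName("es-example")  # placeholder name
    .config(
        "spark.jars.packages",
        "org.elasticsearch:elasticsearch-spark-30_2.12:7.13.1",
    )
    .getOrCreate()
)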
The error you're seeing (java.lang.NoClassDefFoundError: scala/Product$class) usually indicates that you are trying to use a package built for an incompatible version of Scala.
If you are using the most recent zip package from Elasticsearch, as of the date of your question, it is still built for Scala 2.11, as per the conversation here:
https://github.com/elastic/elasticsearch-hadoop/pull/1589
You can confirm the version of Scala used to build your PySpark by doing
spark-submit --version
from the command line. After the Spark logo it will say something like
Using Scala version 2.12.10
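If you'd rather check from inside a running PySpark session, something like this should also work; note that sparkContext._jvm is an internal py4j gateway, so treat it as a convenience rather than a stable API:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # Spark version, e.g. 3.1.2
# Scala version the JVM side was built with, e.g. "version 2.12.10"
print(spark.sparkContext._jvm.scala.util.Properties.versionString())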
Take a look at Elastic's installation page for elasticsearch-hadoop, which includes the compatibility matrix:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
For Spark, it provides this Maven dependency:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-30_2.12</artifactId>
  <version>7.14.0</version>
</dependency>
Now if you're using PySpark, you may be unfamiliar with Maven, so I can appreciate that being handed a Maven dependency isn't all that helpful.
Here's a minimal way to have Maven fetch the jar for you, without having to get into the weeds of an unfamiliar tool.
Install Maven (apt install maven)
Create a new directory
In that directory, create a file called pom.xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>spark-es</groupId>
  <artifactId>spark-esj</artifactId>
  <version>1</version>
  <dependencies>
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch-spark-30_2.12</artifactId>
      <version>7.14.0</version>
    </dependency>
  </dependencies>
</project>
Save that file and create an additional directory called "targetdir" (it could be called anything)
Then run:
mvn dependency:copy-dependencies -DoutputDirectory=targetdir
You'll find your jar (along with any transitive dependencies Maven pulled in) in targetdir.
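Once it's there, you can point PySpark at it and read from Elasticsearch. A rough sketch, where the jar filename, the Elasticsearch host/port, and the index name my-index are placeholders for your own setup:
from pyspark.sql import SparkSession

# Point Spark at the jar Maven copied into targetdir
# (check the exact filename it downloaded; this one is a placeholder).
spark = (
    SparkSession.builder
    .appName("es-read")  # placeholder name
    .config("spark.jars", "targetdir/elasticsearch-spark-30_2.12-7.14.0.jar")
    .getOrCreate()
)

# Read one index into a DataFrame; host, port and index name are placeholders.
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .load("my-index")
)
df.show()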