Try the package org.elasticsearch:elasticsearch-spark-30_2.12:7.13.1 instead; the -30_2.12 suffix indicates a build for Spark 3.x and Scala 2.12.
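If you launch from Python rather than via spark-submit, here's a minimal sketch of pulling that package in through the standard spark.jars.packages setting (the app name is just a placeholder, and this assumes your machine can reach Maven Central to download the jar):
from pyspark.sql import SparkSession

# Ask Spark to resolve and download the connector from Maven Central
# at session start-up; the coordinates match the suggestion above.
spark = (
    SparkSession.builder
    .appName("es-example")  # placeholder name
    .config(
        "spark.jars.packages",
        "org.elasticsearch:elasticsearch-spark-30_2.12:7.13.1",
    )
    .getOrCreate()
)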
The error you're seeing (java.lang.NoClassDefFoundError: scala/Product$class) usually indicates that you are trying to use a package built for an incompatible version of Scala.
If you are using the most recent zip package from Elasticsearch, as of the date of your question, it is still built for Scala 2.11, as per the conversation here:
https://github.com/elastic/elasticsearch-hadoop/pull/1589
You can confirm the version of Scala used to build your PySpark by doing
spark-submit --version
from the command line. After the Spark logo it will say something like
Using Scala version 2.12.10
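If you'd rather check from inside a running PySpark session, something like this should also work; note that sparkContext._jvm is an internal py4j gateway, so treat it as a convenience rather than a stable API:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # Spark version, e.g. 3.1.2
# Scala version the JVM side was built with, e.g. "version 2.12.10"
print(spark.sparkContext._jvm.scala.util.Properties.versionString())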
Take a look at Elastic's installation page for elasticsearch-hadoop, which includes the compatibility matrix:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
For Spark, it provides this Maven dependency:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-30_2.12</artifactId>
  <version>7.14.0</version>
</dependency>
Now if you're using PySpark, you may be unfamiliar with Maven, so I can appreciate that being handed a Maven dependency isn't all that helpful.
Here's a minimal way to have Maven fetch the jar for you, without having to get into the weeds of an unfamiliar tool.
Install Maven (apt install maven)
Create a new directory
In that directory, create a file called pom.xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>spark-es</groupId>
  <artifactId>spark-esj</artifactId>
  <version>1</version>
  <dependencies>
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch-spark-30_2.12</artifactId>
      <version>7.14.0</version>
    </dependency>
  </dependencies>
</project>
Save that file and create an additional directory called "targetdir" (it could be called anything)
Then run:
mvn dependency:copy-dependencies -DoutputDirectory=targetdir
You'll find your jar (along with any transitive dependencies Maven pulled in) in targetdir.
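Once it's there, you can point PySpark at it and read from Elasticsearch. A rough sketch, where the jar filename, the Elasticsearch host/port, and the index name my-index are placeholders for your own setup:
from pyspark.sql import SparkSession

# Point Spark at the jar Maven copied into targetdir
# (check the exact filename it downloaded; this one is a placeholder).
spark = (
    SparkSession.builder
    .appName("es-read")  # placeholder name
    .config("spark.jars", "targetdir/elasticsearch-spark-30_2.12-7.14.0.jar")
    .getOrCreate()
)

# Read one index into a DataFrame; host, port and index name are placeholders.
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .load("my-index")
)
df.show()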