
I'm using a spark-shell instance to test pulling data from a client's Kafka source. To launch the instance I am using the command `spark-shell --jars spark-sql-kafka-0-10_2.11-2.5.0-palantir.8.jar, kafka_2.12-2.5.0.jar, kafka-clients-2.5.0.jar` (all jars are present in the working dir).

However, when I run the command `val df = spark.read.format("kafka")...........`, after a few seconds it crashes with the below:

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamingWriteSupportProvider
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:344)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:533)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:89)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:89)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:304)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  ... 48 elided
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamingWriteSupportProvider
  at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 79 more

HOWEVER, if I change the order of the jars in the spark-shell command to `spark-shell --jars kafka_2.12-2.5.0.jar, kafka-clients-2.5.0.jar, spark-sql-kafka-0-10_2.11-2.5.0-palantir.8.jar`, it instead crashes with:

java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer
  at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:376)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateBatchOptions(KafkaSourceProvider.scala:330)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider.createRelation(KafkaSourceProvider.scala:113)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  ... 48 elided
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
  at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 55 more

I am developing behind a very strict proxy managed by our client and am unable to use `--packages` instead. I am at a bit of a loss here: am I unable to load all 3 dependencies at the launch of the shell? Am I missing another step somewhere?

Cam

3 Answers


In the Structured Streaming + Kafka Integration Guide it says:

For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell.

The library you are using seems to be customized and not publicly available in the Maven central repository, which means I cannot look into its dependencies.

However, looking at the latest stable version 2.4.5, the dependency according to the Maven central repository is kafka-clients version 2.0.0.
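If you have the jars locally, that setup could be passed with `--jars`; a minimal sketch, assuming the jar file names match those Maven coordinates (note that spark-shell expects a single comma-separated list with no spaces after the commas):

spark-shell --jars spark-sql-kafka-0-10_2.11-2.4.5.jar,kafka-clients-2.0.0.jar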

Michael Heil
  • Hi Mike, thanks for your response - I downloaded the spark-sql-kafka 2.4.5 lib as well as kafka-clients 2.0.0; however, I get the same error when I run a spark-shell instance including these 2 jars – Cam May 26 '20 at 09:51
  • Try adding *all* dependencies, meaning also `spark-tags` and `spark-sql` as stated in the link I provided. – Michael Heil May 26 '20 at 10:16
  • No joy I'm afraid, adding all dependencies still gives the same error. As a test, I ran `import org.apache.spark.sql.sources.v2.StreamWriteSupport` and the response I got was `object v2 is not a member of package...` – Cam May 26 '20 at 10:32
  • Why do you need to import `org.apache.spark.sql.sources.v2.StreamWriteSupport`? For reading messages from Kafka with structured streaming you do not need this, I suppose. – Michael Heil May 26 '20 at 10:34
  • I only imported it as a test - that's the missing class that causes the crash, but it seems the entire object v2 is not present, not just that one class – Cam May 26 '20 at 10:37

One occasionally disruptive issue is dealing with dependency conflicts in cases where a user application and Spark itself both depend on the same library. This comes up relatively rarely, but when it does, it can be vexing for users. Typically, this will manifest itself when a NoSuchMethodError, a ClassNotFoundException, or some other JVM exception related to class loading is thrown during the execution of a Spark job.

There are two solutions to this problem. The first is to modify your application to depend on the same version of the third-party library that Spark does. The second is to modify the packaging of your application using a procedure that is often called "shading." The Maven build tool supports shading through advanced configuration of the plug-in shown in Example 7-5 (in fact, the shading capability is why the plugin is named maven-shade-plugin). Shading allows you to make a second copy of the conflicting package under a different namespace and rewrites your application's code to use the renamed version. This somewhat brute-force technique is quite effective at resolving runtime dependency conflicts. For specific instructions on how to shade dependencies, see the documentation for your build tool.
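For an sbt-based project, the equivalent effect can be achieved with the sbt-assembly plugin's shade rules; a minimal sketch for build.sbt, assuming sbt-assembly is already added in project/plugins.sbt (the package pattern and the shaded prefix here are illustrative, not specific to this question):

// build.sbt: rename the conflicting package into a private namespace
// so your application's copy cannot clash with the one shipped by Spark
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.kafka.**" -> "shaded.org.apache.kafka.@1").inAll
)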

I would first check the Scala version of the spark-shell, because it can be a Scala version issue:

scala> util.Properties.versionString
res3: String = version 2.11.8

If that is not the issue, then check what Spark version you are using and what third-party library versions you are using as dependencies, because there may be one that is newer or older than your Spark version supports.
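For example, inside spark-shell (the version shown here is only illustrative):

scala> spark.version
res4: String = 2.4.5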

I hope it helps.

Chema
  • Thanks for your reply - the `spark-sql-kafka` package targets Scala version 2.11, which matches the version of Scala I have, `2.11.8` – Cam May 26 '20 at 09:54
  • What Scala version do you have in your build.sbt, and what Scala version is managed by spark-shell? – Chema May 26 '20 at 11:40
  • `2.11.8` for both – Cam May 26 '20 at 11:52
  • And in your development environment it didn't crash, did it? – Chema May 26 '20 at 11:56
  • This is a PoC environment, everything we do is hosted by the client – Cam May 26 '20 at 12:04
  • Ok, but, have you run your code at least once before trying to run it in spark-shell? – Chema May 26 '20 at 12:34
  • No, because the code is literally nothing more than the spark.read line I posted in my question. – Cam May 26 '20 at 12:59
  • Please check your Spark version with the command `print(sc.version)` in spark-shell to see if there is a problem with matching versions. – Chema May 26 '20 at 16:20

You are trying to mix libraries built for different Scala versions (2.11 and 2.12).

Please use libraries built for the same Scala version, and see below for how to load them into spark-shell.

spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,org.apache.kafka:kafka_2.11:2.4.1,org.apache.kafka:kafka-clients:2.4.1
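Since the question mentions that `--packages` is not usable behind the proxy, a possible fallback is to download those same artifacts manually and pass them with `--jars` instead; a sketch, assuming the jar file names match the coordinates above:

spark-shell --jars spark-sql-kafka-0-10_2.11-2.4.5.jar,kafka_2.11-2.4.1.jar,kafka-clients-2.4.1.jar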

Srinivas