
I am able to run my program in standalone mode, but when I try to run it on Dataproc in cluster mode I get the following error. Please help. My build.sbt:

    name := "spark-kafka-streaming"

    version := "0.1"

    scalaVersion := "2.12.10"

    val sparkVersion = "2.4.5"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
    libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
    libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion

    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
    assemblyJarName in assembly := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"

    assemblyMergeStrategy in assembly := {
      case PathList("org", "aopalliance", xs @ _*) => MergeStrategy.last
      case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
      case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
      case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
      case PathList("org", "apache", xs @ _*) => MergeStrategy.last
      case PathList("com", "google", xs @ _*) => MergeStrategy.last
      case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
      case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
      case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
      case "about.html" => MergeStrategy.rename
      case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
      case "META-INF/mailcap" => MergeStrategy.last
      case "META-INF/mimetypes.default" => MergeStrategy.last
      case "plugin.properties" => MergeStrategy.last
      case "log4j.properties" => MergeStrategy.last
      case y: String if y.contains("UnusedStubClass") => MergeStrategy.first
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }
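
For reference, a minimal sketch of the kind of entry point such a jar would contain, compiled against Spark 2.4.5 and the spark-sql-kafka-0-10 dependency above. The broker address and topic name are hypothetical placeholders, not taken from the original job:

    import org.apache.spark.sql.SparkSession

    object Main {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("spark-kafka-streaming").getOrCreate()

        // Read from Kafka via the bundled spark-sql-kafka-0-10 connector.
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-1:9092") // hypothetical broker
          .option("subscribe", "events")                      // hypothetical topic
          .load()

        // Echo records to the console; a real job would write to a proper sink.
        df.writeStream.format("console").start().awaitTermination()
      }
    }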

Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.internal.connector.SimpleTableProvider
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Command used:

    spark-submit --class Main --master yarn --deploy-mode cluster --num-executors 1 --driver-memory 4g --executor-cores 4 --executor-memory 4g --files x.json y.jar

Edit:

Cluster config: image 1.5.4-debian10. `spark-submit --version` reports version 2.4.5, using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_252.

Jar: the uber jar is built with the command sbt assembly.

gcloud command:

    gcloud dataproc jobs submit spark --cluster=xyz --region=us-west1 --class=Main --files x.json --jars=spark-kafka-streaming_2.12-3.0.0_0.1.jar

Logs:

ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: User class threw exception: java.lang.NoClassDefFoundError: org/apache/spark/sql/internal/connector/SimpleTableProvider
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:645)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:629)
  at Lineage$.delayedEndpoint$Lineage$1(Lineage.scala:17)
  at Lineage$delayedInit$body.apply(Lineage.scala:3)
  at scala.Function0.apply$mcV$sp(Function0.scala:39)
  at scala.Function0.apply$mcV$sp$(Function0.scala:39)
  at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
  at scala.App.$anonfun$main$1$adapted(App.scala:80)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.App.main(App.scala:80)
  at scala.App.main$(App.scala:78)
  at Lineage$.main(Lineage.scala:3)
  at Lineage.main(Lineage.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:686)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.internal.connector.SimpleTableProvider
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  ... 49 more

Root cause and solution: As pointed out in the answer, it was a problem with the jar. I was using the IntelliJ IDEA sbt shell for building the jar, and changes made in build.sbt are not loaded again after the shell is launched. So although I changed the version, it was not picked up until I restarted the sbt shell. Learned it the hard way.
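
One way to avoid this (a sketch of standard sbt shell usage, assuming the IDEA sbt shell behaves like a plain sbt shell here) is to re-read the build definition before reassembling:

    reload
    clean
    assembly

reload makes the running shell pick up edits to build.sbt, so the subsequent assembly uses the updated sparkVersion rather than the one cached when the shell started.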

Amit Joshi
    Can you add `spark-submit --version`? You seem to be using Spark 3 (not 2.4.5) as the CNFE is for `SimpleTableProvider` that was just added in [v3.0.0-rc1](https://github.com/apache/spark/commit/9f42be25eba462cca8148ce636d6d3d20123d8fb#diff-a0604cfd3c6c9d66b93cb770892a4cd2). – Jacek Laskowski Jul 16 '20 at 18:31
  • Can you please show the command you used to create the cluster? Which image version is it (1.3, 1.4, 1.5)? Why aren't you using the `gcloud jobs submit spark` command? It will use the correct Spark version. – David Rabinowitz Jul 17 '20 at 01:23
  • @JacekLaskowski, the Spark version is 2.4.5. I logged in to the master node and got this version. This was the first thing I cross-checked when this problem came up. – Amit Joshi Jul 17 '20 at 03:34
  • @DavidRabinowitz, the Dataproc cluster image is 1.5.4-debian10, which is Spark 2.4.5. I logged in to the master node and submitted the job; I thought that would give me more control over YARN commands. But anyhow, I guess that would not have made a difference, as the Spark version on the cluster is 2.4.5. – Amit Joshi Jul 17 '20 at 03:37
  • Can you please log in to your system and execute `spark-submit --version`. What's `y.jar`? What command creates it? Add the answers to your question. Thanks. – Jacek Laskowski Jul 17 '20 at 10:39
  • Which YARN commands are missing? Can you please try to submit via the `gcloud` CLI or the console? – David Rabinowitz Jul 17 '20 at 15:03
  • @JacekLaskowski, please find the edits in the question. y.jar represents the uber jar; it is not the real name of the jar. – Amit Joshi Jul 17 '20 at 18:02
  • @DavidRabinowitz, same exception with the Google Cloud command. Please find the edit in the question. – Amit Joshi Jul 17 '20 at 18:17
  • Can you check out if you use any environment variables that would influence what Spark version you use in the end, e.g. `SPARK_HOME`. Can you check out `$SPARK_HOME/conf/spark-defaults.conf`? – Jacek Laskowski Jul 17 '20 at 20:17
  • Can you also make sure the uber-jar has no Spark classes included (i.e. no classes in org/apache/spark directory). – Jacek Laskowski Jul 17 '20 at 20:23
  • Based on `assemblyJarName in assembly := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"` your jar name should have been `spark-kafka-streaming_2.12-2.4.5_0.1.jar`. The actual jar name implied you may be using spark 3.0.0 API and deploying on spark 2.4.5 – David Rabinowitz Jul 17 '20 at 22:42
  • @DavidRabinowitz, you were spot on with the jar. My bad, for not building the jar properly. – Amit Joshi Jul 18 '20 at 06:37

2 Answers


Based on `assemblyJarName in assembly := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"`, your jar name should have been spark-kafka-streaming_2.12-2.4.5_0.1.jar. The actual jar name implies you may be building against the Spark 3.0.0 API and deploying on Spark 2.4.5.
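
A cheap guard against this class of mismatch (my sketch, not part of the original answer; the hard-coded "2.4" prefix is an assumption about the target cluster) is to assert the runtime Spark version when the job starts:

    import org.apache.spark.sql.SparkSession

    object VersionGuard {
      // spark.version reports the Spark that is actually running the job,
      // which on a mismatched deploy differs from the version built against.
      def check(spark: SparkSession): Unit =
        require(spark.version.startsWith("2.4"),
          s"Built for Spark 2.4.x but running on Spark ${spark.version}")
    }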

David Rabinowitz

Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.internal.connector.SimpleTableProvider

org.apache.spark.sql.internal.connector.SimpleTableProvider was added in v3.0.0-rc1 so you're using spark-submit from Spark 3.0.0 (I guess).
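
For context (my sketch, not part of the original answer): Spark discovers data sources through java.util.ServiceLoader, which is visible in the stack trace above. ServiceLoader instantiates every registered provider, so merely defining a connector class that was compiled against Spark 3 (and hence references SimpleTableProvider) fails on a Spark 2.4.5 classpath:

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.sources.DataSourceRegister

    object ListDataSources extends App {
      // Enumerate every data source registered on the classpath, via the
      // same mechanism as DataSource.lookupDataSource in the stack trace.
      // With a Spark-3-compiled connector on a Spark 2.4.5 classpath, this
      // iteration is where NoClassDefFoundError: SimpleTableProvider surfaces.
      ServiceLoader.load(classOf[DataSourceRegister]).asScala
        .foreach(p => println(s"${p.getClass.getName} -> ${p.shortName()}"))
    }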


I only now noticed that you use --master yarn and the exception is thrown at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:686).

I know nothing about Dataproc, but you should review the configuration of YARN / Dataproc and make sure they don't use Spark 3.

Jacek Laskowski