
I am using TinkerPop + JanusGraph + Spark.

build.gradle

compile group: 'org.apache.tinkerpop', name: 'spark-gremlin', version: '3.1.0-incubating'

Below is the critical configuration we have:

spark.serializer: org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
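
For context, here is a sketch of the kind of graph properties file this serializer setting usually lives in when running with SparkGraphComputer. The property keys are standard TinkerPop/Spark keys and the jar path is the one from the log entry below; the master URL is illustrative only:

```properties
# Illustrative HadoopGraph/SparkGraphComputer configuration sketch
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
spark.master=spark://gdp-identity-stage.target.com:7077   # assumed master URL
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.jars=/opt/data/janusgraph/applib2/spark-gremlin-827a65ae26.jar
```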

In the logs, the corresponding entry shows that the jar containing the above class was added:

{"@timestamp":"2020-02-18T07:24:21.720+00:00","@version":1,"message":"Added JAR /opt/data/janusgraph/applib2/spark-gremlin-827a65ae26.jar at spark://gdp-identity-stage.target.com:38876/jars/spark-gremlin-827a65ae26.jar with timestamp 1582010661720","logger_name":"o.a.s.SparkContext","thread_name":"SparkGraphComputer-boss","level":"INFO","level_value":20000}

But the Spark job submitted by SparkGraphComputer fails. In the executor logs we see:

Caused by: java.lang.ClassNotFoundException: org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer

Why is this exception thrown even though the corresponding jar was added?

Can anyone suggest what is going wrong here?

As mentioned, the exception occurs in the Spark executor. When I opened one of the worker logs, the complete exception was:

Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64/bin/java" "-cp" "/opt/spark/spark-2.4.0/conf/:/opt/spark/spark-2.4.0/jars/*:/opt/hadoop/hadoop-3_1_1/etc/hadoop/" "-Xmx56320M" "-Dspark.driver.port=43137" "-XX:+UseG1GC" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-Xloggc:/opt/spark/gc.log" "-Dtinkerpop.gremlin.io.kryoShimService=org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@gdp-identity-stage.target.com:43137" "--executor-id" "43392" "--hostname" "192.168.192.10" "--cores" "6" "--app-id" "app-20200220094335-0001" "--worker-url" "spark://Worker@192.168.192.10:36845"
========================================

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
    at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:259)
    at org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:280)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:283)
    at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:200)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:221)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more

When setting the spark.jars property on the graph, I am passing this jar location as well.

The jar we build from the application is a fat jar, meaning it contains our own code plus all required dependencies.

Bravo

1 Answer


If you look at the logs, you see this

java" "-cp" "/opt/spark/spark-2.4.0/conf/:/opt/spark/spark-2.4.0/jars/*:/opt/hadoop/hadoop-3_1_1/etc/hadoop/"

Unless you have the Gremlin JARs in the /opt/spark/spark-2.4.0/jars/ folder on each Spark worker, the class you're using doesn't exist on the executor's classpath.
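
You can reproduce what the executor does when it hits `spark.serializer`: it resolves the class name via `Class.forName`, which throws `ClassNotFoundException` if the jar is not on the JVM's classpath. A minimal sketch (the `ClassLoadCheck` class and `isOnClasspath` helper are hypothetical names, not part of Spark):

```java
public class ClassLoadCheck {
    /** Returns whether the named class is visible to this JVM's classpath. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On an executor whose -cp lacks the spark-gremlin jar, this is false,
        // regardless of what "Added JAR" messages the driver logged.
        System.out.println(isOnClasspath(
            "org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer"));
        // java.lang.String ships with the JDK, so this is always true.
        System.out.println(isOnClasspath("java.lang.String"));
    }
}
```

The "Added JAR" log line only means the driver registered the jar for download by tasks; the executor JVM itself is launched with the fixed `-cp` shown above, which is why the serializer class must already be present there.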

The recommended way to include it for your specific application would be the Gradle Shadow plugin rather than --packages or spark.jars.
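
For reference, a minimal build.gradle sketch using the Shadow plugin (the plugin id `com.github.johnrengelman.shadow` is the real one; the plugin version shown is illustrative, so check the Shadow documentation for the version matching your Gradle release):

```groovy
plugins {
    id 'java'
    // Version is illustrative; pick the one compatible with your Gradle version.
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

dependencies {
    compile group: 'org.apache.tinkerpop', name: 'spark-gremlin', version: '3.1.0-incubating'
}
```

Running `gradle shadowJar` then produces a single fat jar under build/libs/ that bundles your code together with its compile dependencies, which you submit as your application jar.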

OneCricketeer
  • thanks for the response, can you please provide more details on Gradle Shadow plugin and how to use this in my case – Bravo Feb 21 '20 at 04:45
  • the jar which we created from the code is of type fat jar only, it contains code and all dependency jars also. one doubt why spark.jars which we set is not working ? – Bravo Feb 21 '20 at 05:07
  • 1) I don't know your full spark-submit command. 2) You should not need spark.jars with a fat/uber jar (created by the Shadow plugin). 3) I see no reference to your jar in the java command that is displayed, so it's not clear to me how your code would be loaded at all. – OneCricketeer Feb 21 '20 at 09:12