
Can Spark 2.4.2 be used as the execution engine with Hive 2.3.4 on Amazon EMR?

I have linked the Spark jars (scala-library, spark-core, spark-network-common) into Hive via the following commands:

cd $HIVE_HOME/lib
ln -s $SPARK_HOME/jars/spark-network-common_2.11-2.4.2.jar
ln -s $SPARK_HOME/jars/spark-core_2.11-2.4.2.jar
ln -s $SPARK_HOME/jars/scala-library-2.11.12.jar
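
To sanity-check the links, a quick listing (assuming the layout above) shows whether each symlink resolves to a real jar:

# Broken links would show a dangling target here.
ls -l $HIVE_HOME/lib | grep -E 'spark|scala'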

Added the following settings in hive-site.xml:

<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>Use Spark as the execution engine</description>
</property>
<property>
    <name>spark.master</name>
    <value>spark://<EMR hostname>:7077</value>
</property>
<property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
</property>
<property>
    <name>spark.eventLog.dir</name>
    <value>/tmp</value>
</property>
<property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://<EMR hostname>:54310/spark-jars/*</value>
</property>
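
For reference, spark.yarn.jars expects the Spark jars to already be in that HDFS directory; a minimal sketch of uploading them, assuming $SPARK_HOME points at the local Spark install:

# Assumption: /spark-jars does not yet exist on HDFS.
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark-jars/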

Spark is up and running, and I can also run Hive queries through PySpark. But when I try to use Spark as the execution engine for Hive with the above configuration, it throws the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable
    at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
    at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
    at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
    at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
    at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
    at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
    at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
    at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
    at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
    at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11293)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: scala.collection.Iterable
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 33 more

Is this a configuration error or a version incompatibility?

Also, Hive is working perfectly with Tez.


1 Answer


This is a clear indication of a Scala library mismatch: the Scala jars Hive picks up are incompatible with the ones your Spark build expects.

Tez does not use Spark or Scala, which is why it works fine. Spark is written in Scala, and when Hive cannot find the right Scala version on its classpath, you get

java.lang.NoClassDefFoundError: scala/collection/Iterable

This is a very common issue when using Hive with Spark as the execution engine.

Steps:

  1. Go to $HIVE_HOME/bin/hive.

  2. Take a backup of the file before editing $HIVE_HOME/bin/hive.

  3. Find the CLASSPATH variable, which first adds all the Hive jars:

    CLASSPATH=${CLASSPATH}:${HIVE_LIB}/*.jar
    for f in ${HIVE_LIB}/*.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done


Right after that, append the Spark jars to the same CLASSPATH variable that already holds all the Hive libraries:

for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

Now the Hive jars and the Spark jars are in the same classpath variable. The Spark jars directory includes the Scala libraries that match Spark, so there are no version compatibility issues.
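
Put together, the edited section of $HIVE_HOME/bin/hive would look roughly like this (a sketch; the exact surrounding lines vary by Hive release):

# Existing Hive classpath setup (already in the script):
CLASSPATH=${CLASSPATH}:${HIVE_LIB}/*.jar
for f in ${HIVE_LIB}/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# Added: pull in the Spark jars, including the matching scala-library,
# so Hive and Spark agree on the Scala version.
for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done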

  4. Now change the Hive execution engine to point to Spark in hive-site.xml, which you are already doing:

    <property>
      <name>hive.execution.engine</name>
      <value>spark</value>
      <description>Use Spark as execution engine</description>
    </property>
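
You can also flip the engine per session from the Hive CLI, which is handy for testing before committing the hive-site.xml change:

set hive.execution.engine=spark;
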
  • Another option is to use soft links, as in the example below.

Link jar files: now we make soft links to certain Spark jar files so that Hive can find them:

ln -s /usr/share/spark/spark-2.2.0/dist/jars/spark-network-common_2.11-2.2.0.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/spark-network-common_2.11-2.2.0.jar
ln -s /usr/share/spark/spark-2.2.0/dist/jars/spark-core_2.11-2.2.0.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/spark-core_2.11-2.2.0.jar
ln -s /usr/share/spark/spark-2.2.0/dist/jars/scala-library-2.11.8.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/scala-library-2.11.8.jar
  • Conclusion: in any case, you need to make sure that the right Scala jars are visible to Hive, since Spark as the execution engine depends on them.
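
A quick way to confirm the versions line up (assuming the standard layout used above) is to compare the scala-library jar that Spark ships with the one Hive sees:

# Both listings should show the same scala-library version
# (2.11.x for Spark 2.4.2).
ls $SPARK_HOME/jars/ | grep scala-library
ls -l $HIVE_HOME/lib/ | grep scala-library
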
  • I came across these solutions beforehand, but I wanted to know if Spark 2.4.2 is compatible with Hive 2.3.4, because I added the jar files available in the Spark jars directory via the ln -s command only. The following jars are available in the jars folder: spark-network-common_2.11-2.4.2.jar, spark-core_2.11-2.4.2.jar, scala-library-2.11.12.jar – Shubham Gupta Jul 02 '19 at 05:29
  • Forget about linking jars; did you follow the steps? – Ram Ghadiyaram Jul 02 '19 at 13:11
  • Actually, I did not understand the first solution you mentioned (I tried to copy both blocks of code from step 2 to the end of my $HIVE_HOME/bin/hive file) and then ran hive... It did not work... – Shubham Gupta Jul 02 '19 at 13:48
  • Also, I found that there was a path mistake while linking the jars... But Spark is still not able to connect as the execution engine for Hive. I have posted it as a separate question: https://stackoverflow.com/questions/56853923/failed-to-create-client-spark-as-execution-engine-with-hive – Shubham Gupta Jul 02 '19 at 13:50
  • Copy the Scala library into the jars folder where Hive is installed, run the first solution... and post the result here. – Ram Ghadiyaram Jul 02 '19 at 14:31
  • This is simple to understand: basically, we need to add all Spark jars, including the Scala library, to the Hive classpath. Just echo what you are doing in the shell script and you can follow along. – Ram Ghadiyaram Jul 02 '19 at 14:45