How to configure Hive to use Spark execution engine on Google Dataproc?

Question

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.

Following the instructions at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help: I keep getting `Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable` when I set `hive.execution.engine=spark`.

Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
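Since the error names a missing class, one way to narrow this down is to check which JARs actually contain it. The following is only a minimal diagnostic sketch: the helper name `find_class_in_jars` is hypothetical, and the two directories are assumptions about where Hive and Spark keep their JARs on the cluster. It only inspects JAR files on disk and says nothing about what Hive really puts on its classpath at runtime.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(jar_dir, class_name):
    """Return the names of JARs directly under jar_dir that contain class_name.

    class_name may use the slash-separated form from the error message,
    e.g. "scala/collection/Iterable", or the dotted form.
    """
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(jar.name)
        except (zipfile.BadZipFile, OSError):
            # Skip unreadable entries such as broken symlinks or corrupt archives.
            continue
    return matches


if __name__ == "__main__":
    # Both paths are assumptions about the cluster layout; adjust as needed.
    for directory in ("/usr/lib/hive/lib", "/usr/lib/spark/jars"):
        print(directory, find_class_in_jars(directory, "scala/collection/Iterable"))
```

If the class shows up only under the Spark directory, that would be consistent with Hive simply not seeing the Scala library, though it would not by itself prove that adding the JAR is enough.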

score 4 · Answer 1 · answered Apr 11 '17 at 00:27


This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (`-Phive`), which is neither suggested nor supported by Hive on Spark.

If you really want to run Hive on Spark, you might want to try bringing your own Spark build, compiled as described in the wiki, in an initialization action.

If you just want to get Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.

answered Apr 11 '17 at 00:27

Patrick Clay



Thanks! I actually tried running on Tez before trying Spark, but that didn't work either. I used the initialisation action you mention, which successfully installed Tez, but when I set `hive.execution.engine=tez` I kept getting `Error running query: java.lang.NoClassDefFoundError: org/apache/tez/runtime/api/Event`. Do you know what else I need to configure for Tez to work? – domkck Apr 11 '17 at 09:50
