
Running a Spark (Scala) job on an HDP cluster. Every time the job executes (in both client and cluster mode), a parallel Tez session is also created and submitted to YARN as a separate application. Within the Spark job, a couple of SQL statements are executed on the cluster through 'SparkSession.spark.sql'.
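For context, the job follows the usual pattern sketched below (the app, database, and table names are placeholders, not the real job): Hive support is enabled so that spark.sql can resolve metastore tables, and nothing in the code asks for Tez explicitly.

    import org.apache.spark.sql.SparkSession

    object SqlJob {
      def main(args: Array[String]): Unit = {
        // enableHiveSupport makes Spark read hive-site.xml from its conf directory
        // so that metastore tables can be resolved; the SQL itself runs on Spark's
        // own engine and nothing here requests a Tez session.
        val spark = SparkSession.builder()
          .appName("sql-job")               // placeholder name
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("SELECT col_a, COUNT(*) AS cnt FROM some_db.some_table GROUP BY col_a").show()

        spark.stop()
      }
    }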

According to the YARN logs, the Tez session is created before stage 0, task 0. Trying to understand why two applications show up on the ResourceManager every time spark-submit is called.

Already checked: no explicit connections are made to Hive.

Any leads would be appreciated.

  • Remove the Tez-related configuration from hive-site.xml (for both Spark and Hive) and from Hadoop's mapred-site.xml. – Housheng-MSFT Sep 09 '22 at 01:59
  • @housheng But wouldn't that affect Hive and Spark functionality across the entire cluster as well? For instance, if I remove the hive.execution.engine configuration from hive-site.xml, it might break Hive itself, right? My concern is only the extra Tez session being created with the Spark job; I don't want to affect Hive functionality, since we do submit Hive queries on the Tez engine in this cluster. (A per-job override is sketched below these comments.) – Shubham Acharya Sep 09 '22 at 05:12
  • Spark can work against a Hive deployment that is external to Spark. 1. Make sure the original Hive is working properly. 2. Remove the Tez-related configuration from the hive-site.xml and Hadoop xml files used by Spark and Hive (not required). 3. Copy the MySQL driver to Spark's jars/ directory. 4. The Hive service needs to be started in advance. 5. Copy hive-site.xml and hdfs-site.xml to Spark's conf/ directory (not required). – Housheng-MSFT Sep 09 '22 at 05:25
  • If you use an external Hive, make sure that Hive is running normally; the Hive service needs to be started in advance. – Housheng-MSFT Sep 09 '22 at 05:28
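A minimal sketch of the per-job override mentioned above, assuming the goal is to leave the cluster-wide hive-site.xml untouched. Spark's documented spark.hadoop.* prefix injects a property into the application's Hadoop configuration, so hive.execution.engine could in principle be overridden for this one application; whether that actually suppresses the extra Tez session on this HDP setup is an assumption to verify, not a confirmed fix.

    import org.apache.spark.sql.SparkSession

    // Sketch only: override hive.execution.engine for this application alone,
    // leaving the cluster's hive-site.xml (and Hive-on-Tez queries) untouched.
    // The same override could be passed on the command line as
    //   spark-submit --conf spark.hadoop.hive.execution.engine=mr ...
    val spark = SparkSession.builder()
      .appName("sql-job-no-tez")            // hypothetical app name
      .config("spark.hadoop.hive.execution.engine", "mr")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()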

0 Answers