
While working on a Spark event listener, I am a bit confused by the way Spark behaves.

Scenario 1: Hive table created using the Hive CLI

Suppose EMPLOYEE is a Hive external/internal table created using the Hive CLI. When we read this table through Spark (either the API or SQL), the generated LogicalPlan references a catalog object that clearly identifies the Hive database and table name, as below:

CatalogTable(
  Database: company
  Table: EMPLOYEE
  Created Time: Mon Jun 01 02:59:11 GMT 2022
  Last Access: UNKNOWN
  Created By: Spark
  Type: MANAGED
  Provider: orc
)
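For context, this is how I inspect the plan (a minimal sketch; it assumes a Spark session with Hive support and that company.EMPLOYEE already exists in the metastore):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PlanInspect {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PlanInspect")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> df = spark.table("company.EMPLOYEE");

        // The analyzed plan for a CLI-created table shows a relation node
        // carrying the CatalogTable details (database, table name, provider).
        System.out.println(df.queryExecution().analyzed().treeString());

        spark.close();
    }
}
```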

Scenario 2: Hive table created using Spark

Suppose the same EMPLOYEE table is created using the Spark API (saveAsTable). When we read this table back through Spark (either the API or SQL), the generated LogicalPlan references an HDFS path rather than a catalog table object, even though the table is available in the Hive catalog and is accessible through the Hive CLI just like the earlier EMPLOYEE table (created through the Hive CLI) or any other table.

Following is the Spark code used to write the EMPLOYEE table, in case it helps.

SparkSession spark = SparkSession
        .builder()
        .appName("HiveZipCodePipeline")
        .enableHiveSupport()
        .getOrCreate();

spark.sparkContext().setLogLevel("DEBUG");

Dataset<Row> sqlDF = spark.read().table("company.emp_source");

sqlDF.write().mode(SaveMode.Overwrite).format("csv").saveAsTable("company.employee");

spark.close();

Why this difference? How can we deduce that in the second scenario the source is still a Hive table?
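For what it's worth, the table is clearly still registered in the metastore; here is a hedged sketch of checking that through the catalog API (the database and table names are from my example above):

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalog.Table;

public class CatalogCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CatalogCheck")
                .enableHiveSupport()
                .getOrCreate();

        // Even for the Spark-created table, a metastore entry exists,
        // so the catalog API can see it:
        Table t = spark.catalog().getTable("company", "employee");
        System.out.println(t.database() + "." + t.name());
        System.out.println(t.tableType()); // MANAGED or EXTERNAL

        spark.close();
    }
}
```

So the catalog knows about the table; it is only the LogicalPlan that surfaces the HDFS path instead of the catalog object.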

Gurupraveen