While working on a Spark event listener, I am a bit confused by the way Spark behaves.
Scenario 1: Hive table created using the Hive CLI
Suppose EMPLOYEE is a Hive external/internal table created using the Hive CLI. When we read this table through Spark (either the API or SQL), the generated LogicalPlan contains a reference to a catalog object that gives clear details of the Hive database and Hive table name, as below:
CatalogTable(
Database: company
Table: EMPLOYEE
Created Time: Mon Jun 01 02:59:11 GMT 2022
Last Access: UNKNOWN
Created By: Spark
Type: MANAGED
Provider: orc
)
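For context, this is roughly how I have been printing the plans; a minimal sketch (the class name ExplainEmployee is mine), and the exact explain(true) output can differ slightly across Spark versions:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainEmployee {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("ExplainEmployee")
                .enableHiveSupport()
                .getOrCreate();

        // Print the parsed, analyzed, optimized and physical plans;
        // for the Hive-CLI-created table the analyzed plan shows a
        // relation carrying the CatalogTable details quoted above
        Dataset<Row> df = spark.read().table("company.employee");
        df.explain(true);

        spark.close();
    }
}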
Scenario 2: Hive table created using Spark
Suppose the same EMPLOYEE table is created using the Spark API (saveAsTable). When we read this table back through Spark (either the API or SQL), the generated LogicalPlan references an HDFS path rather than a CatalogTable object, even though the table is available in the Hive catalog and is accessible through the Hive CLI just like the earlier EMPLOYEE table (created through the Hive CLI) or any other table. I have been trying to see this difference by walking the leaf nodes of the analyzed plan, as in the sketch below; note that HiveTableRelation, LogicalRelation and its catalogTable field are internal Catalyst classes, so this is an illustration that may vary across Spark versions, not a supported API:
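import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
import org.apache.spark.sql.execution.datasources.LogicalRelation;

public class LeafInspector {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("LeafInspector")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> df = spark.read().table("company.employee");
        LogicalPlan plan = df.queryExecution().analyzed();

        // Walk the leaf nodes of the analyzed plan
        scala.collection.Seq<LogicalPlan> leaves = plan.collectLeaves();
        for (int i = 0; i < leaves.size(); i++) {
            LogicalPlan leaf = leaves.apply(i);
            if (leaf instanceof HiveTableRelation) {
                // Scenario 1: the CatalogTable rides on the relation itself
                System.out.println("Hive relation: "
                        + ((HiveTableRelation) leaf).tableMeta().identifier());
            } else if (leaf instanceof LogicalRelation) {
                // Scenario 2: a file-based relation; any catalog entry
                // sits in the optional catalogTable field
                LogicalRelation rel = (LogicalRelation) leaf;
                if (rel.catalogTable().isDefined()) {
                    System.out.println("Datasource table: "
                            + rel.catalogTable().get().identifier());
                }
            }
        }

        spark.close();
    }
}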
Following is the Spark code used to write the EMPLOYEE table, in case it helps:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("HiveZipCodePipeline")
        .enableHiveSupport()
        .getOrCreate();
spark.sparkContext().setLogLevel("DEBUG");

// Read the source Hive table and rewrite it as a CSV-format table
Dataset<Row> sqlDF = spark.read().table("company.emp_source");
sqlDF.write().mode(SaveMode.Overwrite).format("csv").saveAsTable("company.employee");

spark.close();
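For completeness, the kind of listener I am working with looks roughly like this sketch, assuming a QueryExecutionListener is the right hook (the class name PlanLoggingListener is mine):
import org.apache.spark.sql.execution.QueryExecution;
import org.apache.spark.sql.util.QueryExecutionListener;

public class PlanLoggingListener implements QueryExecutionListener {
    @Override
    public void onSuccess(String funcName, QueryExecution qe, long durationNs) {
        // The analyzed plan is where the CatalogTable-vs-HDFS-path
        // difference between the two scenarios shows up
        System.out.println("Action " + funcName + " analyzed plan:\n"
                + qe.analyzed().treeString());
    }

    @Override
    public void onFailure(String funcName, QueryExecution qe, Exception exception) {
        System.err.println("Action " + funcName + " failed: " + exception);
    }
}
It is registered on the session with spark.listenerManager().register(new PlanLoggingListener()) before the reads run.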
Why is there this difference? And how can we deduce that, in the second scenario, the source is still a Hive table?