
While working on a Spark event listener, I am a bit confused by the way Spark behaves.

Scenario 1: Hive table created using the Hive CLI

Suppose EMPLOYEE is a Hive external/internal table created using the Hive CLI. When we read this table through Spark (either the API or SQL), the generated LogicalPlan references a catalog object that clearly identifies the Hive database and table name, as below:

CatalogTable(
  Database: company
  Table: EMPLOYEE
  Created Time: Mon Jun 01 02:59:11 GMT 2022
  Last Access: UNKNOWN
  Created By: Spark
  Type: MANAGED
  Provider: orc
)
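For context, this is how I inspect the plan (a minimal sketch; it assumes a Spark session with Hive support and that company.EMPLOYEE already exists in the metastore):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PlanInspect {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PlanInspect")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> df = spark.table("company.EMPLOYEE");

        // The analyzed plan for a CLI-created table shows a relation node
        // carrying the CatalogTable details (database, table name, provider).
        System.out.println(df.queryExecution().analyzed().treeString());

        spark.close();
    }
}
```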

Scenario 2: Hive table created using Spark

Suppose the same EMPLOYEE table is created using the Spark API (saveAsTable). When we read this table back through Spark (either the API or SQL), the generated LogicalPlan references an HDFS path rather than a catalog table object, even though the table is available in the Hive catalog and is accessible through the Hive CLI just like the earlier EMPLOYEE table (created through the Hive CLI) or any other table.

Following is the Spark code used to write the EMPLOYEE table, in case it helps.

SparkSession spark = SparkSession
        .builder()
        .appName("HiveZipCodePipeline")
        .enableHiveSupport()
        .getOrCreate();

spark.sparkContext().setLogLevel("DEBUG");

Dataset<Row> sqlDF = spark.read().table("company.emp_source");

sqlDF.write().mode(SaveMode.Overwrite).format("csv").saveAsTable("company.employee");

spark.close();

Why this difference? How can we deduce that in the second scenario the source is still a Hive table?
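For what it's worth, the table is clearly still registered in the metastore; here is a hedged sketch of checking that through the catalog API (the database and table names are from my example above):

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalog.Table;

public class CatalogCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CatalogCheck")
                .enableHiveSupport()
                .getOrCreate();

        // Even for the Spark-created table, a metastore entry exists,
        // so the catalog API can see it:
        Table t = spark.catalog().getTable("company", "employee");
        System.out.println(t.database() + "." + t.name());
        System.out.println(t.tableType()); // MANAGED or EXTERNAL

        spark.close();
    }
}
```

So the catalog knows about the table; it is only the LogicalPlan that surfaces the HDFS path instead of the catalog object.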

Gurupraveen