I have created the code below to identify whether a directory is a Delta table, a plain file, or an ordinary directory. It's kind of brute force, but it appears to work for the most part. I am wondering if there is a more elegant way to determine this. I am in a Databricks environment using Azure Storage. The details of the code are not important; I am just wondering about an algorithm that is better than what I have here. Any help appreciated.

%scala
import scala.collection.mutable._
import spark.sqlContext.implicits._

case class cls(objectKey:String)

// object keys that point at raw snappy-compressed Parquet data files
val snappyDf = spark.sql("SELECT distinct objectKey FROM silver_latest WHERE objectKey like '%.snappy.parquet%'").as[cls]

// object keys that sit under a _delta_log directory, i.e. belong to a Delta table
val deltaDf = spark.sql("SELECT distinct objectKey FROM silver_latest WHERE objectKey like '%/_delta_log/%'").as[cls]
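For what it's worth, the path-based heuristic those two filters implement can be expressed as a small standalone function, independent of Spark. This is just a sketch of the idea; `classify_object_keys` and the sample keys are hypothetical, not part of my actual code:

```python
from collections import defaultdict

def classify_object_keys(object_keys):
    """Classify each top-level directory as 'delta' or 'parquet' based on
    the object keys found under it, mirroring the two LIKE filters above."""
    kinds = defaultdict(set)
    for key in object_keys:
        root = key.split("/", 1)[0]
        if "/_delta_log/" in key:
            kinds[root].add("delta")    # a _delta_log entry marks a Delta table
        elif key.endswith(".snappy.parquet"):
            kinds[root].add("parquet")  # bare parquet data file
    # _delta_log wins, since Delta tables also contain parquet data files
    return {root: ("delta" if "delta" in k else "parquet")
            for root, k in kinds.items()}

keys = [
    "sales/_delta_log/00000000000000000000.json",
    "sales/part-0000.snappy.parquet",
    "events/part-0001.snappy.parquet",
]
print(classify_object_keys(keys))  # {'sales': 'delta', 'events': 'parquet'}
```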
Alex Ott

1 Answer


If your table is defined in a catalog such as the Hive Metastore or Unity Catalog, you can describe the table's metadata to find this information, known as the "Provider" of the table:

PySpark:

fmt = (
    spark.sql("DESC EXTENDED silver_latest")
    .where("col_name = 'Provider'")
    .select("data_type")
    .collect()[0]
    .data_type
)

# prints provider of the table such as 'parquet' or 'delta'
print(fmt)

Scala:

val fmt = spark.sql("DESC EXTENDED silver_latest")
  .where("col_name = 'Provider'")
  .select("data_type")
  .collect()
  .head
  .getAs[String]("data_type")

// prints provider of the table such as 'parquet' or 'delta'
println(fmt)

You can also query this for multiple tables at once by querying the information_schema in Databricks: https://docs.databricks.com/sql/language-manual/information-schema/tables.html

SELECT 
  `TABLE_NAME`,
  `DATA_SOURCE_FORMAT`,
  `STORAGE_SUB_DIRECTORY`
FROM INFORMATION_SCHEMA.TABLES;

If you're just working with cloud files (the table is not yet defined in the Hive Metastore or Unity Catalog), you can use the Delta SDK's built-in function DeltaTable.isDeltaTable(...): https://docs.delta.io/latest/api/scala/io/delta/tables/DeltaTable$.html#isDeltaTable(sparkSession:org.apache.spark.sql.SparkSession,identifier:String):Boolean

Scala:

import io.delta.tables.DeltaTable

// returns true if the path holds a Delta table (i.e. it has a _delta_log)
DeltaTable.isDeltaTable("s3://path/to/table/")
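If the Delta SDK isn't available, the check that isDeltaTable performs can be approximated by listing the directory yourself and looking for a `_delta_log` child. Below is a sketch with a pluggable `list_dir` callback, since the actual listing call depends on your storage client (on Databricks it would typically be something like dbutils.fs.ls); the helper names and the in-memory stand-in are mine, not a real API:

```python
def is_delta_dir(path, list_dir):
    """Return True if `path` looks like a Delta table root, i.e. one of its
    immediate children is a _delta_log directory. `list_dir` is any callable
    that returns the child names of a path (storage-client specific)."""
    try:
        children = list_dir(path)
    except FileNotFoundError:
        return False
    return any(name.rstrip("/").rsplit("/", 1)[-1] == "_delta_log"
               for name in children)

# usage with an in-memory stand-in for the storage listing
fake_fs = {
    "abfss://container/tables/sales": ["_delta_log/", "part-0000.snappy.parquet"],
    "abfss://container/tables/raw":   ["part-0001.snappy.parquet"],
}

def fake_lister(path):
    if path not in fake_fs:
        raise FileNotFoundError(path)
    return fake_fs[path]

print(is_delta_dir("abfss://container/tables/sales", fake_lister))  # True
print(is_delta_dir("abfss://container/tables/raw", fake_lister))    # False
```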
Zach King