How do I create external Delta tables on Azure Data Lake Storage? I am currently working on a migration project (from PySpark/Hadoop to Azure) and couldn't find much documentation around creating unmanaged tables with Delta Lake on Azure. Here is the sequence of operations that I currently perform in the PySpark/Hive/HDFS setup; I wonder how I can accomplish the same on Azure.
Actions, in sequence:
- Create a dataframe DF.
- Drop the Hive external table if it exists, (re)create it, and load dataframe DF into it using
DF.write.insertInto("table")
- Create a dataframe DF1.
- Drop the Hive external table if it exists, (re)create it, and load dataframe DF1 into it using
DF1.write.insertInto("table")
Even though I perform "DROP TABLE IF EXISTS" before loading the 2nd dataframe, when I query the table after step 4 I can see the content of both dataframes, because dropping a Hive external table only removes the table definition, not the underlying data files. Here is how it looks:
>>> df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
>>> df1 = spark.createDataFrame([('abcd1','xyz1')], ['s', 'd'])
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")
>>> df.write.insertInto("mydb.test_table",overwrite=True)
>>> spark.sql('DROP TABLE IF EXISTS mydb.test_table')
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")  # re-create the external table over the same location
>>> df1.write.insertInto("mydb.test_table",overwrite=False)
>>> spark.sql("select * from mydb.test_table").show()
+-----+----+
| s| d|
+-----+----+
|abcd1|xyz1|
| abcd| xyz|
+-----+----+
I am trying to do something similar with an Azure Delta Lake table, using the steps below:
- Create the dataframes.
- Save the dataframes to ADLS (this is probably what I am doing wrong here; should I use a mounted DBFS path instead of the abfss:// container URI? See the mount sketch after this list.)
- Create an unmanaged table on top of that path.
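I have not actually set up a mount yet; this is just a minimal sketch of what I believe the mount would look like, based on dbutils.fs.mount with an OAuth service principal (the mount point name, secret scope, and IDs are placeholders):

# Hypothetical ADLS Gen2 mount via a service principal (placeholder IDs and secret scope)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the container so it is reachable under a dbfs:/mnt/... path
dbutils.fs.mount(
    source="abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/",
    mount_point="/mnt/mycontainer",
    extra_configs=configs,
)

If that is the right approach, I assume table_path would then become something like "/mnt/mycontainer/stage/test_table" instead of the abfss:// URI.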
Here is the code in my Databricks notebook.
df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
table_path = f"abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table"
df.write.format("delta").mode("overwrite").option("path",table_path)
spark.sql("CREATE TABLE test_table USING DELTA LOCATION 'abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table'")
However, this does not write the dataframe to the location table_path, and the final step fails to create the table (perhaps a dbfs:/ mount path is required here?). How can I perform similar operations using unmanaged Delta Lake tables?
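For reference, this is roughly how I am checking the target location from the notebook (dbutils.fs.ls is the standard Databricks file utility; same placeholder account/container as above); nothing Delta-related shows up under stage/test_table:

# List the parent folder to see whether a _delta_log directory or any parquet files were written
dbutils.fs.ls("abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/")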