
How do I create external Delta tables on Azure Data Lake Storage? I am currently working on a migration project (from PySpark/Hadoop to Azure) and couldn't find much documentation on creating unmanaged tables in Azure Delta Lake. Here is a sequence of operations I can currently perform in the PySpark/Hive/HDFS setup; I wonder how I can establish the same on Azure.

Actions in sequence:

  1. Create a dataframe DF.
  2. Drop the Hive external table if it exists, then load dataframe DF into the external table using DF.write.insertInto("table").
  3. Create a dataframe DF1.
  4. Drop the Hive external table if it exists, then load dataframe DF1 into the external table using DF1.write.insertInto("table").

Even though I perform "drop table if exists" before loading the 2nd dataframe, when I query the table after step 4 I can see content from both dataframes, because dropping a Hive external table removes only the table structure and not the actual data. Here is how it looks:

>>> df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
>>> df1 = spark.createDataFrame([('abcd1','xyz1')], ['s', 'd'])
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")
>>> df.write.insertInto("mydb.test_table",overwrite=True)
>>> spark.sql('DROP TABLE IF EXISTS mydb.test_table')
>>> df1.write.insertInto("mydb.test_table",overwrite=False)
>>> spark.sql("select * from mydb.test_table").show()
+-----+----+
|    s|   d|
+-----+----+
|abcd1|xyz1|
| abcd| xyz|
+-----+----+

I am trying to perform something similar with an Azure Delta Lake table, using the steps below:

  1. Create the dataframes.
  2. Save the dataframes to ADLS (this is probably what I am doing wrong; should I use a mounted DBFS path instead of the container?).
  3. Create an unmanaged table on top of this path.

Here is the code in my Databricks notebook.

df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
table_path = f"abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table"
df.write.format("delta").mode("overwrite").option("path",table_path)
spark.sql("CREATE TABLE test_table USING DELTA LOCATION 'abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table'")

However, it is not writing the dataframe to table_path, and the final step fails to create the table (is a dbfs: mount path required here?). How can I perform similar operations using unmanaged Delta Lake tables?

  • Did you check the `spark-config` under the `cluster` `Advanced Settings` menu? Essentially, there are two ways: either you mount your ADLS location or provide its details via `spark-config`. Having said that, coming to your code, see this [link](https://stackoverflow.com/questions/56006982/databricks-failing-to-write-from-a-dataframe-to-a-delta-location). Hopefully, these should solve your issue. – Dipanjan Mallick Mar 31 '22 at 03:46
  • Try adding saveAsTable at the end: `df.write.format("delta").mode("overwrite").option("path", table_path).saveAsTable('table_name')`. – intruderr Apr 01 '22 at 22:38

1 Answer

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
  COMMENT 'This table is created with existing data'
  LOCATION 'spark-warehouse/tables/my_existing_table'

Use the above method to create an external table in Delta Lake, similar to the Hive approach.
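
Applied to the code in the question, a minimal sketch of the full flow might look like the following. The abfss path and table name are taken from the question as placeholders, and the cluster is assumed to already have access to the storage account (via spark config or a mount, as mentioned in the comments). Note that chaining only .option("path", ...) configures the writer without writing anything, so the write is completed with .save() before the unmanaged table is created on top of that location:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder ADLS Gen2 location from the question; replace container/account as needed.
table_path = "abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table"

df = spark.createDataFrame([('abcd', 'xyz')], ['s', 'd'])

# Write the data as Delta to the external location. Without .save()
# (or .saveAsTable()), the writer is only configured and nothing is written.
df.write.format("delta").mode("overwrite").save(table_path)

# Register an unmanaged (external) table on top of the existing Delta files;
# dropping this table later removes only the metadata, not the data in ADLS.
spark.sql(f"CREATE TABLE IF NOT EXISTS test_table USING DELTA LOCATION '{table_path}'")

With this in place, dropping and recreating the table behaves like the Hive external table in the question: a DROP TABLE removes only the metastore entry while the Delta files in ADLS survive.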

For a complete reference, check the documentation: https://learn.microsoft.com/en-us/azure/databricks/spark/2.x/spark-sql/language-manual/create-table

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING data_source
  [OPTIONS (key1 [ = ] val1, key2 [ = ] val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ...)]
  [AS select_statement]

The [LOCATION path] clause specifies the external path; it is what makes the table an external (unmanaged) table.
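
As an illustrative sketch (not taken from the linked doc), here is how the LOCATION clause maps onto the ADLS path from the question; the path and table name are placeholders:

# Hypothetical external location reused from the question.
table_path = "abfss://mycontainer@xxxxxxxxxxxx.dfs.core.windows.net/stage/test_table"

# Supplying a schema plus LOCATION creates an unmanaged Delta table at that path
# (or picks up existing Delta data there, provided the schema matches it).
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS test_table (s STRING, d STRING)
    USING DELTA
    LOCATION '{table_path}'
""")

# DROP TABLE removes only the metastore entry; the Delta files remain in ADLS.
spark.sql("DROP TABLE IF EXISTS test_table")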
