
I have an external Delta table partitioned by year in my storage, and when I try to "mount" the Delta table in Databricks using this code:

CREATE TABLE if not exists profitek.products
                   USING DELTA                 
                   LOCATION 'dbfs:/mnt/curated-l1/profitek/products'

I get this error:

You are trying to create an external table spark_catalog.general.products from dbfs:/mnt/curated-l1/general/products using Databricks Delta, but there is no transaction log present at dbfs:/mnt/curated-l1/general/products/_delta_log. Check the upstream job to make sure that it is writing using format("delta") and that the path is the root of the table.

The _delta_log directory is inside every partition folder:

products
│
├── 2022-01-01
│   ├── _delta_log
│   ├── part_01.parquet
│   └── part_02.parquet
├── 2022-01-02
    ├── _delta_log
    ├── part_01.parquet
    └── part_02.parquet

So I can mount a PART of the table using something like this:

CREATE TABLE if not exists profitek.products
                   USING DELTA                 
                   LOCATION 'dbfs:/mnt/curated-l1/profitek/products/2022-01-01'

How can I "mount" the full table, with all its partitions?

Andrés Bustamante

1 Answer


Your Delta table is not partitioned using Delta's own partitioning capabilities, i.e. df.write.format("delta").partitionBy(<date_column_name>).save(path).
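
For reference, a minimal sketch of such a write, assuming the upstream job targets the mount location from your question and that the table has a date column named "date" (both are assumptions, adapt them to your data):

# Write the DataFrame as ONE Delta table, partitioned by a date column.
# "date" and the target path are assumptions -- use your own column and location.
df.write.format("delta").partitionBy("date").save("dbfs:/mnt/curated-l1/profitek/products")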

If done in the way described above, the directory structure would look like this:

products
├── _delta_log
│   └── 0000000.json
├── 2022-01-01
│   ├── part_01.parquet
│   └── part_02.parquet
├── 2022-01-02
    ├── part_01.parquet
    └── part_02.parquet

Judging by the directory structure you provided, it seems the partitions were written explicitly, i.e. each partition was written as its own Delta table to its own location. Something like:

  • df_2022_01_01.write.format("delta").save("products/2022-01-01")
  • df_2022_01_02.write.format("delta").save("products/2022-01-02")

To get all the partitions into one DataFrame, you need to read the individual ones and union them (assuming the schema is the same):

df_2022_01_01 = spark.read.format("delta").load("products/2022-01-01")
df_2022_01_02 = spark.read.format("delta").load("products/2022-01-02")
# then union the two DataFrames
df = df_2022_01_01.union(df_2022_01_02)

I've shown this for only 2 DataFrames; you can use the same approach to union any number of them in a for-loop, as sketched below.
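
A sketch of that loop, assuming only the two partition folders from your listing (extend the list, or discover the folders with dbutils.fs.ls, for the real table):

from functools import reduce

# Partition folders that were each written as a separate Delta table
# (only the two from the listing above -- extend as needed)
partition_paths = [
    "dbfs:/mnt/curated-l1/profitek/products/2022-01-01",
    "dbfs:/mnt/curated-l1/profitek/products/2022-01-02",
]

# Read each folder and union them all; the schemas must match
dfs = [spark.read.format("delta").load(p) for p in partition_paths]
df = reduce(lambda a, b: a.unionByName(b), dfs)

unionByName matches columns by name rather than by position; plain union (as above) works too if every partition has exactly the same column order.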

Now that you have the complete data in one DataFrame, you can write it back out partitioned with Delta for easier access later on: df.write.format("delta").partitionBy(<date_column_name>).save("/products/complete-delta")
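
Once that rewrite is done, the root folder contains a single _delta_log, so the CREATE TABLE from your question works against the new location. A sketch, reusing the /products/complete-delta path from the line above:

# Register the rewritten Delta table; _delta_log now sits at the root of LOCATION
spark.sql("""
    CREATE TABLE IF NOT EXISTS profitek.products
    USING DELTA
    LOCATION '/products/complete-delta'
""")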

o_O