
By default, when you use a Hive-style partition directory structure, the Auto Loader option cloudFiles.partitionColumns adds these columns to your schema automatically (using schema inference).

This is the code:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)

But can we have an option to also create partitions on the target table, like you can with a plain CREATE TABLE? (E.g. if you have the classical structure /year=xxxx/month=xxx/day=xx.)
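
For reference, a minimal sketch of the read side with the partition columns declared explicitly rather than inferred (same placeholder paths as above; year, month, day match the directory layout mentioned here):

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
df = (
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    # comma-separated list of Hive-style partition columns to parse from the path
    .option("cloudFiles.partitionColumns", "year,month,day")
    .load("s3://autoloader-source/json-data")
)

This only adds the columns to the schema, though; it doesn't partition the target table.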

  • Could you provide the code for the write to the target table? – Robert Kossendey Dec 15 '22 at 16:08
  • Nothing more than the code from the Databricks documentation ```checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table" (spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.schemaLocation", checkpoint_path) .load("s3://autoloader-source/json-data") .writeStream .option("checkpointLocation", checkpoint_path) .trigger(availableNow=True) .toTable("dev_catalog.dev_database.dev_table"))``` – alxsbn Dec 15 '22 at 16:18

1 Answer


You can use the .partitionBy() function.

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("col1", "col2")
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)
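
For the /year=xxxx/month=xxx/day=xx layout from the question, that would be .partitionBy("year", "month", "day"), assuming those columns end up in the schema (either inferred from the directory names or declared explicitly via cloudFiles.partitionColumns).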