
By default, when you use a Hive-style partition directory structure, the Auto Loader option cloudFiles.partitionColumns adds these columns to your schema automatically (using schema inference).

This is the code:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)

But can we have an option to also create partitions on the target table, like you can with a plain CREATE TABLE? (E.g. if you have the classical structure /year=xxxx/month=xxx/day=xx.)
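
For reference, a minimal sketch of the read side with the partition columns declared explicitly rather than inferred (same placeholder paths as above; year, month, day match the directory layout mentioned here):

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
df = (
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    # comma-separated list of Hive-style partition columns to parse from the path
    .option("cloudFiles.partitionColumns", "year,month,day")
    .load("s3://autoloader-source/json-data")
)

This only adds the columns to the schema, though; it doesn't partition the target table.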

  • Could you provide the code for the write to the target table? – Robert Kossendey Dec 15 '22 at 16:08
  • Nothing more than the code from the Databricks documentation ```checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table" (spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.schemaLocation", checkpoint_path) .load("s3://autoloader-source/json-data") .writeStream .option("checkpointLocation", checkpoint_path) .trigger(availableNow=True) .toTable("dev_catalog.dev_database.dev_table"))``` – alxsbn Dec 15 '22 at 16:18

1 Answer


You can use the .partitionBy() function.

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("col1", "col2")
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)
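
For the /year=xxxx/month=xxx/day=xx layout from the question, that would be .partitionBy("year", "month", "day"), assuming those columns end up in the schema (either inferred from the directory names or declared explicitly via cloudFiles.partitionColumns).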