
I am using Databricks' Auto Loader functionality to process JSON files from a directory and save them into a Delta table in another subdirectory.

My code looks like this:

transporters = (spark
    .readStream
    .format("cloudFiles")                                       # Auto Loader source
    .option("cloudFiles.format", "json")                        # incoming files are JSON
    .option("recursiveFileLookup", "true")                      # pick up files in nested directories
    .schema(transporters_schema)
    .load(source_files_path)
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", auto_loader_checkpoints_path)
    .trigger(availableNow=True)                                 # process all available files, then stop
    .start(target_table_path)
)

For some reason, the Delta table contains hundreds of subdirectories with Parquet files, like the following:

01
|___ part_00001_fsdgwsdg_afafafafa.snappy.parquet
|___ part_00002_fsdgwsdg_afafafafa.snappy.parquet
|___ part_00003_fsdgwsdg_afafafafa.snappy.parquet
02
03
0f
0J
0j
0K
0o
0R
...

I did not expect these subdirectories to appear. What could be causing them?


1 Answer


Most probably it's not Auto Loader but Delta Lake itself that creates these directories. Auto Loader doesn't create directories with data. Check whether the delta.randomizeFilePrefixes property is set to true on your Delta table.

See the documentation for more details; it says:

true for Delta Lake to generate a random prefix for a file path instead of partition information. For example, this may improve Amazon S3 performance when Delta Lake needs to send very high volumes of Amazon S3 calls to better partition across S3 servers.
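If that is the case, you can confirm and change the setting from a notebook. This is only a sketch: it reuses the target_table_path variable from the question and assumes a Databricks / Delta Lake runtime that supports path-based delta.`...` table references:

# Inspect the table properties of the path-based Delta table
spark.sql(f"SHOW TBLPROPERTIES delta.`{target_table_path}`").show(truncate=False)

# If delta.randomizeFilePrefixes is true, you can switch it off; newly
# written files will then land at the table root again, while the files
# already written under random prefixes stay where they are until they
# are rewritten (e.g. by OPTIMIZE)
spark.sql(f"""
    ALTER TABLE delta.`{target_table_path}`
    SET TBLPROPERTIES ('delta.randomizeFilePrefixes' = 'false')
""")

Note that the random prefixes themselves are harmless, since Delta reads files through the transaction log rather than by listing directories, so this is mostly a cosmetic concern (or an S3 performance optimization, as the documentation describes).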

Alex Ott