
I'm using the Databricks service for analysis. I have built a connection to ADLS Gen2 storage and created a mount point. The container holds a folder per year, with month subfolders, and each month folder contains parquet files. I need to read all of those files and create a single target file with the complete data for all months. How do I achieve this? Can anyone suggest an approach?

Arun

1 Answer


Assuming your parquet files follow a specific directory pattern, you can use wildcards.

If your files are written in a pattern like /mnt/point/folder/YYYY/MM/foo.parquet, you can traverse all YYYY and MM folders and read every file with /mnt/point/folder/*/*.

Here's a reproducible PySpark example, assuming you have a mount point called `data`.

# Two small example DataFrames standing in for one month of data each
df_A = spark.createDataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3]
], ["Letter", "Number"])
df_B = spark.createDataFrame([
    ['b', 1],
    ['b', 2],
    ['b', 3]
], ["Letter", "Number"])

# Write each DataFrame to its own year/month folder under the mount point
df_A.write.parquet('/mnt/data/mydata/1999/01')
df_B.write.parquet('/mnt/data/mydata/2001/09')

# Read all month folders at once with wildcards at the year and month levels
new_df = spark.read.parquet('/mnt/data/mydata/*/*')
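
If you also need the combined data written out as a single target file, one option is to coalesce to a single partition before writing (reasonable for modest data volumes). The output path below is just an example, not something from your setup:

# Sketch: collapse to one partition so Spark writes a single part file
# ('/mnt/data/mydata/combined' is only an example output path)
new_df.coalesce(1).write.mode('overwrite').parquet('/mnt/data/mydata/combined')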

Per @Alex Ott's comment, if your data were partitioned (e.g. a folder named year=1999 with subfolders named month=01, month=02, etc.), you could take advantage of partition discovery, and Spark would automatically realize that it should traverse all the sub-folders.
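
As a rough sketch of what that would look like (the mydata_partitioned path and the literal year/month values are just placeholders for illustration), you would write with partitionBy and then read the base path:

# Sketch: add year/month columns and write partitioned by them
# (path and values below are placeholders)
from pyspark.sql import functions as F

df = df_A.withColumn('year', F.lit(1999)).withColumn('month', F.lit(1))
df.write.partitionBy('year', 'month').parquet('/mnt/data/mydata_partitioned')

# Reading the base path discovers the year=.../month=... folders
# and exposes year and month as columns in the result
partitioned_df = spark.read.parquet('/mnt/data/mydata_partitioned')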

Will J