
I'm using the Databricks service for analysis. I have built a connection to ADLS Gen2 storage and created a mount point. The container holds a folder per year, with month subfolders, and each month folder contains parquet files. I need to read all of those files and create a single target file with the complete data for all months. How do I achieve this? Can anyone suggest an approach?

Arun

1 Answer


Assuming your parquet files follow a specific directory pattern, you can use wildcards.

If your files are written in a pattern like /mnt/point/folder/YYYY/MM/foo.parquet, you can traverse all YYYY and MM folders and read every file with /mnt/point/folder/*/*.

Here's a reproducible PySpark example, assuming you have a mount point called `data`.

# Two small example DataFrames standing in for one month of data each
df_A = spark.createDataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3]
], ["Letter", "Number"])
df_B = spark.createDataFrame([
    ['b', 1],
    ['b', 2],
    ['b', 3]
], ["Letter", "Number"])

# Write each DataFrame to its own year/month folder under the mount point
df_A.write.parquet('/mnt/data/mydata/1999/01')
df_B.write.parquet('/mnt/data/mydata/2001/09')

# Read all month folders at once with wildcards at the year and month levels
new_df = spark.read.parquet('/mnt/data/mydata/*/*')
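
If you also need the combined data written out as a single target file, one option is to coalesce to a single partition before writing (reasonable for modest data volumes). The output path below is just an example, not something from your setup:

# Sketch: collapse to one partition so Spark writes a single part file
# ('/mnt/data/mydata/combined' is only an example output path)
new_df.coalesce(1).write.mode('overwrite').parquet('/mnt/data/mydata/combined')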

Per @Alex Ott's comment, if your data were partitioned (e.g. a folder named year=1999 with subfolders named month=01, month=02, etc.), you could take advantage of partition discovery, and Spark would automatically realize that it should traverse all the sub-folders.
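
As a rough sketch of what that would look like (the mydata_partitioned path and the literal year/month values are just placeholders for illustration), you would write with partitionBy and then read the base path:

# Sketch: add year/month columns and write partitioned by them
# (path and values below are placeholders)
from pyspark.sql import functions as F

df = df_A.withColumn('year', F.lit(1999)).withColumn('month', F.lit(1))
df.write.partitionBy('year', 'month').parquet('/mnt/data/mydata_partitioned')

# Reading the base path discovers the year=.../month=... folders
# and exposes year and month as columns in the result
partitioned_df = spark.read.parquet('/mnt/data/mydata_partitioned')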

Will J