I'm using Databricks for analysis. I have built a connection to ADLS Gen2 storage and created a mount point. The container holds folders for each year, with month folders inside them, and each month folder contains parquet files. I need to read all of those files and produce a single target file containing the data for all months. How do I achieve this? Can anyone suggest an approach?
Read files from multiple folders in ADLS Gen2 storage via Databricks and create a single target file
- Is the data partitioned - do your directories have names like `col=value`, or are they just plain directories? – Alex Ott Nov 03 '21 at 21:06
- Just directories – Arun Nov 05 '21 at 20:02
- Can you share the file type, the schema of the files, and the schema of the final output as well? – Will J Nov 06 '21 at 01:03
1 Answer
Assuming your parquet files follow a consistent directory pattern, you can use wildcards.
If your files are written in a pattern like /mnt/point/folder/YYYY/MM/foo.parquet, you can traverse all YYYY and MM folders and read every file with /mnt/point/folder/*/*.
Here's a reproducible PySpark example, assuming you have a mount point called `data`.
# Two sample DataFrames standing in for two months of data
df_A = spark.createDataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3]
], ["Letter", "Number"])
df_B = spark.createDataFrame([
    ['b', 1],
    ['b', 2],
    ['b', 3]
], ["Letter", "Number"])

# Write each DataFrame to its own year/month folder under the mount point
df_A.write.parquet('/mnt/data/mydata/1999/01')
df_B.write.parquet('/mnt/data/mydata/2001/09')

# Read every parquet file across all year and month folders in one pass
new_df = spark.read.parquet('/mnt/data/mydata/*/*')
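Since you want a single target file, one option is to coalesce the combined DataFrame down to one partition before writing, which produces a single part file. This is a sketch: the output path '/mnt/data/combined' and the choice of parquet as the output format are assumptions, so adjust them to your target.
# Coalesce to one partition so Spark writes a single part-*.parquet file.
# Note: this funnels all data through one task, which is fine for modest
# volumes but can be slow or memory-heavy for very large datasets.
new_df.coalesce(1).write.mode('overwrite').parquet('/mnt/data/combined')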
Per @Alex Ott's comment, if your data were partitioned (e.g. a folder named year=1999 with subfolders named month=01, month=02, etc.), you could take advantage of Spark's partition discovery: reading the base path would traverse all sub-folders automatically and expose year and month as columns in the DataFrame.
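For illustration, a minimal sketch of what that would look like; the year=/month= layout here is hypothetical and not what you currently have.
# With a partitioned layout such as /mnt/data/mydata/year=1999/month=01/...,
# pointing Spark at the base path is enough: it discovers the partitions
# and adds 'year' and 'month' as columns to the resulting DataFrame.
partitioned_df = spark.read.parquet('/mnt/data/mydata')
partitioned_df.filter("year = 1999").show()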

Will J