
The file path format is data/year/week_number/day_number/hour/data_hour.parquet, for example:

data/2022/05/01/00/data_00.parquet

data/2022/05/01/01/data_01.parquet

data/2022/05/01/02/data_02.parquet

data/2022/05/01/03/data_03.parquet

data/2022/05/01/04/data_04.parquet

data/2022/05/01/05/data_05.parquet

data/2022/05/01/06/data_06.parquet

data/2022/05/01/07/data_07.parquet

How can I read all these files one by one in a Databricks notebook and store them into a single DataFrame?

import pandas as pd

# Get all the files under the folder
data = dbutils.fs.ls(file)

df = pd.DataFrame(data)

# Create the list of file paths
paths = df.path.tolist()

for i in paths:
    df = spark.read.load(path=f'{i}*', format='parquet')

I am only able to read the last file; the other files are skipped.

1 Answer


The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from one path on each iteration, so only the last file is left in df.
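If you do want to keep the loop, one way to accumulate the files (the union approach that also comes up in the comments below) looks roughly like this; a minimal sketch, assuming paths is the list of full parquet paths you built with dbutils.fs.ls:

# Minimal sketch: accumulate each file into one DataFrame instead of overwriting it.
# `paths` is assumed to be the list of full parquet file paths from dbutils.fs.ls.
df = None
for p in paths:
    part = spark.read.load(path=p, format='parquet')
    df = part if df is None else df.union(part)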

Removing the for loop and trying the code below will give you an idea of how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not):

df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
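To check that the wildcard actually picked up every hourly file (and not just one), you can look at the distinct source file names; a minimal sketch, assuming df was loaded as above:

from pyspark.sql.functions import input_file_name

# Show which parquet files the wildcard path matched.
df.select(input_file_name().alias('source_file')).distinct().show(truncate=False)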

This is what I applied from the answer I shared with you in the comment.

  • days = list["2020/05/06","2020/05/07","2020/05/08","2020/05/09","2020/05/10"] (yy/mm/dd) I need to read the data only from the folders mentioned in the list – heena shaikh May 13 '22 at 13:01
  • days = List['2022/05/06/*/*','2022/05/07/*/*'] paths = days.map(day => "data/" ++day) df=spark.read.format("parquet").load(paths) I tried this but it didn't work – heena shaikh May 13 '22 at 13:44
  • Got it. So you can leave the for loop and apply this code instead `df = df.union(spark.read.load(path=f'{i}',format='parquet'))` ...Kindly accept my answer if this works, Thanks. – Phuri Chalermkiatsakul May 13 '22 at 18:57
  • How come you got this error "df not defined" after this line `df = pd.DataFrame(data)` ? – Phuri Chalermkiatsakul May 23 '22 at 07:13
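For reference, the day-list approach from the comments above can be written in Python (the days.map(day => ...) line is Scala syntax, which is likely why it failed in a Python notebook); a minimal sketch, assuming the files live under /data:

# Build full glob paths for only the requested days, then load them in one call.
days = ['2022/05/06/*/*', '2022/05/07/*/*']
paths = ['/data/' + day for day in days]
df = spark.read.format('parquet').load(paths)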