
The file path format is data/year/week_number/day_number/hour/data_hour.parquet, for example:

data/2022/05/01/00/data_00.parquet

data/2022/05/01/01/data_01.parquet

data/2022/05/01/02/data_02.parquet

data/2022/05/01/03/data_03.parquet

data/2022/05/01/04/data_04.parquet

data/2022/05/01/05/data_05.parquet

data/2022/05/01/06/data_06.parquet

data/2022/05/01/07/data_07.parquet

How can I read all these files one by one in a Databricks notebook and store them into a single DataFrame?

import pandas as pd

# Get all the files under the folder
data = dbutils.fs.ls(file)

df = pd.DataFrame(data)

# Create the list of file paths
paths = df.path.tolist()

for i in paths:
    df = spark.read.load(path=f'{i}*', format='parquet')

I am only able to read the last file; the other files are skipped.

1 Answer


The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from one path on each iteration, so only the last file is left in df.
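If you do want to keep the loop, one way to accumulate the files (the union approach that also comes up in the comments below) looks roughly like this; a minimal sketch, assuming paths is the list of full parquet paths you built with dbutils.fs.ls:

# Minimal sketch: accumulate each file into one DataFrame instead of overwriting it.
# `paths` is assumed to be the list of full parquet file paths from dbutils.fs.ls.
df = None
for p in paths:
    part = spark.read.load(path=p, format='parquet')
    df = part if df is None else df.union(part)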

Removing the for loop and trying the code below will give you an idea of how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not):

df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
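To check that the wildcard actually picked up every hourly file (and not just one), you can look at the distinct source file names; a minimal sketch, assuming df was loaded as above:

from pyspark.sql.functions import input_file_name

# Show which parquet files the wildcard path matched.
df.select(input_file_name().alias('source_file')).distinct().show(truncate=False)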

This is what I applied from the answer I shared with you in the comment.

  • days = list["2020/05/06","2020/05/07","2020/05/08","2020/05/09","2020/05/10"] (yy/mm/dd) I need to read the data only from the folders mentioned in the list – heena shaikh May 13 '22 at 13:01
  • days = List['2022/05/06/*/*','2022/05/07/*/*'] paths = days.map(day => "data/" ++day) df=spark.read.format("parquet").load(paths) I tried this but it didn't work – heena shaikh May 13 '22 at 13:44
  • Got it. So you can leave the for loop and apply this code instead `df = df.union(spark.read.load(path=f'{i}',format='parquet'))` ...Kindly accept my answer if this works, Thanks. – Phuri Chalermkiatsakul May 13 '22 at 18:57
  • How come you got this error "df not defined" after this line `df = pd.DataFrame(data)` ? – Phuri Chalermkiatsakul May 23 '22 at 07:13
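For reference, the day-list approach from the comments above can be written in Python (the days.map(day => ...) line is Scala syntax, which is likely why it failed in a Python notebook); a minimal sketch, assuming the files live under /data:

# Build full glob paths for only the requested days, then load them in one call.
days = ['2022/05/06/*/*', '2022/05/07/*/*']
paths = ['/data/' + day for day in days]
df = spark.read.format('parquet').load(paths)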