I have some Python (3.8) code that does the following:
- Walks directory and subdirectories of a given path
- Finds all .csv files
- Finds all .csv files with 'Pct' in filename
- Joins path and file
- Reads CSV
- Adds filename to df
- Concatenates all dfs together
The code below works, but takes a long time (15 minutes) to ingest all the CSVs - there are 52,000 files. This might not in fact be a long time, but I want to reduce it as much as possible.
My current working code is below:

import os
import fnmatch
import pandas as pd

start_directory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this
df_result = None
for path, dirs, files in os.walk(start_directory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        if 'Pct' in file:  # keep only files whose name contains 'Pct'
            full_name = os.path.join(path, file)  # build the full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file into a DataFrame
            df_tmp['file'] = os.path.basename(file)  # record the source file name
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
Inspired by this post (glob to find files recursively) and this post (how to speed up importing CSVs), I have tried to reduce the time it takes to ingest all the data, but can't figure out how to integrate a filter for filenames that contain 'Pct' and then add the filename to the df. This might not be possible with the code from these examples.
What I have tried below (incomplete):

%%time
import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(f, header=None)
     for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)],
    axis='index', ignore_index=True
)
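For what it's worth, here is a sketch of how the 'Pct' filter and the filename column might be folded into that same approach (the helper name `read_pct_csvs` is just illustrative): filtering on `os.path.basename` keeps only the 'Pct' files, `.assign()` adds the filename column, and building all the frames in a list before a single `pd.concat` avoids the repeated concatenation in the loop version, which re-copies the accumulated data on every iteration.

```python
import os
import glob
import pandas as pd

def read_pct_csvs(start_directory):
    """Recursively read every CSV under start_directory whose name
    contains 'Pct', tagging each row with its source file name."""
    pattern = os.path.join(start_directory, '**', '*.csv')
    files = sorted(f for f in glob.glob(pattern, recursive=True)
                   if 'Pct' in os.path.basename(f))  # filter on the file name only
    # Collect all frames in a list and concatenate once at the end;
    # calling pd.concat inside the loop copies the accumulated result
    # every iteration, which is the main slowdown in the loop version.
    return pd.concat(
        [pd.read_csv(f, header=None).assign(file=os.path.basename(f))
         for f in files],
        axis='index', ignore_index=True
    )
```

Usage would be something like df = read_pct_csvs('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/').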
Question
Is there any way that I can reduce the time to read and ingest the CSVs in my code above?
Thanks!