I have a folder with many Excel files that I need to read as DataFrames. Each file is around 100-300 MB, and it takes several minutes to read a single one.
My CPU uses only 1 core (out of 8) while reading these files.
How can I read them in parallel? I wrote:
import multiprocessing as mp
import pandas as pd

def convert_file_to_pickled_df(path, fname):
    df = pd.read_excel(path + fname)
    # do some other things
    return df

path = 'D:/'
filenames = ['2017_1.xlsx', '2017_2.xlsx', '2017_3.xlsx', '2017_4.xlsx']

pool = mp.Pool(mp.cpu_count())
results = [pool.apply(convert_file_to_pickled_df, args=(path, fname)) for fname in filenames]
pool.close()
But it doesn't seem to work: I still see only 1 core loaded.
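For what it's worth, a likely explanation is that `pool.apply` is blocking: it waits for each call to finish before the list comprehension submits the next one, so the files end up being read one at a time. A minimal sketch of the non-blocking pattern using `pool.map`, with the Excel reading replaced by a hypothetical stand-in worker (`load_file`) so it runs without any real files:

```python
import multiprocessing as mp
from functools import partial

def load_file(path, fname):
    # stand-in for pd.read_excel(path + fname) and the follow-up processing;
    # here it just returns the full path it would have read
    return path + fname

def read_all(path, filenames):
    # pool.map submits every task up front, so workers run concurrently;
    # pool.apply blocks on each call and therefore serializes the work
    with mp.Pool(mp.cpu_count()) as pool:
        return pool.map(partial(load_file, path), filenames)

if __name__ == "__main__":
    print(read_all('D:/', ['2017_1.xlsx', '2017_2.xlsx']))
```

`pool.apply_async` plus a later `result.get()` on each handle would also keep all cores busy, if per-file control is preferred over a single `map`.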