
I have a folder with many Excel files that I need to read as DataFrames. Each file is around 100–300 MB, and it takes many minutes to read a single one.

My CPU uses only 1 core (out of 8) while reading these files.

How can I read them in parallel? This is what I wrote:

import multiprocessing as mp
import pandas as pd

def convert_file_to_pickled_df(path, fname):
    df = pd.read_excel(path + fname)
    # do some other things
    return df

path = 'D:/'
filenames = ['2017_1.xlsx', '2017_2.xlsx', '2017_3.xlsx', '2017_4.xlsx']

pool = mp.Pool(mp.cpu_count())
results = [pool.apply(convert_file_to_pickled_df, args=(path, fname)) for fname in filenames]
pool.close()

But it doesn't seem to work: still only 1 core is loaded.
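For what it's worth, `Pool.apply` blocks until each single call returns, so the list comprehension above runs the files strictly one after another on one core. Below is a minimal sketch of the same code using `Pool.starmap`, which dispatches all jobs to the workers at once; the path and filenames are the ones from the question, and the `os.path.exists` filter is added only so the sketch runs anywhere:

```python
import multiprocessing as mp
import os

import pandas as pd

def convert_file_to_pickled_df(path, fname):
    df = pd.read_excel(path + fname)
    # do some other things
    return df

if __name__ == "__main__":
    path = 'D:/'
    filenames = ['2017_1.xlsx', '2017_2.xlsx', '2017_3.xlsx', '2017_4.xlsx']
    # Existence check added only so this sketch runs anywhere; drop it in real use.
    jobs = [(path, f) for f in filenames if os.path.exists(path + f)]
    with mp.Pool(mp.cpu_count()) as pool:
        # starmap submits every (path, fname) tuple to the pool at once and
        # blocks until all calls finish, returning results in input order.
        results = pool.starmap(convert_file_to_pickled_df, jobs)
```

`pool.map` works equally well with a single-argument worker, and `apply_async` followed by a later `get()` is the non-blocking cousin of `apply`.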

DDR
  • I can highly recommend pandarallel (`pip install pandarallel`); I don't believe it can parallelize the file read itself, but it can do so for most everything else (cf. the `# do some other things` in your example), e.g. instead of `df.apply(func)` use `df.parallel_apply(func)` – it's that easy! – jeremy_rutman Dec 04 '19 at 12:27
  • I assume it's the xlrd engine, which can't be parallelized. – DDR Dec 04 '19 at 12:28
  • 1
    https://stackoverflow.com/a/53445829/9375102 – Umar.H Dec 04 '19 at 12:36
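As the pandarallel comment notes, parallelizing the per-file read itself still needs process-level workers. A hedged alternative sketch using the standard library's `concurrent.futures.ProcessPoolExecutor` (same assumed path and filenames as the question; the existence filter only keeps the sketch runnable without the actual files):

```python
import concurrent.futures
import os

import pandas as pd

def convert_file_to_pickled_df(path, fname):
    df = pd.read_excel(path + fname)
    # do some other things
    return df

if __name__ == "__main__":
    path = 'D:/'
    filenames = ['2017_1.xlsx', '2017_2.xlsx', '2017_3.xlsx', '2017_4.xlsx']
    existing = [f for f in filenames if os.path.exists(path + f)]  # sketch-only guard
    with concurrent.futures.ProcessPoolExecutor() as ex:
        # submit() returns a Future immediately, so all reads start
        # concurrently across worker processes; result() collects them in order.
        futures = [ex.submit(convert_file_to_pickled_df, path, f) for f in existing]
        results = [fut.result() for fut in futures]
```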

0 Answers