I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
- folder/file001.csv
- folder/file002.csv
- ...
- folder/file100.csv
They are disjoint with respect to the key I want to group by: if a set of keys appears in file001.csv, none of those keys appears in any other file.
On the one hand, I can just run:
```python
import dask.dataframe as dd

df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler="processes")
```
But I'm wondering if there is a better/smarter way to do this, in a sort of delayed-groupby fashion.
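For instance, since the keys never span files, I imagine something along these lines could work, where each file is grouped on its own with `dask.delayed` (this is only a sketch of what I have in mind; `f` is my group function from above):

```python
import glob

import dask
import pandas as pd


@dask.delayed
def process_file(path):
    # Each file contains complete groups, so a plain pandas groupby
    # per file is valid.
    df = pd.read_csv(path)
    return df.groupby("key").apply(f)


# Group every file independently and concatenate the per-file results.
parts = [process_file(path) for path in glob.glob("folder/*.csv")]
combined = pd.concat(dask.compute(*parts))
```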
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For a single file I can use this hacky approach:
```python
import multiprocessing as mp

import numpy as np
import pandas as pd

cores = mp.cpu_count()  # number of CPU cores on the node
partitions = cores      # define as many partitions as you want


def parallelize(data, func):
    # Split the frame into chunks and apply func to each chunk in a worker process.
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data


data = parallelize(data, f)
```
And, again, I'm not sure whether there is an efficient dask way to do this.
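For what it's worth, my best guess at a dask version of that helper would be to turn the in-memory frame into a partitioned dask dataframe and map the function over the partitions (assuming, as in the pool version, that `f` can be applied to an arbitrary chunk of rows; `parallelize_dask` is just a name I made up):

```python
import multiprocessing as mp

import dask.dataframe as dd
import pandas as pd

cores = mp.cpu_count()


def parallelize_dask(data: pd.DataFrame, func) -> pd.DataFrame:
    # One partition per core; apply func to each partition with the
    # multiprocessing scheduler.
    ddf = dd.from_pandas(data, npartitions=cores)
    return ddf.map_partitions(func).compute(scheduler="processes")
```

If `f` actually needs whole groups, I suppose one would first `set_index("key")` on the dask dataframe so that all rows with the same key land in the same partition, but I don't know if that is the idiomatic way.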