
I'm looking for the best way to parallelize the following problem on a cluster. I have several files:

  • folder/file001.csv
  • folder/file002.csv
  • :
  • folder/file100.csv

They are disjoint with respect to the key I want to use for the groupby; that is, if a set of keys appears in file001.csv, none of those keys appears in any other file.

On one hand, I can just run

import dask.dataframe as dd

df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')

But I'm wondering if there is a better/smarter way to do this, in a sort of delayed-groupby fashion.

Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For every single file I can use this hacky approach:

import multiprocessing as mp

import numpy as np
import pandas as pd

cores = mp.cpu_count()  # number of CPU cores on the node
partitions = cores      # define as many partitions as you want

def parallelize(data, func):
    # split the rows into `partitions` chunks and process them in parallel
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, f)

And, again, I'm not sure if there is an efficient Dask way to do this.
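
To make the "delayed groupby" idea concrete, this is roughly what I have in mind (just a sketch with `dask.delayed`, assuming the same `f` as above; I haven't benchmarked it):

import glob

import dask
import pandas as pd

@dask.delayed
def groupby_one_file(path):
    # the files are disjoint on "key", so the whole groupby-apply
    # can happen within a single file, on a single worker
    df = pd.read_csv(path)
    return df.groupby("key").apply(f)

parts = [groupby_one_file(p) for p in sorted(glob.glob("folder/*.csv"))]
result = pd.concat(dask.compute(*parts))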

rpanai

1 Answer


You could use a Client (it will run in multi-process mode by default) and read your data with a certain blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate the optimal blocksize.
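
For example, a rough sketch of that calculation could look like this (the "two partitions per core" heuristic and the 1 MiB floor are my own assumptions, not from the docs):

import glob
import os

import dask.dataframe as dd
from distributed import Client

client = Client('cluster_scheduler_path')
total_cores = sum(client.ncores().values())  # ncores() returns {worker_address: cores}

total_bytes = sum(os.path.getsize(p) for p in glob.glob("folder/*.csv"))
blocksize = max(total_bytes // (2 * total_cores), 2**20)  # ~2 partitions per core, at least 1 MiB

ddf = dd.read_csv("folder/*.csv", blocksize=blocksize)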

However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores."

So I think the best way to do it is simply:

import dask.dataframe as dd
from distributed import Client

# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")

EDIT: after that, use map_partitions and do the groupby for each partition:

# Note ddf is a dask dataframe and df is a pandas dataframe 
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)

Don't use compute, because it will collect the result into a single pandas.DataFrame; instead use a Dask output method, to keep the entire process parallel and larger-than-RAM compatible.
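
For example (just a sketch; the output path is a placeholder):

# writes one parquet file per partition, without ever collecting
# everything into a single pandas DataFrame
new_ddf.to_parquet("folder_out/")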

moshevi
  • Thank you for your answer, but my question is pointing in another direction. In my situation the problem should be embarrassingly parallel, and using `dask.dataframe` I guess I would get unneeded shuffling. – rpanai Jul 19 '18 at 15:49
  • 1
    you could set the index as the `Key` column, then repartition it according to it values (with the `divisions` kwarg) and then use `map_partitions`. thus achieving complete parallel computing. – moshevi Jul 21 '18 at 09:08
  • I'll try but I have the feeling that the `set_index` is going to be expensive. – rpanai Jul 23 '18 at 13:35
  • It will; I had a similar situation and that's the best I could think of. However, before you do so I suggest reading [this](https://dask.pydata.org/en/latest/dataframe-performance.html). – moshevi Jul 23 '18 at 18:29
  • I gave it a try and it is very expensive. And my data is already nicely partitioned. This is why I was thinking to use `distributed` for every parquet file as in [this](https://github.com/TomAugspurger/dask-tutorial-pycon-2018/blob/master/01-dask.delayed.ipynb), section "Sequential code: Mean Departure Delay Per Airport". – rpanai Jul 23 '18 at 18:34
  • But you said the same `Key` can be in different `parquet` files. A simple `dask` `groupby` is parallel, but it results in a single partition in the end (hence the slowness), because it has no way of knowing in which partition the values belonging to each "group" reside. If you create your own map-reduce process using `distributed` you'll basically be doing the same thing as a `dask` `groupby`. – moshevi Jul 23 '18 at 18:46
  • I said that the files are disjoint with respect to the `key`, so no `key` appears in different files. – rpanai Jul 23 '18 at 18:49
  • 1
    ohh sorry, i misunderstood you.will edit in the answer . – moshevi Jul 23 '18 at 18:52
  • I'll try your edit soon. I didn't know about the trick with the output method. – rpanai Jul 24 '18 at 12:51
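
For reference, the set_index + divisions + map_partitions approach suggested in the comments might look roughly like this (a sketch only; the division boundaries below are placeholders, not values from the real data):

import dask.dataframe as dd

ddf = dd.read_csv("folder/*.csv")

# divisions must be the sorted boundaries of the key values, one more entry
# than the number of partitions; the strings below are placeholders
ddf = ddf.set_index("key", divisions=["a", "g", "m", "t", "z"])

# after set_index, "key" is the index of every pandas partition
new_ddf = ddf.map_partitions(lambda df: df.groupby(level=0).apply(f), meta=meta)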