
I want something like data locality on Hadoop, but without using HDFS.

I have 3 dask-workers.

I want to compute over a big CSV file, for example mydata.csv.

I split mydata.csv into small files (mydata_part_001.csv ... mydata_part_100.csv) and store them in the local folder /data on each worker, e.g.

worker-01 stores mydata_part_001.csv - mydata_part_030.csv in its local folder /data

worker-02 stores mydata_part_031.csv - mydata_part_060.csv in its local folder /data

worker-03 stores mydata_part_061.csv - mydata_part_100.csv in its local folder /data

How can I use Dask to compute over mydata? Thanks.

1 Answer


It is more common to use some sort of globally accessible file system. HDFS is one example of this, but several other Network File Systems (NFSs) exist. I recommend looking into these instead of managing your data yourself in this way.
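For context, if /data were a shared mount visible at the same path from every worker, reading the whole dataset would collapse to a single call (the scheduler address and paths below are assumptions):

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")        # hypothetical scheduler address
    df = dd.read_csv("/data/mydata_part_*.csv")    # every worker can see every part file
    print(df.describe().compute())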

However, if you want to do things this way then you are probably looking for Dask's worker resources, which allow you to target specific tasks to specific machines.
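As a rough sketch of the worker-resources approach (the resource tags, scheduler address, and read_local_parts helper are all made up for illustration): start each worker with its own resource tag, e.g. dask-worker tcp://scheduler:8786 --resources "node1=1", then pin one read task per tag so each machine only loads the part files sitting on its own local disk:

    import glob
    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

    def read_local_parts(pattern):
        # Runs on whichever worker the task lands on; reads only the
        # CSV parts that exist in that worker's local /data folder.
        paths = sorted(glob.glob(pattern))
        return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

    # One future per worker, each constrained to the resource tag that
    # worker was started with, so each read runs on the right machine.
    # pure=False keeps the identical calls from collapsing into one task.
    futures = [
        client.submit(read_local_parts, "/data/mydata_part_*.csv",
                      resources={tag: 1}, pure=False)
        for tag in ("node1", "node2", "node3")
    ]

    # Stitch the per-worker pieces back into one Dask dataframe.
    df = dd.from_delayed(futures)
    print(len(df))  # forces a count across all partitions

Note that subsequent operations on df are not guaranteed to stay on the worker that read each partition; the resource constraint only pins the initial read tasks.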

MRocklin