To better understand Dask, I set up a small Dask cluster: two servers with 32 GB RAM each, plus a Mac. All are on the same LAN, and all run identical versions of Python 3.5 and Dask inside a virtual environment. I installed sshfs on both servers to share data between the workers. I was able to start dask-scheduler on 192.168.2.149 and 4 dask-workers on 192.168.2.26.
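For reference, these are roughly the commands I used to bring the cluster up (the exact worker flags may differ between Dask versions):

```shell
# On 192.168.2.149 -- start the scheduler (listens on port 8786 by default)
dask-scheduler

# On 192.168.2.26 -- start 4 worker processes pointing at the scheduler
dask-worker 192.168.2.149:8786 --nprocs 4
```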
What I need help with is a conceptual understanding of the topology, so I can fully benefit from Dask's distributed architecture. I run my experiments on my Mac, which is part of the LAN. I have a 20 GB CSV I need to load into Pandas, so I run my Python code locally. In my code, I set up a Dask client that points at the dask-scheduler:
client = Client('192.168.2.149:8786')
then I try to load the large csv like this:
df = dd.read_csv("exp3_raw_data.csv", sep="\t")
The CSV exists only on my Mac, so the dask-workers know nothing about it. If I move the CSV to the directory shared via sshfs, how would my Mac reference that CSV?
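To make the question concrete, here is a sketch of what I imagine the code would look like. The `/mnt/shared` mount point is hypothetical; it assumes the sshfs share is mounted at the same absolute path on the Mac and on every worker:

```python
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the scheduler on the LAN
client = Client('192.168.2.149:8786')

# Hypothetical mount point: assume the sshfs share is visible at
# /mnt/shared on EVERY machine. read_csv tasks execute on the workers,
# so the path has to resolve on the workers' filesystems, not just on
# the Mac that submits the graph.
df = dd.read_csv('/mnt/shared/exp3_raw_data.csv', sep='\t')
```

Is that the right mental model, i.e. the path is interpreted by each worker locally, so it must be identical everywhere?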
Any help is appreciated.