I have set up Dask and JupyterHub on a Kubernetes cluster using Helm, following the Dask documentation: http://docs.dask.org/en/latest/setup/kubernetes.html.
Everything deployed fine and I can access JupyterLab. I then created a notebook and downloaded a CSV file from a Google Cloud Storage bucket:
from google.cloud import storage

storage_client = storage.Client.from_service_account_json(CREDENTIALS)
bucket = storage_client.get_bucket(BUCKET)
download_blob(bucket, file="test-file", destination_dir="data/")
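For reference, download_blob is just a small helper I wrote around the google-cloud-storage API; a simplified sketch of what it does (the exact blob naming in my real helper may differ):

import os
from google.cloud import storage

def download_blob(bucket, file, destination_dir):
    # Simplified sketch: download "<file>.csv" from the bucket into destination_dir.
    # (Assumes the blob is named "<file>.csv".)
    os.makedirs(destination_dir, exist_ok=True)
    local_path = os.path.join(destination_dir, f"{file}.csv")
    bucket.blob(f"{file}.csv").download_to_filename(local_path)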
Then I read the CSV file into a Dask DataFrame:
import dask.dataframe as dd
df = dd.read_csv("/home/jovyan/data/*.csv")
I initialize a Dask Client so that I can monitor the computation diagnostics:
from dask.distributed import Client
client = Client()
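To watch the diagnostics I open the dashboard URL that the client reports, along these lines:

# The client exposes the URL of its diagnostics dashboard
print(client.dashboard_link)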
So far so good, until I try to interact with the DataFrame. For example, when I call df.head()
I get the error:
[Errno 2] No such file or directory: '/home/jovyan/data/test-file.csv'
Why can't the other workers find the DataFrame? I thought the DataFrame was shared across the memory of all the workers.
Note: at first I called df.head() without having a Dask Client and that worked, but I didn't see any diagnostics, so I added client = Client().