I have set up Dask and JupyterHub on a Kubernetes cluster using Helm, following the Dask documentation: http://docs.dask.org/en/latest/setup/kubernetes.html.

Everything deployed fine and I can access JupyterLab. I then created a notebook and downloaded a CSV file from a Google Cloud Storage bucket:

from google.cloud import storage

storage_client = storage.Client.from_service_account_json(CREDENTIALS)
bucket = storage_client.get_bucket(BUCKET)
download_blob(bucket, file="test-file", destination_dir="data/")

I read in the CSV file:

import dask.dataframe as dd
df = dd.read_csv("/home/jovyan/data/*.csv")

I initialize a Dask Client so that I can monitor the computation:

from dask.distributed import Client, config
client = Client()

So far so good, until I try to interact with the DataFrame. For example, when I call df.head() I get the error:

[Errno 2] No such file or directory: '/home/jovyan/data/test-file.csv'

Why can't the other workers find the file? I thought the DataFrame was shared across the memory of all the workers.

Note: At first I was calling df.head() without a Dask Client, and that worked, but I didn't see any diagnostics, so I added client = Client().

Stanko

1 Answer

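You downloaded the file to the node on which your client is running, but the workers run on other nodes in Kubernetes. They do not have access to that file system and therefore cannot load the file. Note that dd.read_csv is lazy: the data is actually read by the workers when a computation such as df.head() runs. Without a distributed Client, everything ran inside the notebook's own process, which is why it worked before.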
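
One way to confirm this is to ask every worker whether it can see the path. The following is a minimal sketch, assuming `client` is the `dask.distributed.Client` from the question; `Client.run` executes a function on each worker and returns a dict keyed by worker address. The helper names `path_exists` and `check_path_on_workers` are made up for illustration.

```python
import os


def path_exists(path):
    """Return True if `path` exists on the machine that runs this function."""
    return os.path.exists(path)


def check_path_on_workers(client, path):
    """Run path_exists on every worker and return {worker_address: bool}.

    `client` is assumed to be a dask.distributed.Client connected to the cluster.
    """
    return client.run(path_exists, path)


# Usage (needs a running cluster, so it is left commented out):
# print(check_path_on_workers(client, "/home/jovyan/data/test-file.csv"))
```

If the diagnosis is right, the workers should report False for the path while `path_exists` returns True in the notebook itself.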

The simplest solution here is to use Dask's native ability to talk to GCS. You do not need a local copy of your data at all. Install gcsfs, and then try:

df = dd.read_csv("gcs://<BUCKET>/test-file.csv", storage_options={'token': CREDENTIALS})

(or you may wish to distribute credentials to your workers by other more secure means).
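
As a rough sketch of how this could be wrapped up, the helpers below build the `gcs://` URL and pass the token through to gcsfs unchanged. `gcs_url` and `read_gcs_csv` are hypothetical names, not part of Dask or gcsfs. The `token="cloud"` value in the usage comment is an assumption that the worker pods run on Google Cloud with access to the metadata server, in which case no key file has to be shipped to them.

```python
def gcs_url(bucket, blob):
    """Build a gcs:// URL from a bucket name and a blob path or glob."""
    return f"gcs://{bucket}/{blob.lstrip('/')}"


def read_gcs_csv(bucket, blob, token):
    """Lazily read CSV data from GCS into a Dask DataFrame.

    `token` is handed to gcsfs as-is via storage_options.
    Requires dask and gcsfs to be installed on the client and the workers.
    """
    import dask.dataframe as dd  # imported here so gcs_url works without dask

    return dd.read_csv(gcs_url(bucket, blob), storage_options={"token": token})


# Usage (needs dask, gcsfs and valid credentials, so it is left commented out):
# df = read_gcs_csv(BUCKET, "test-file.csv", token="cloud")
# df.head()
```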

If you do want a local copy of your data (some loaders cannot take advantage of remote file systems, for instance), you would need a file system shared between the client and the workers of your Dask cluster, which would take some Kubernetes-fu to achieve.

Further information: http://docs.dask.org/en/latest/remote-data-services.html

mdurant