I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster.

I originally tried just running Dask with the dataset loaded in my local RAM, but that obviously sends everything over the network, and the cluster only reaches about 2% utilization while performing the task.

Is there a way to put the dataset onto the Kubernetes cluster so I can get 100% CPU utilization?

Brendan Martin

1 Answer

Many people store data on a cloud object store, like Amazon's S3 or Google Cloud Storage.

If you're interested in Dask in particular, these data stores are supported in most of the data ingestion functions by using a protocol prefix in the path, like the following:

import dask.dataframe as dd
df = dd.read_csv('gcs://bucket/2018-*-*.csv')

You will also need the relevant Python library installed to access this cloud storage (gcsfs in this case). See http://dask.pydata.org/en/latest/remote-data-services.html#known-storage-implementations for more information.
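
If the dataset currently lives only on your local machine, you can copy it into a bucket first with gcsfs itself. A minimal sketch, where the project ID my-project, the bucket name my-bucket, and the file names are all placeholders:

import gcsfs

# Connect to GCS ('my-project' is a placeholder project ID)
fs = gcsfs.GCSFileSystem(project='my-project')

# Upload one local CSV into the bucket (both paths are placeholders)
fs.put('2018-01-01.csv', 'my-bucket/2018-01-01.csv')

# Check that the file is now visible in the bucket
fs.ls('my-bucket')

After that, the dd.read_csv call above will match the uploaded files directly from the bucket.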

MRocklin
  • thanks. So if I use that method in my local Jupyter notebook, will it load on the cluster instead of going through my computer first? – Brendan Martin Apr 05 '18 at 18:02
  • Loading of data will happen in the workers of the cluster. The local client will need to access the data store too, to find the number of files and load any metadata, typically a much smaller amount of bandwidth (see the persist sketch after these comments). – mdurant Apr 05 '18 at 18:17
  • @mdurant @mrocklin So I've got Dask reading my bucket dataset, but when I go to run it on the cluster I'm getting this: `/opt/conda/envs/dask/lib/python3.6/site-packages/distributed/protocol/pickle.py in loads() ModuleNotFoundError: No module named 'gcsfs' ` – Brendan Martin Apr 05 '18 at 22:40
  • I recommend running `pip install gcsfs` or `conda install gcsfs` wherever your workers run. If using dask-kubernetes then see the use of the `EXTRA_PIP_PACKAGES` environment variable in the [quickstart](http://dask-kubernetes.readthedocs.io/en/latest/#quickstart); a sketch follows below. – MRocklin Apr 06 '18 at 01:23
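
For the `ModuleNotFoundError` in the comments: a minimal sketch of installing gcsfs on dask-kubernetes workers via `EXTRA_PIP_PACKAGES`, assuming the `make_pod_spec` helper and the `daskdev/dask` image; the worker count is illustrative:

from dask_kubernetes import KubeCluster, make_pod_spec

# The daskdev/dask image pip-installs anything listed in
# EXTRA_PIP_PACKAGES when the worker container starts
pod_spec = make_pod_spec(
    image='daskdev/dask:latest',
    env={'EXTRA_PIP_PACKAGES': 'gcsfs'},
)

cluster = KubeCluster(pod_spec)
cluster.scale(10)  # illustrative worker count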
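
And to address the original utilization question: once the workers can read the bucket, you can pin the dataset in cluster memory with `persist` so repeated computations don't re-read from GCS. A sketch, assuming a hypothetical scheduler address and the placeholder bucket from above:

from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-address:8786')  # hypothetical address

# The workers read the CSVs; only metadata passes through the local client
df = dd.read_csv('gcs://my-bucket/2018-*-*.csv')

# Keep the partitions in distributed memory for subsequent work
df = client.persist(df)

df.describe().compute()  # now runs entirely on the cluster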