
Noob here. I want a Dask install with a worker pool that can grow and shrink based on current demand. I followed the instructions in Zero to JupyterHub to install on GKE, and then went through the install instructions for dask-kubernetes: https://kubernetes.dask.org/en/latest/.

I originally ran into some permissions issues, so I created a service account with all permissions and changed my config.yaml to use this service account. That got rid of the permissions issues, but now when I run this script, with the default worker-spec.yml, I get no workers:

from dask_kubernetes import KubeCluster
import distributed

cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale_up(4)  # request 4 workers explicitly

client = distributed.Client(cluster)
client
Cluster

    Workers: 0
    Cores: 0
    Memory: 0 B
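
For what it's worth, the grow-and-shrink behaviour I'm ultimately after is adaptive scaling rather than a fixed scale_up(4); a minimal sketch of that, assuming the same worker-spec.yml (the bounds are just examples):

from dask_kubernetes import KubeCluster
import distributed

# Same setup as above, but let the scheduler grow and shrink the worker
# count between the bounds instead of pinning it at 4.
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.adapt(minimum=0, maximum=4)  # bounds are illustrative

client = distributed.Client(cluster)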

When I list my pods, I see a lot of workers in the pending state:

patrick_mineault@cloudshell:~ (neuron-264716)$ kubectl get pod --namespace jhub                                                                                                                   
NAME                          READY   STATUS    RESTARTS   AGE
dask-jovyan-24034fcc-22qw7w   0/1     Pending   0          45m
dask-jovyan-24034fcc-25h89q   0/1     Pending   0          45m
dask-jovyan-24034fcc-2bpt25   0/1     Pending   0          45m
dask-jovyan-24034fcc-2dthg6   0/1     Pending   0          45m
dask-jovyan-25b11132-52rn6k   0/1     Pending   0          26m
...

And when I describe each pod, I see an insufficient CPU/memory scheduling error:

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  69s (x22 over 30m)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
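
One way to confirm the mismatch is to compare what the node can actually allocate with what the worker pods request, e.g. (the node name is whatever kubectl get nodes reports):

kubectl get nodes
# Show what is already requested on the node versus its capacity
kubectl describe node <node-name> | grep -A 8 'Allocated resources'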

Do I need to manually create a new autoscaling pool in GKE or something? I only have one pool now, the one that runs JupyterLab, and that pool is already fully committed. I can't figure out which piece of configuration tells Dask which pool to put the workers in.

Patrick Mineault
  • It sounds like there aren't enough free resources in your autoscaling pool to allow your workers to be scheduled. The default `worker-spec.yml` file requests two vCPUs and 6 GB of RAM per worker. If that isn't available on any machine in your pool, and your pool is already scaled up to its maximum, then you'll get the errors above (see the trimmed-down spec sketched after these comments). – Jacob Tomlinson Jan 16 '20 at 15:09
  • Thank you - indeed, I've found that there were no nodes that could fit these big workers - I had to create a pool for that. – Patrick Mineault Jan 16 '20 at 16:49
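
For reference, a trimmed-down worker spec along these lines would let the workers fit on smaller nodes (the image, args, and numbers below are illustrative, not the package defaults):

kind: Pod
metadata:
  labels:
    app: dask-worker
spec:
  restartPolicy: Never
  containers:
  - name: dask-worker
    image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    # Keep the worker's own memory limit in line with the pod request below.
    args: [dask-worker, --nthreads, '1', --memory-limit, 2GB, --death-timeout, '60']
    resources:
      requests:
        cpu: "1"
        memory: 2G
      limits:
        cpu: "1"
        memory: 2G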

1 Answer


I did indeed need to create a separate, scalable worker pool to host the workers - there's an example of this in the Pangeo setup guide: https://github.com/pangeo-data/pangeo/blob/master/gce/setup-guide/1_create_cluster.sh. This is the relevant line:

gcloud container node-pools create worker-pool --zone=$ZONE --cluster=$CLUSTER_NAME \
    --machine-type=$WORKER_MACHINE_TYPE --preemptible --num-nodes=$MIN_WORKER_NODES
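
If the new pool should itself grow and shrink with demand, GKE node-pool autoscaling can be switched on at creation time; a sketch with illustrative bounds:

# Illustrative: same command as above, plus the GKE cluster autoscaler on the new pool
gcloud container node-pools create worker-pool --zone=$ZONE --cluster=$CLUSTER_NAME \
    --machine-type=$WORKER_MACHINE_TYPE --preemptible --num-nodes=$MIN_WORKER_NODES \
    --enable-autoscaling --min-nodes=0 --max-nodes=10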
Patrick Mineault