I have deployed a Kubernetes cluster on GCP running a combination of Prefect and Dask. The jobs run fine under normal load, but the pipeline fails to scale to 2x the data volume. So far, I have narrowed it down to the scheduler being shut down due to high memory usage: as soon as the scheduler's memory usage touches 2 GB, the jobs fail with a "no heartbeat detected" error.

[screenshot: Dask scheduler memory usage]
Worker memory and CPU are set in a separate build Python file, using the dask-gateway package: we fetch the Gateway's cluster options and set the worker memory there.
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 32  # per-worker memory (units as exposed by the gateway's options)
options.worker_cores = 10
cluster = gateway.new_cluster(options)
cluster.adapt(minimum=4, maximum=20)
I am unable to figure out where and how I can increase the memory allocation for the dask-scheduler.
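From reading the dask-gateway docs, scheduler resources appear to be fixed server-side by the gateway administrator rather than through the client's cluster options. If the gateway uses the Kubernetes backend, I suspect something like the following in the gateway server's config would raise the scheduler's memory, but I have not been able to confirm this (the `c.KubeClusterConfig` traits and the values shown are my assumptions, not tested):

# dask_gateway_config.py on the gateway server (assumption: Kubernetes backend)
c.KubeClusterConfig.scheduler_memory = "16 G"        # memory request for the scheduler pod
c.KubeClusterConfig.scheduler_memory_limit = "16 G"  # hard memory limit for the scheduler pod
c.KubeClusterConfig.scheduler_cores = 4              # CPU request for the scheduler pod

Is this the right place to set it, or can scheduler memory be exposed to clients as a cluster option?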
Specs:
Cluster version: 1.19.14-gke.1900
Machine type: n1-highmem-64
Autoscaling: 6 to 1000 nodes per zone
All nodes are allocated 63.77 vCPU and 423.26 GB memory