I have deployed a Kubernetes cluster on GCP running a combination of Prefect and Dask. The jobs run fine under normal load, but the pipeline fails to scale to 2x the data volume. So far, I have narrowed it down to the scheduler being shut down due to high memory usage: as soon as the scheduler's memory usage touches 2 GB, the jobs fail with a "no heartbeat detected" error.

[screenshot: Dask scheduler memory usage]
Worker memory and CPU are set in a separate build Python file, using the dask-gateway package: we fetch the Gateway's cluster options and set the worker memory there.
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 32  # per-worker memory (units as exposed by the gateway's options)
options.worker_cores = 10
cluster = gateway.new_cluster(options)
cluster.adapt(minimum=4, maximum=20)
I am unable to figure out where and how I can increase the memory allocation for the dask-scheduler.
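From reading the dask-gateway docs, scheduler resources appear to be fixed server-side by the gateway administrator rather than through the client's cluster options. If the gateway uses the Kubernetes backend, I suspect something like the following in the gateway server's config would raise the scheduler's memory, but I have not been able to confirm this (the `c.KubeClusterConfig` traits and the values shown are my assumptions, not tested):

# dask_gateway_config.py on the gateway server (assumption: Kubernetes backend)
c.KubeClusterConfig.scheduler_memory = "16 G"        # memory request for the scheduler pod
c.KubeClusterConfig.scheduler_memory_limit = "16 G"  # hard memory limit for the scheduler pod
c.KubeClusterConfig.scheduler_cores = 4              # CPU request for the scheduler pod

Is this the right place to set it, or can scheduler memory be exposed to clients as a cluster option?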
Specs:
Cluster version: 1.19.14-gke.1900
Machine type: n1-highmem-64
Autoscaling: 6 to 1000 nodes per zone
All nodes are allocated 63.77 vCPU and 423.26 GB memory