
I have a long-running task that I submit to a Dask cluster (the worker runs 1 process and 1 thread), and I use tracemalloc to track memory usage. The task can run long enough that memory usage builds up, which has caused all sorts of problems. Here is the structure of how I use tracemalloc:

import tracemalloc

def task():
    tracemalloc.start()
    ...  # initial setup work
    snapshot1 = tracemalloc.take_snapshot()
    for i in range(10):
        ...  # one chunk of the long-running work
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, "lineno")
        print("[ Top 6 differences ]")
        for stat in top_stats[:6]:
            print(stat)

I get the following output (cleaned up a tad), which shows that the profiler in Dask Distributed is accruing memory. This was taken after the second iteration, and these numbers grow linearly with each iteration.

[ Top 6 differences ]
/usr/local/lib/python3.8/site-packages/distributed/profile.py:112:
    size=137 MiB (+113 MiB), count=1344168 (+1108779), average=107 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:68:
    size=135 MiB (+110 MiB), count=1329005 (+1095393), average=106 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:48:
    size=93.7 MiB (+78.6 MiB), count=787568 (+655590), average=125 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:118:
    size=82.3 MiB (+66.5 MiB), count=513462 (+414447), average=168 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:67:
    size=64.4 MiB (+53.1 MiB), count=778747 (+647905), average=87 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:115:
    size=48.1 MiB (+40.0 MiB), count=787415 (+655449), average=64 B

Does anyone know how to clear out the profiler's accumulated data, or disable it entirely (we're not using the dashboard, so we don't need it)?

Alex P
    Hmmm, the overhead from the profiler certainly shouldn't be that high, I've opened an issue to track further: https://github.com/dask/distributed/issues/4091. If you could comment there with a reproducible example showing your issue that'd be quite useful. Thanks! – jiminy_crist Sep 01 '20 at 16:17
  • Hey, thanks. Will see what I can do. – Alex P Sep 01 '20 at 18:08

1 Answer


I set the following environment variables on the worker pods to dramatically reduce how often the profiler runs. It seems to be working.

DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms 
DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms

The defaults can be found here: https://github.com/dask/distributed/blob/master/distributed/distributed.yaml#L74-L76
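
If it's easier to bake into the worker image than environment variables, the same keys can go in a Dask config file instead. A minimal sketch, assuming your image reads YAML from a standard Dask config location such as /etc/dask/ or ~/.config/dask/ (the file name is arbitrary):

# /etc/dask/dask.yaml
distributed:
  worker:
    profile:
      interval: 10000ms   # default is 10ms
      cycle: 1000000ms    # default is 1000ms

Dask merges any YAML files it finds in those directories with its built-in defaults, so only the keys you override need to be listed.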

ETA: @rpanai This is what we have in the K8s manifest for the deployment:

spec:
  template:
    spec:
      containers:
      - env:
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL
          value: 10000ms
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE
          value: 1000000ms
Alex P
  • Do you mind telling how you set these variables within the pod? – rpanai Sep 01 '20 at 22:36
  • Setting `distributed.admin.tick.interval` to a higher number, e.g. `100ms` (it's 20ms by default), can also be helpful (taken from https://github.com/dask/distributed/issues/2156#issuecomment-503735503). – Michał Zawadzki Apr 19 '22 at 21:38