
I have a long-running task that I submit to a Dask cluster (the worker runs 1 process and 1 thread), and I use tracemalloc to track memory usage. The task can run long enough that memory usage builds up, which has caused all sorts of problems. Here is the structure of how I use tracemalloc:

import tracemalloc

def task():
    tracemalloc.start()
    ...  # initial setup work
    snapshot1 = tracemalloc.take_snapshot()
    for i in range(10):
        ...  # one chunk of the long-running work
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, "lineno")
        print("[ Top 6 differences ]")
        for stat in top_stats[:6]:
            print(stat)

I get the following output (cleaned up a tad), which shows that the profiler in Dask Distributed is accruing memory. This was taken after the second iteration, and these numbers grow linearly with each iteration.

[ Top 6 differences ]
/usr/local/lib/python3.8/site-packages/distributed/profile.py:112:
    size=137 MiB (+113 MiB), count=1344168 (+1108779), average=107 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:68:
    size=135 MiB (+110 MiB), count=1329005 (+1095393), average=106 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:48:
    size=93.7 MiB (+78.6 MiB), count=787568 (+655590), average=125 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:118:
    size=82.3 MiB (+66.5 MiB), count=513462 (+414447), average=168 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:67:
    size=64.4 MiB (+53.1 MiB), count=778747 (+647905), average=87 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:115:
    size=48.1 MiB (+40.0 MiB), count=787415 (+655449), average=64 B

Does anyone know how to clear out the profiler's accumulated data, or disable it entirely (we're not using the dashboard, so we don't need it)?

Alex P
    Hmmm, the overhead from the profiler certainly shouldn't be that high, I've opened an issue to track further: https://github.com/dask/distributed/issues/4091. If you could comment there with a reproducible example showing your issue that'd be quite useful. Thanks! – jiminy_crist Sep 01 '20 at 16:17
  • Hey, thanks. Will see what I can do. – Alex P Sep 01 '20 at 18:08

1 Answer


I set the following environment variables on the worker pods to dramatically reduce how often the profiler runs. It seems to be working.

DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms 
DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms

The defaults can be found here: https://github.com/dask/distributed/blob/master/distributed/distributed.yaml#L74-L76
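
If it's easier to bake into the worker image than environment variables, the same keys can go in a Dask config file instead. A minimal sketch, assuming your image reads YAML from a standard Dask config location such as /etc/dask/ or ~/.config/dask/ (the file name is arbitrary):

# /etc/dask/dask.yaml
distributed:
  worker:
    profile:
      interval: 10000ms   # default is 10ms
      cycle: 1000000ms    # default is 1000ms

Dask merges any YAML files it finds in those directories with its built-in defaults, so only the keys you override need to be listed.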

ETA: @rpanai This is what we have in the K8s manifest for the deployment:

spec:
  template:
    spec:
      containers:
      - env:
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL
          value: 10000ms
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE
          value: 1000000ms
Alex P
  • Do you mind telling how you set these variables within the pod? – rpanai Sep 01 '20 at 22:36
  • Setting `distributed.admin.tick.interval` to a higher number, e.g. `100ms` (it's 20ms by default), can also be helpful (taken from https://github.com/dask/distributed/issues/2156#issuecomment-503735503). – Michał Zawadzki Apr 19 '22 at 21:38