
I'm using dask/distributed to submit 100+ evaluations of a function to a multi-node cluster. Each evaluation is quite costly, about 90 s of CPU time. I've noticed that all workers grow in memory over time, which looks like a memory leak, even though the function I'm evaluating is marked as not pure. Here's sample code to reproduce this behavior:

import numpy as np
from dask.distributed import Client

class Foo:
    def __init__(self):
        self.a = np.random.rand(2000, 2000)  # dummy data, not really used

    @staticmethod
    def myfun1(k):
        return np.random.rand(10000 + k, 100)

    def myfun2(self, k):
        return np.random.rand(10000 + k, 100)

client = Client('XXX-YYY:8786')
f = Foo()
tasks = client.map(f.myfun2, range(100), pure=False)
results = client.gather(tasks)
tasks = []

If client.map() is used to execute f.myfun1() (which is just a static method), the workers don't grow in size. However, if one calls f.myfun2(), worker memory grows considerably (e.g. 50 MB -> 400 MB) after just one client.map() call above. Also, client.close() does nothing to reduce worker memory.

Is this a memory leak, or am I not using dask.distributed correctly? I definitely don't care about the results of my calculations being available afterwards or shared on the cluster. FWIW, tested with distributed v1.19.1 and Python 3.5.4.

marioba

1 Answer


Nice example.

Your myfun2 method is bound to your f = Foo() object, which carries around a fairly large attribute (f.a, a 2000x2000 float64 array, roughly 32 MB). The bound method f.myfun2 is thus actually quite expensive to move around, and you're shipping it with each of your 100 tasks. If you can, it's best to avoid using methods of large objects in a distributed setting. Instead, consider using plain functions.
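You can see the cost locally, without a cluster, by comparing what pickle produces for the bound method versus a plain module-level function (a minimal sketch; dask serializes callables in essentially this way):

```python
import pickle

import numpy as np

class Foo:
    def __init__(self):
        self.a = np.random.rand(2000, 2000)  # ~32 MB of float64

    def myfun2(self, k):
        return np.random.rand(10000 + k, 100)

def myfun(k):
    # Module-level function: pickles as a tiny by-reference stub
    return np.random.rand(10000 + k, 100)

f = Foo()
bound = pickle.dumps(f.myfun2)  # pickling the bound method pickles f, including f.a
plain = pickle.dumps(myfun)     # pickling the plain function is just a name lookup

print(len(bound), len(plain))   # bound is tens of MB, plain is a few dozen bytes
```

Every task that references f.myfun2 drags that multi-megabyte payload along with it, which is what you see accumulating on the workers.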

MRocklin
  • Thanks Matthew. The problem here isn't really the cost of moving that data around, but what happens after the calculation, i.e. the observation that every worker's memory footprint increases considerably over time (that's why I asked about a possible memory leak). The only way to fix it has been to run client.restart() periodically, but that resets the entire cluster. So if my goal is just to submit an expensive/long calculation to a worker without needing to share data/results between workers, is there a better way to do this, and also to clear/reset worker state after a single calc? – marioba Nov 01 '17 at 14:09
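One pattern that fits this goal is to keep the task function module-level and pass any large data in explicitly as an argument, so it can be scattered to the cluster once rather than re-serialized with every task (a sketch under that assumption; the scheduler address from the question is a placeholder):

```python
import numpy as np

def myfun(a, k):
    # Plain module-level function: the large array arrives as an argument,
    # so nothing heavy is captured inside the callable itself.
    return np.random.rand(10000 + k, 100)

# Local check of the function, no cluster needed:
a = np.random.rand(2000, 2000)
print(myfun(a, 3).shape)  # (10003, 100)
```

On the cluster you would then do something like a_fut = client.scatter(a, broadcast=True) followed by tasks = client.map(myfun, [a_fut] * 100, range(100), pure=False). Once you drop all references to the result futures (del tasks / del results, or client.cancel(tasks)), the scheduler can release those results from worker memory without a full client.restart().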