
I am trying to load a dataset with dask, but when it is time to compute it I keep getting warnings like this:

WARNING - Worker exceeded 95% memory budget. Restarting.

I am just working on my local machine, initiating dask as follows:

from dask.distributed import Client

import libmarket.config

if __name__ == '__main__':
    libmarket.config.client = Client()  # use dask.distributed by default

Now in my error messages I keep seeing a reference to a 'memory_limit=' keyword parameter. However, I've searched the dask documentation thoroughly and I can't figure out how to increase the bloody worker memory limit in a single-machine configuration. I have 256GB of RAM, and I'm removing the majority of the future's columns (it's a 20GB CSV file) before converting it back into a pandas dataframe, so I know it will fit in memory. I just need to increase the per-worker memory limit from my code (not using dask-worker) so that I can process it.

Please, somebody help me.

Jones
  • did you modify your `~/.config/dask/distributed.yaml`? – gold_cy Dec 26 '18 at 19:23
  • You have no idea how much I love you. I had modified distributed.yaml before but I was doing it in the wrong bloody file! Thank you thank you thank you. – Jones Dec 26 '18 at 19:42
  • no problem, happy computing! – gold_cy Dec 26 '18 at 19:47
  • By the way, I only see options for changing behavior at specific fractions of the memory limit; is there a way to raise the memory limit entirely? (See the config sketch after these comments.) – Jones Dec 26 '18 at 19:51
  • I ended up using: Client(memory_limit='64GB') – Jones Jan 19 '19 at 00:26
  • @Jones - me too. Then what's the relevance of the memory limit, if 64GB is allocated to a single worker? Did you find a way around it? – Devi Prasad Khatua Apr 16 '19 at 13:13
  • I actually ended up switching to dask.delayed, distributing my operations with pandas and then building up the results in a single dataframe, rather than trying to load everything and manipulate it in one large dataframe (roughly as sketched below). So the issue became moot. – Jones Apr 17 '19 at 14:03
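For context on the fraction-based options mentioned in the comments: they live in dask's configuration file. Below is a minimal sketch of the relevant section of ~/.config/dask/distributed.yaml, assuming a reasonably recent dask.distributed; the exact key names and defaults may differ between versions.

distributed:
  worker:
    memory:
      target: 0.60     # fraction of the limit at which to start spilling to disk
      spill: 0.70      # fraction at which to spill more aggressively
      pause: 0.80      # fraction at which the worker pauses accepting new tasks
      terminate: 0.95  # fraction at which the nanny restarts the worker
                       # (the "exceeded 95% memory budget" warning above)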
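And here is a minimal sketch of the dask.delayed approach described in the last comment; the file names and column names are hypothetical.

# sketch of the dask.delayed approach from the comment above;
# file and column names are hypothetical
import pandas as pd
import dask
from dask.distributed import Client

@dask.delayed
def load_part(path, usecols):
    # each task reads one piece of the data and drops the unneeded
    # columns immediately, so no worker holds the full 20GB at once
    return pd.read_csv(path, usecols=usecols)

if __name__ == '__main__':
    client = Client(n_workers=4, threads_per_worker=1, memory_limit='64GB')
    paths = ['market_part1.csv', 'market_part2.csv']  # hypothetical inputs
    parts = [load_part(p, usecols=['timestamp', 'price']) for p in paths]
    # run the tasks on the cluster, then build up the results
    # in a single pandas dataframe
    df = pd.concat(dask.compute(*parts), ignore_index=True)
    client.close()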

1 Answer


The argument memory_limit can be provided to the __init__() methods of both Client and LocalCluster.

general remarks

Just calling Client() is a shortcut for first calling LocalCluster() and then calling Client with the created cluster (see Dask: Single Machine). When Client is called without an instance of LocalCluster, all possible arguments of LocalCluster.__init__() can be passed to the initialization call of Client. This is why arguments such as memory_limit and n_workers are not documented in the API documentation of the Client class.
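A quick way to see this for yourself (the printed class path may vary by version):

from dask.distributed import Client

client = Client()            # implicitly creates a LocalCluster()
print(type(client.cluster))  # e.g. <class 'distributed.deploy.local.LocalCluster'>
client.close()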

However, the argument memory_limit does not seem to be properly documented in the API documentation of LocalCluster either (see Dask GitHub Issue #4118).

solution

A working example is given below. I added a few more arguments that might be useful for people finding this question/answer.

# load/import classes
from dask.distributed import Client, LocalCluster

# set up a cluster with 4 workers; each worker uses 1 thread and has a
# 64GB memory limit (the limit is per worker, so 4 x 64GB = 256GB in total)
cluster = LocalCluster(n_workers=4,
                       threads_per_worker=1,
                       memory_limit='64GB')
client = Client(cluster)

# have a look at your workers (displays a summary in a Jupyter
# notebook or IPython session)
client

# do some work
## ... 

# close workers and cluster
client.close()
cluster.close()

The shortcut would be

# load/import classes
from dask.distributed import Client

# same setup as above: n_workers, threads_per_worker and memory_limit
# are passed straight through to the implicitly created LocalCluster
client = Client(n_workers=4, 
                threads_per_worker=1,
                memory_limit='64GB')

# have a look at your workers
client

# do some work
## ... 

# close workers and cluster
client.close()
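Note that memory_limit is per worker, not for the whole cluster, which is presumably what @DeviPrasadKhatua is asking about in the comments: with n_workers=4 and memory_limit='64GB', the cluster may use up to 256GB in total. As far as I know, passing memory_limit=0 should disable the limit entirely, though I have not verified this across versions.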

further reading

  • Dask documentation: Single Machine (dask.distributed)
  • Dask GitHub Issue #4118 (documentation of memory_limit)

daniel.heydebreck