9
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 3.91 GB -- Worker memory limit: 2.00 GB
distributed.worker - WARNING - Worker is at 41% memory usage. Resuming worker. Process memory: 825.12 MB -- Worker memory limit: 2.00 GB

The above error appears when I try to run a piece of code that applies an algorithm to a dataset that I have. Having read through the documentation at https://distributed.dask.org/en/latest/worker.html, it's still not clear to me what the impact of this error will be on the results of this application. Does this just affect the speed or efficiency of this code, or will it impact my results?

AHassett
  • 91
  • 2
  • 3

2 Answers

8

That warning is saying that your process is taking up much more memory than the limit you have told Dask is OK. In this situation Dask may pause execution or even start restarting your workers.

The warning also says that Dask itself isn't holding on to any data, so there isn't much that it can do to help the situation (like removing its data). My guess is that some of the libraries that you are using are taking up a lot of memory. You might want to use Dask workers that have more than 2GB of memory.
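As a rough sketch of that last suggestion (assuming a LocalCluster; the 4 GB figure is only an example), the per-worker memory limit can be raised when the workers are started:

from dask.distributed import Client, LocalCluster

# Example only: start local workers with a larger memory budget so the
# 2 GB limit shown in the warning is no longer the ceiling.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

# For workers launched from the command line, the equivalent option is:
# dask-worker <scheduler-address> --memory-limit 4GB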

MRocklin
  • 55,641
  • 23
  • 163
  • 235
  • So if I can increase the memory given to the worker, then I shouldn't have this problem? Moreover, does this just affect the speed at which the code runs, rather than how it processes the data? – AHassett Feb 22 '20 at 17:10
  • Increasing memory is a good place to start. 2GB is small for the Python data science stack, which tends to take up around 500MB in memory just for the code. I'm unable to guarantee anything for your particular situation though. – MRocklin Feb 22 '20 at 18:47
  • The wording is confusing to me personally - "Memory use is high" implies a lot of data, and "worker has no data" implies very little data. Should the 'but' be an 'and therefore...'? I'm not being picky; I honestly don't know what is happening in this situation. – Michael Tuchman May 06 '20 at 19:41
  • I also don't know. Dask isn't the only thing running in your process. Maybe something else is taking up or leaking memory. – MRocklin May 23 '20 at 18:59
1

This was the case for one of my projects: the output of map_partitions was itself a large pandas DataFrame. Because that output is stored in memory, you get this warning.

Consider the example below:

def Processor(df):
    gf = ...  # placeholder: some large intermediate pandas DataFrame built from the partition
    # Return only a small summary (the row count) instead of gf itself,
    # so the large intermediate is not kept in worker memory.
    return len(gf)

out = dask_df.map_partitions(Processor).compute(scheduler='processes')
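For contrast, a hypothetical version that returns the large per-partition result itself (rather than a small summary) keeps all of that output in memory and is the pattern that tends to trigger the warning:

def ProcessorFull(df):
    gf = ...  # same large intermediate as above
    # Returning the large DataFrame itself means every partition's output
    # has to be held in memory when the results are gathered.
    return gf

out_full = dask_df.map_partitions(ProcessorFull).compute(scheduler='processes')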
Sadegh Karimi
  • 21
  • 1
  • 6
  • @bhakti123 When you are working with Dask, try not to run a function that needs all the data on every partition – Sadegh Karimi Aug 15 '21 at 14:38
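To illustrate that comment, here is a minimal sketch (the data and column name are hypothetical): prefer Dask's built-in reductions, which combine small per-partition results, over forcing all of the data into one place so a single function can see it.

import pandas as pd
import dask.dataframe as dd

# Hypothetical example data.
ddf = dd.from_pandas(pd.DataFrame({"x": range(1_000_000)}), npartitions=8)

# Preferred: a built-in reduction works partition by partition and only
# combines small intermediate results, keeping per-worker memory low.
total = ddf["x"].sum().compute()

# Avoid: collapsing everything into one partition just so a function can
# operate on all the data at once.
# all_in_one = ddf.repartition(npartitions=1).map_partitions(lambda df: df.describe())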