
I'm trying to execute a custom Dask graph on a distributed system, but it doesn't seem to release the memory of finished tasks. Am I doing something wrong?

I've tried changing the number of processes and using a local cluster, but it doesn't seem to make a difference.

from dask.distributed import Client
import pandas as pd

client = Client()

def get_head(df):
    return df.head()

# 50 independent tasks that each load the large CSV
process_big_file_tasks = {
    f'process-big-file-{i}': (pd.read_csv, '/home/ubuntu/huge_file.csv')
    for i in range(50)
}

# For each loaded DataFrame, keep only its head
return_fragment_tasks = {
    f'return-fragment-{i}': (get_head, previous_task)
    for i, previous_task in enumerate(process_big_file_tasks)
}

dsk = {
    **process_big_file_tasks,
    **return_fragment_tasks,
    'concat': (pd.concat, list(return_fragment_tasks)),
}

client.get(dsk, 'concat')

Since the tasks are independent of each other (or at least the ones that consume the most memory), the memory of each one should be released as soon as it finishes.
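
For reference, here is a minimal sketch of how the worker-side state can be inspected after the call, assuming the client defined above; the psutil-based RSS check is only one illustrative way to measure worker memory:

import psutil

# Which keys does each worker still hold? An empty mapping would mean
# the intermediate results have been released.
print(client.has_what())

# Resident memory of each worker process, in bytes (illustrative check;
# any process-level measurement would do).
def worker_rss():
    return psutil.Process().memory_info().rss

print(client.run(worker_rss))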

1 Answer


How do you determine that it isn't releasing memory? I recommend looking at Dask's dashboard to see the structure of the computation, including what has been released and what has not. This YouTube video may be helpful:

https://www.youtube.com/watch?v=N_GqzcuGLCY

In particular, I encourage you to watch the Graph tab while running your computation.
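
If you want to check from code as well, something like this small sketch may help (assuming the client object from the question):

# Open this address in a browser and watch the Graph tab while computing.
print(client.dashboard_link)

# who_has() maps every key still held in distributed memory to the workers
# storing it; keys of released tasks no longer appear in this mapping.
print(client.who_has())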

MRocklin