
I have a small dataframe (about 100 MB) and an expensive computation that I want to perform for each row. The computation is not vectorizable; it requires some parsing and a DB lookup for each row.

As such, I have decided to try Dask to parallelize the task. The task is "embarrassingly parallel", and neither order of execution nor repeated execution is an issue. However, for some unknown reason, memory usage blows up to about 100 GB.

Here is the offending code sample:

import pandas as pd
import numpy as np
import dask.dataframe as dd

from dask.distributed import Client
from dask_jobqueue import LSFCluster

cluster = LSFCluster(memory="6GB", cores=1, project='gRNA Library Design')
cluster.scale(jobs=16)
client = Client(cluster)

required_dict = load_big_dict()  # large lookup dictionary, ~1.5GB in total
score_guide = lambda row: expensive_computation(required_dict, row)

library_df = pd.read_csv(args.library_csv)

# expected output dtypes: the original columns plus the new result columns
meta = library_df.dtypes
meta = meta.append(pd.Series({
    'specificity': np.dtype('int64'),
    'cutting_efficiency': np.dtype('int64'),
    '0 Off-targets': np.dtype('object'),
    '1 Off-targets': np.dtype('object'),
    '2 Off-targets': np.dtype('object'),
    '3 Off-targets': np.dtype('object')}))

library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.apply(score_guide, axis=1, meta=meta)
library_ddf = library_ddf.compute()
library_ddf = library_ddf.drop_duplicates()
library_ddf.to_csv(args.outfile, index=False)

My guess is that the big dictionary required for the lookup is somehow the issue, but it is only ~1.5 GB in total and is not included in the resulting dataframe.

Why might Dask be blowing up memory usage?

schmidt73

2 Answers


Not 100% sure this will resolve it in this case, but you can try to futurize the dictionary:

# broadcasting makes sure that every worker has a copy
[fut_dict] = client.scatter([required_dict], broadcast=True)
score_guide = lambda row: expensive_computation(fut_dict, row)

What this does is put a copy of the dict on every worker and store a reference to the object in fut_dict, obviating the need to hash the large dict on every call to the function:

Every time you pass a concrete result (anything that isn’t delayed) Dask will hash it by default to give it a name. This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again. Instead, it is better to delay your data as well.

Note that this will eat away a part of each worker's memory (e.g. given your information, each worker will have 1.5GB allocated for the dict). You can read more in this Q&A.
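
For reference, a minimal sketch of how the scattered future could be wired into the pipeline from the question (this assumes the same expensive_computation, library_df and meta as above, and that the distributed scheduler resolves a scattered future passed to map_partitions, which is the usual behaviour). Passing the future as an explicit argument, instead of capturing it in the lambda's closure, lets each task receive the worker-local copy:

# broadcast one copy of the dict to every worker
[fut_dict] = client.scatter([required_dict], broadcast=True)

def score_partition(df, lookup):
    # 'lookup' arrives as the already-materialized dict on the worker
    return df.apply(lambda row: expensive_computation(lookup, row), axis=1)

library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.map_partitions(score_partition, fut_dict, meta=meta)
result = library_ddf.compute()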

SultanOrazbayev
    I tried something similar to this but failed to get this working (probably an error on my part). You have correctly identified the problem: the large dictionary as a global variable. I implemented a simpler solution because of a deadline. I will update with my solution and look into others tomorrow. Thank you for the help. – schmidt73 Mar 26 '21 at 06:02
  • I faced a similar issue a few weeks ago. Somehow, using data as a global variable overloaded the scheduler instead of the individual workers. For example, at one point I was loading data in the constructor of a class, which was a very bad idea :) You should look in that direction – Mike Mar 26 '21 at 08:01

The problem is that required_dict has to be serialized and sent to every worker. Because required_dict is large and is shipped repeatedly to the many workers that need it, the repeated serializations cause a massive memory blow-up.

There are many fixes; for me the easiest was to load the dictionary inside the worker tasks and use map_partitions explicitly instead of apply.

Here is the solution in code:

    def do_df(df):
        # load the dictionary inside the task, so it never has to be
        # serialized and shipped from the client to the workers
        required_dict = load_big_dict()
        score_guide = lambda row: expensive_computation(required_dict, row)
        return df.apply(score_guide, axis=1)

    library_ddf = dd.from_pandas(library_df, npartitions=128)
    library_ddf = library_ddf.map_partitions(do_df)
    library_ddf = library_ddf.compute()
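
The rest of the pipeline from the question (deduplication and writing the CSV) can stay as it was:

    library_ddf = library_ddf.drop_duplicates()
    library_ddf.to_csv(args.outfile, index=False)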
schmidt73
  • 181
  • 11