I have a small dataframe (~100 MB) and an expensive computation that I want to perform for each row. The computation is not vectorizable; it requires some parsing and a database lookup for each row.
As such, I have decided to try Dask to parallelize the task. The task is "embarrassingly parallel", and neither order of execution nor repeated execution is an issue. However, for some unknown reason, memory usage blows up to ~100 GB.
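For context, the per-row work is shaped roughly like this (parse_key, db_lookup, and score are stand-in names for the real parsing, database, and scoring code, not the actual implementation):

def expensive_computation(required_dict, row):
    # Stand-in sketch of the per-row work: parse a key from the row,
    # consult the big read-only dict, and make one DB query per row.
    key = parse_key(row)                  # hypothetical parsing helper
    precomputed = required_dict[key]      # lookup in the ~1.5 GB dict
    offtargets = db_lookup(key)           # hypothetical DB round-trip
    return score(row, precomputed, offtargets)  # returns the enriched row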
Here is the offending code sample:
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import LSFCluster
cluster = LSFCluster(memory="6GB", cores=1, project='gRNA Library Design')
cluster.scale(jobs=16)
client = Client(cluster)
required_dict = load_big_dict()
score_guide = lambda row: expensive_computation(required_dict, row)
library_df = pd.read_csv(args.library_csv)
meta = library_df.dtypes
meta = pd.concat([meta, pd.Series({  # Series.append was removed in pandas 2.0
    'specificity': np.dtype('int64'),
    'cutting_efficiency': np.dtype('int64'),
    '0 Off-targets': np.dtype('object'),
    '1 Off-targets': np.dtype('object'),
    '2 Off-targets': np.dtype('object'),
    '3 Off-targets': np.dtype('object')})])
library_ddf = dd.from_pandas(library_df, npartitions=32)
library_ddf = library_ddf.apply(score_guide, axis=1, meta=meta)
library_df_scored = library_ddf.compute()  # compute() returns a plain pandas DataFrame
library_df_scored = library_df_scored.drop_duplicates()
library_df_scored.to_csv(args.outfile, index=False)
My guess is that the big dictionary required for the lookup is somehow the issue, but its size is only ~1.5 GB in total and it is not included in the resulting dataframe.
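For reference, sys.getsizeof drastically understates nested containers (it only counts the outer hash table), so the ~1.5 GB figure comes from a deep measurement along these lines, with pympler as one option:

from pympler import asizeof

deep_bytes = asizeof.asizeof(required_dict)  # recursively sizes keys and values
print(f"required_dict deep size: {deep_bytes / 1e9:.2f} GB")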
Why might Dask be blowing up memory usage?