It’s sometimes appealing to use `dask.dataframe.map_partitions` for operations like merges. In some scenarios, when merging a `left_df` and a `right_df` using `map_partitions`, I’d like to essentially pre-cache `right_df` before executing the merge to reduce network overhead and local shuffling. Is there a clear way to do this? It feels like it should be possible with `client.scatter(the_df)`, `client.run(func_to_cache_the_df)`, some combination of the two, or some other intelligent broadcasting.

This is particularly salient when doing a left join of a large `left_df` with a much smaller `right_df` that is essentially a lookup table. It feels like this `right_df` should be read into memory once and persisted/scattered to all workers/partitions before the merge, so that no cross-partition communication is needed until the very end. How can I scatter `right_df` to accomplish this?
The following is a smaller example of this kind of imbalanced merge using cuDF and Dask (but conceptually this would be the same with pandas and Dask):
```python
import pandas as pd
import cudf
import dask_cudf
import numpy as np
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)

np.random.seed(12)
nrows_left = 1000000
nrows_right = 1000

left = cudf.DataFrame({'a': np.random.randint(0, nrows_right, nrows_left),
                       'left_value': np.arange(nrows_left)})
right = cudf.DataFrame({'a': np.arange(nrows_right),
                        'lookup_val': np.random.randint(0, 1000, nrows_right)})
print(left.shape, right.shape)  # (1000000, 2) (1000, 2)

ddf_left = dask_cudf.from_cudf(left, npartitions=500)
ddf_right = dask_cudf.from_cudf(right, npartitions=2)

def dask_merge(L, R):
    return L.merge(R, how='left', on='a')

result = ddf_left.map_partitions(dask_merge, R=ddf_right).compute()
result.head()
```
```
<cudf.DataFrame ncols=3 nrows=5 >
     a  left_value  lookup_val
0  219        1952         822
1  873        1953         844
2  908        1954         142
3  290        1955         810
4  863        1956         910
```