
It’s sometimes appealing to use dask.dataframe.map_partitions for operations like merges. In some scenarios, when merging a left_df and a right_df using map_partitions, I’d like to essentially pre-cache right_df before executing the merge to reduce network overhead / local shuffling. Is there any clear way to do this? It feels like it should be possible with client.scatter(the_df), client.run(func_to_cache_the_df), a combination of the two, or some other intelligent broadcasting.

It’s particularly salient in the context of a left join between a large left_df and a much smaller right_df that is essentially a lookup table. It feels like this right_df should be able to be read into memory and persisted/scattered to all workers/partitions before the merge, avoiding the need for cross-partition communication until the very end. How can I scatter right_df to do this successfully?

The following is a smaller example of this kind of imbalanced merge using cuDF and Dask (but conceptually this would be the same with pandas and Dask):

import cudf
import dask_cudf
import numpy as np
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)

np.random.seed(12)

nrows_left = 1000000
nrows_right = 1000

# large left table; small lookup-style right table
left = cudf.DataFrame({'a': np.random.randint(0, nrows_right, nrows_left),
                       'left_value': np.arange(nrows_left)})
right = cudf.DataFrame({'a': np.arange(nrows_right),
                        'lookup_val': np.random.randint(0, 1000, nrows_right)})

print(left.shape, right.shape)  # (1000000, 2) (1000, 2)

ddf_left = dask_cudf.from_cudf(left, npartitions=500)
ddf_right = dask_cudf.from_cudf(right, npartitions=2)

# merge each left partition against the (dask) right dataframe
def dask_merge(L, R):
    return L.merge(R, how='left', on='a')

result = ddf_left.map_partitions(dask_merge, R=ddf_right).compute()
result.head()
<cudf.DataFrame ncols=3 nrows=5 >
     a  left_value  lookup_val
0  219        1952         822
1  873        1953         844
2  908        1954         142
3  290        1955         810
4  863        1956         910

Nick Becker
  • You can scatter right_df by doing the following: scattered_df = client.scatter([right_df], broadcast=True). @mrocklin may have thoughts here as well on the general approach. – quasiben Jul 30 '19 at 18:56
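As a hedged sketch of that suggestion, combined with map_partitions: scattering the frame directly (rather than a list) returns a single Future, which can then be passed as an argument. merge_with_lookup is a hypothetical helper, and meta is supplied explicitly because Dask cannot infer output metadata from a Future:

# assumption: the distributed scheduler resolves the scattered Future when it
# appears as a map_partitions argument
right_future = client.scatter(right, broadcast=True)

def merge_with_lookup(part, lookup):
    return part.merge(lookup, how='left', on='a')

# build empty-frame metadata for the merged output
meta = left.head(0).merge(right.head(0), how='left', on='a')
result = ddf_left.map_partitions(merge_with_lookup, right_future, meta=meta).compute()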

1 Answer


If you do any of the following, things should be OK (see the sketch after this list):

  • A merge with a single-partition dask dataframe
  • A merge with a non-dask dataframe (like Pandas or cuDF)
  • A map_partitions with a non-dask dataframe (like Pandas or cuDF)
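For example, using the names from the question, the second and third options might look like the following sketch. Here right is the in-memory cudf DataFrame from the example, not the dask_cudf wrapper:

# A merge with a non-dask dataframe: Dask treats the small cudf frame as a
# constant and broadcasts it to the workers as needed.
result = ddf_left.merge(right, how='left', on='a').compute()

# A map_partitions with a non-dask dataframe: each partition is merged
# against the same small in-memory lookup table.
result = ddf_left.map_partitions(
    lambda part, lookup: part.merge(lookup, how='left', on='a'), right
).compute()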

What happens is this:

  1. The single partition is pushed out to a single worker
  2. During execution a few workers will duplicate that data, and then others will duplicate from those workers, and so on, communicating the data out in a tree
  3. The workers will do the merge as expected

This is about as fast as can be expected. However, if you're doing something like benchmarking and want to separate steps 1 and 2 from step 3, then you can use client.replicate:

left = ... # multi-partition dataframe
right = ... # single-partition dataframe
right = right.persist()  # make sure it exists in one worker
client.replicate(right)  # replicate it across many workers

... proceed as normal

This won't be any faster, but steps 1 and 2 will be pulled out into the replicate step.
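As a hedged, end-to-end sketch of that pattern with the question's variables (futures_of and wait come from dask.distributed; passing the collection's futures to client.replicate is one explicit way to say what should be copied):

from dask.distributed import wait, futures_of

ddf_right = ddf_right.repartition(npartitions=1)  # one partition (see the note below)
ddf_right = ddf_right.persist()                   # materialize it on one worker
wait(ddf_right)                                   # block until the data is in memory
client.replicate(futures_of(ddf_right))           # fan the partition out to all workers

# the merge itself now involves no broadcast step
result = ddf_left.merge(ddf_right, how='left', on='a').compute()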

In your example it looks like right has two partitions. You might want to change this to one; in that case Dask takes a different code path that is essentially just a map_partitions.
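For instance, building it with a single partition from the start (a sketch with the question's variables):

ddf_right = dask_cudf.from_cudf(right, npartitions=1)  # single partition
result = ddf_left.merge(ddf_right, how='left', on='a').compute()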

MRocklin