
I am trying to merge multiple pandas dataframes onto a large Dask dataframe with fields ["a_id", "b_id", "c_id"]. Each of the pandas dataframes "A", "B", and "C" has a unique key field ("a_id", "b_id", and "c_id" respectively) that joins it to the Dask dataframe. "B" and "C" also share a field "b_Field1":

import pandas as pd
import dask.dataframe as dd

A = pd.DataFrame({'a_id': [1, 2, 3], 'a_Field1': [0, 0, 0]})
B = pd.DataFrame({'b_id': [3, 4, 5], 'b_Field1': [7, 8, 9]})
C = pd.DataFrame({'c_id': [4, 5, 6], 'b_Field1': [6, 7, 8], 'c_Field1': [10, 11, 12]})

# Build the base frame and left-join each lookup table; the final
# merge joins on two keys.
pdf = pd.DataFrame({'a_id': [1, 2], 'b_id': [3, 4], 'c_id': [5, 6]})
pdf = pdf.merge(A, how="left", on="a_id")
pdf = pdf.merge(B, how="left", on="b_id")
pdf = pdf.merge(C, how="left", on=["c_id", "b_Field1"])

print(pdf)

"""
Result:
   a_id  b_id  c_id  a_Field1  b_Field1  c_Field1
0     1     3     5         0         7        11
1     2     4     6         0         8        12
"""

# Convert everything to Dask and repeat the same sequence of merges
dA = dd.from_pandas(A, npartitions=1)
dB = dd.from_pandas(B, npartitions=1)
dC = dd.from_pandas(C, npartitions=1)
ddf = dd.from_pandas(pdf, npartitions=1)

ddf = ddf.merge(dA, how="left", on="a_id")
ddf = ddf.merge(dB, how="left", on="b_id")
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"])

This fails, saying that there is no field "b_Field1" in ddf. My thinking was that I need to execute a .compute() between merging B and C, but this causes Dask to hang endlessly at 40% on the progress bar (it eventually dies with a MemoryError).
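
For reference, the attempted workaround looked roughly like this (a sketch of the idea, not the exact code I ran):

# Sketch: materialize the intermediate result so "b_Field1" exists as a
# concrete column, then wrap it back in a Dask dataframe for the final merge.
ddf = ddf.merge(dB, how="left", on="b_id")
intermediate = ddf.compute()                      # hangs at ~40%, then MemoryError
ddf = dd.from_pandas(intermediate, npartitions=1)
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"])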

Is the compute necessary before doing the final join? And if so, is there any reason it would hang? These datasets are just barely small enough to merge in pure pandas, where the merge happens quickly, but I'm trying to make this deployable on machines with less RAM.

triphook

1 Answer


If you examine the dataframe just before the final line, you will find it has these columns:

a_id  b_id  c_id  a_Field1_x  b_Field1_x  c_Field1  a_Field1_y  b_Field1_y

i.e., b_Field1 has been split in two, and it turns out that the two copies are identical. This is probably a bug in Dask, since, as you show, the same does not happen in pandas. However, setting an appropriate index or tuning the optional arguments to merge may work around it.
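
For example, merge accepts a suffixes argument in both pandas and Dask, which controls how colliding column names are renamed. A sketch of that idea (untested against this exact case; the "_dup" suffix is arbitrary):

# Sketch: keep the left frame's column names intact on collision, so the
# final two-key merge can still find "b_Field1".
ddf = ddf.merge(dA, how="left", on="a_id", suffixes=("", "_dup"))
ddf = ddf.merge(dB, how="left", on="b_id", suffixes=("", "_dup"))
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"], suffixes=("", "_dup"))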

With the dataframe as it is, you can do

ddf = ddf.merge(dC.rename(columns={'b_Field1': 'b_Field1_x'}),
                how="left", on=["c_id", "b_Field1_x"])

although now you get duplicated c_ columns instead.
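
If the duplicated pairs really are identical, one way to clean up afterwards is a sketch like this (verify that assumption on your data before dropping anything):

# Sketch: drop the "_y" duplicates and strip the "_x" suffix, assuming each
# _x/_y pair holds identical values.
dup = [c for c in ddf.columns if c.endswith('_y')]
ddf = ddf.drop(columns=dup)
ddf = ddf.rename(columns={c: c[:-2] for c in ddf.columns if c.endswith('_x')})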

On the general memory issue, this is discussed at great length elsewhere. Be sure to choose your partition sizes, index, and number of workers carefully.
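
Purely as an illustration (the numbers below are placeholders, not recommendations), the usual knobs look something like this:

from dask.distributed import Client, LocalCluster

# Placeholder values: tune worker count, per-worker memory, and partition
# count to your machine; smaller partitions need less RAM per task.
client = Client(LocalCluster(n_workers=4, memory_limit='2GB'))
ddf = ddf.repartition(npartitions=8)
ddf = ddf.set_index('a_id')   # an index can make subsequent joins cheaper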

mdurant