
Following the example here: YouTube: Dask-Pandas Dataframe Join, I am attempting to merge a ~70GB Dask dataframe with a ~24MB dataframe that I loaded as a Pandas dataframe.

The merge is on two columns A and B, and I did not set any as indices:

import dask.dataframe as dd
import pandas as pd
from dask.diagnostics import ProgressBar

small_df = pd.read_csv(dataframe1)  # the ~24MB file, loaded as a pandas DataFrame
large_df = dd.read_csv(dataframe2)  # the ~70GB file, loaded as a Dask DataFrame

# leftcolumns/rightcolumns are the lists of join column names, e.g. ['A', 'B']
df2 = large_df.merge(small_df, how='left', left_on=leftcolumns, right_on=rightcolumns)

A = df2[df2['some column'] == 'somevalue'] #do a reduction that would fit on my computer

pbar = ProgressBar()
pbar.register()

result = A.compute()

I'm using a Windows computer with 16GB of RAM and 4 cores. I use the progress bar to assess how far along the merging process is. I left it running all night last night. I restarted it this morning, and so far it's been about half an hour with 0% progress.

Thank you and I appreciate your help,

Update

I tried it on my Mac with 8GB of RAM and it worked pretty well. I believe I have the Dask distribution that comes with Anaconda. I don't think I did anything different in any case.

Here are my results and the time it took, following the code above (21 minutes):

In [26]: C = result1.compute()
[########################################] | 100% Completed | 21min 13.4s
[########################################] | 100% Completed | 21min 13.5s
[########################################] | 100% Completed | 21min 13.6s
[########################################] | 100% Completed | 21min 13.6s

Update 2

I updated to the latest version of Dask on my Windows computer and it worked well.

  • Just a side note, is using `df` instead of `df2` in the boolean indexing a mistake? `A = df2[df['some column'] == 'somevalue']` – Khris Sep 13 '16 at 12:57
  • Oh sorry, I just saw what you meant. Yes, that is a mistake. I fixed it. Thanks! – dleal Sep 13 '16 at 13:06
  • @dleal, since this question is resolved, could you provide the solution as an answer and accept it (encouraged even if it's your own question), to remove it from the unanswered list? Thanks! – Garrett Feb 09 '19 at 12:29
  • To add to @Garrett's comment: please be specific about which Dask version you had and which version you updated to. – vmg Feb 15 '19 at 09:55
  • You could have also used joblib and run the process on many cores. The general idea is to split the large dataset into an array of smaller ones (i.e. `[large_df_part1, ... , large_df_partN]`) and then use joblib to allocate them to different processors; a rough sketch of this idea follows these comments. – Quantum_Something Dec 11 '19 at 09:13
  • Use BigQuery to handle this; its resources and the easy way to query with SQL make it very easy. – Maryam Jul 08 '20 at 06:56
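A rough sketch of that joblib idea, assuming the large table can be processed as (or read in pieces into) pandas DataFrames and that the join keys are the columns A and B from the question; the chunk count and worker count are only illustrative:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def merge_chunk(chunk, lookup):
    # left-join one chunk of the large table against the small lookup table
    return chunk.merge(lookup, how='left', on=['A', 'B'])

# split the row positions of the large frame into 8 roughly equal chunks
chunk_indices = np.array_split(np.arange(len(large_df)), 8)

# run the chunk merges on 4 worker processes, then stitch the pieces back together
parts = Parallel(n_jobs=4)(
    delayed(merge_chunk)(large_df.iloc[idx], small_df) for idx in chunk_indices
)
result = pd.concat(parts, ignore_index=True)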

3 Answers


You can iterate over the values of the common column that appear in both frames and fill in the other columns in a loop:

# note: this assumes both frames are pandas DataFrames
union_set = set(small_df['common_column']) & set(large_df['common_column'])
for el in union_set:
    for column in small_df.columns:
        if column not in large_df.columns:
            # assumes each value appears in exactly one row of small_df
            large_df.loc[large_df['common_column'] == el, column] = \
                small_df.loc[small_df['common_column'] == el, column].values[0]



When working with big data, partitioning the data is very important, and at the same time having a cluster with enough memory is essential.

You can try using Spark.

Dask is a pure Python framework that does more of the same, i.e. it lets you run the same Pandas or NumPy code either locally or on a cluster. Apache Spark, on the other hand, brings a learning curve involving a new API and execution model, although it does have a Python wrapper.
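As a rough illustration only (the file paths and the join columns A/B below are placeholders, not taken from a working setup), the same left join could look roughly like this in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("large-small-join").getOrCreate()

# read both tables; Spark partitions the large CSV automatically
large = spark.read.csv("large.csv", header=True, inferSchema=True)
small = spark.read.csv("small.csv", header=True, inferSchema=True)

# broadcast the small table so the join avoids shuffling the large one
joined = large.join(broadcast(small), on=["A", "B"], how="left")

joined.write.parquet("joined_parquet")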

You can try partitioning the data and storing it in Parquet files.
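In Dask terms, that might look roughly like the sketch below; the partition count, the paths, and the pyarrow engine are assumptions, not requirements:

import dask.dataframe as dd

large_df = dd.read_csv("large.csv")

# rebalance into smaller partitions so each one fits comfortably in memory,
# then persist to Parquet for faster, column-pruned reads later
large_df = large_df.repartition(npartitions=200)
large_df.to_parquet("large_parquet/", engine="pyarrow")

# read it back as a partitioned Dask DataFrame for further work
large_df = dd.read_parquet("large_parquet/")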

Kriti Pawar

You can join a large Dask DataFrame to a smaller pandas DataFrame with Dask.

The code below creates a Dask DataFrame with multiple partitions and executes a left join with a small pandas DataFrame:

import dask.dataframe as dd
import pandas as pd

# create sample 'large' pandas dataframe
df_large = pd.DataFrame(
    {
        "Name": ["Azza", "Brandon", "Cedric", "Devonte", "Eli", "Fabio"], 
        "Age": [29, 30, 21, 57, 32, 19]
    }
)

# create multi-partition dask dataframe from pandas
large = dd.from_pandas(df_large, npartitions=2)

# create sample small pandas dataframe
small = pd.DataFrame(
    {
        "Name": ["Azza", "Cedric", "Fabio"], 
        "City": ["Beirut", "Dublin", "Rosario"]
    }
)

# merge dask dataframe to pandas dataframe
join = large.merge(small, how="left", on=["Name"])

# inspect results
join.compute()

I've written a blog post about merging Dask DataFrames here that may be a helpful resource for you. In particular, the notes on setting an index and on sorted vs. unsorted joins may help speed up your computations.
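For illustration, a minimal sketch of that index-based approach, reusing the `large` and `small` frames from the snippet above; whether it pays off depends on your data, so treat it as a starting point rather than the definitive recipe:

# setting the index sorts the Dask DataFrame by 'Name', so Dask knows which
# partition each key lives in and can avoid a full shuffle at merge time
large_indexed = large.set_index("Name")

# merge against the small pandas frame, also indexed by the join key
join = large_indexed.merge(
    small.set_index("Name"), how="left", left_index=True, right_index=True
)

join.compute()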

rrpelgrim