
I am new to Dask and am having some trouble with it.

I am using a machine (4GB RAM, 2 cores) to analyse two CSV files (key.csv: ~2 million rows, about 300MB; sig.csv: ~12 million rows, about 600MB). The data doesn't fit in memory with pandas, so I switched to Dask.dataframe. What I expected is that Dask would process things in small chunks that fit in memory (it can be slower, I don't mind at all as long as it works), but somehow Dask still uses up all of the memory.

My code is below:

    import dask.dataframe as dd

    key = dd.read_csv("key.csv")
    sig = dd.read_csv("sig.csv")

    merge = dd.merge(key, sig, left_on=["tag", "name"],
                     right_on=["key_tag", "query_name"], how="inner")
    # write the result to disk since it doesn't fit in memory
    merge.to_csv("test2903_*.csv")

Did I make any mistakes? Any help is appreciated.

SultanOrazbayev

1 Answer


Big CSV files generally aren't the best fit for distributed compute engines like Dask. In this example, the CSVs are 600MB and 300MB, which isn't huge. As mentioned in the comments, you can set the blocksize when reading the CSVs to make sure they're read into Dask DataFrames with a sensible number of partitions.
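
For example, here's a minimal sketch of checking how a given blocksize translates into partitions (the specific value is just an illustration, not a tuned setting):

    import dask.dataframe as dd

    # smaller blocks -> more partitions -> lower peak memory per task
    sig = dd.read_csv("sig.csv", blocksize="100 MiB")
    print(sig.npartitions)  # roughly file_size / blocksize partitions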

Distributed joins generally run faster when you can broadcast the small DataFrame instead of shuffling both sides. Your machine has 4GB of RAM and the small DataFrame is 300MB, so it's small enough to broadcast. Dask automagically broadcasts pandas DataFrames when they're merged with a Dask DataFrame. You can convert a Dask DataFrame to a pandas DataFrame with compute().

key is the small DataFrame in your example. Pruning unused columns from the small DataFrame so it's even smaller before broadcasting is even better.
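
Here's a minimal sketch of pruning before converting, assuming only the join keys plus one extra column are needed downstream ("value" is a hypothetical column name standing in for whatever you actually use):

    # keep only the needed columns before pulling key into memory;
    # "tag" and "name" are the join keys, "value" is a hypothetical extra column
    key = dd.read_csv("key.csv", usecols=["tag", "name", "value"])
    key_pdf = key.compute()  # now a pandas DataFrame small enough to broadcast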

    import dask.dataframe as dd

    key = dd.read_csv("key.csv")
    sig = dd.read_csv("sig.csv", blocksize="100 MiB")

    # key is small, so pull it into memory as a pandas DataFrame
    key_pdf = key.compute()

    # merging a pandas DataFrame with a Dask DataFrame broadcasts it,
    # so the large sig DataFrame never has to be shuffled
    merge = dd.merge(key_pdf, sig, left_on=["tag", "name"],
                     right_on=["key_tag", "query_name"], how="inner")
    merge.to_csv("test2903_*.csv")

Here's an MVCE:

    import dask.dataframe as dd
    import pandas as pd

    # large side of the join: a Dask DataFrame split into two partitions
    df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4],
            "cities": ["Medellín", "Rio", "Bogotá", "Buenos Aires"],
        }
    )
    large_ddf = dd.from_pandas(df, npartitions=2)

    # small side of the join: a plain pandas DataFrame, so it gets broadcast
    small_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4],
            "population": [2.6, 6.7, 7.2, 15.2],
        }
    )

    merged_ddf = dd.merge(
        large_ddf,
        small_df,
        left_on=["id"],
        right_on=["id"],
        how="inner",
    )

    print(merged_ddf.compute())

This prints:

       id        cities  population
    0   1      Medellín         2.6
    1   2           Rio         6.7
    0   3        Bogotá         7.2
    1   4  Buenos Aires        15.2

Powers