
I'm using dask.delayed to read many large CSV files:

import pandas as pd

def function_1(x1, x2):
    df_d1 = pd.read_csv(x1)
    # Some calculations on df_d1 using x2.
    return df_d1

def function_2(x3):
    df_d2 = pd.read_csv(x3)
    return df_d2

def function_3(df_d1, df_d2):
    # Some calculations and merging of the datasets (output is "merged_ds").
    return merged_ds
  • function_1: imports dataset 1 and does some calculations.
  • function_2: imports dataset 2.
  • function_3: merges the two datasets and does some calculations.

Next, I use a loop to call these functions with dask.delayed, roughly like the sketch below. I have many CSV files, and each file is more than 500 MB. Is this a suitable way to do these tasks with Dask (delayed)?
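The loop looks roughly like this; csv_list_1, csv_list_2 and x2 are placeholders for my real file lists and parameter:

import dask
from dask import delayed

results = []
for path_1, path_2 in zip(csv_list_1, csv_list_2):
    df_d1 = delayed(function_1)(path_1, x2)
    df_d2 = delayed(function_2)(path_2)
    merged_ds = delayed(function_3)(df_d1, df_d2)
    results.append(merged_ds)

# Trigger all the lazy tasks at the end.
outputs = dask.compute(*results)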

Eghbal

1 Answer


Yes, please go ahead and delay your functions and submit them to Dask. The most memory-heavy step is likely to be function_3, so consider how many of those merged results you can hold in memory at a time. Use the distributed scheduler to control how many workers and threads you have and their respective memory limits: https://distributed.readthedocs.io/en/latest/local-cluster.html
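For example, a local cluster could be set up like this; the worker count, thread count and memory limit are illustrative values to tune for your machine:

from dask.distributed import Client, LocalCluster

# Illustrative limits - adjust to the cores and RAM you actually have.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)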

Finally, I'm sure you do not want to return the final merged dataframes - they surely will not all fit in memory. You probably mean to aggregate over them or write them out to other files.
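One way to do that, sketched here with an assumed default merge and Parquet output, is to have function_3 write each result to disk and return only the path:

def function_3(df_d1, df_d2, out_path):
    merged_ds = df_d1.merge(df_d2)   # placeholder for the real merge/calculations
    merged_ds.to_parquet(out_path)   # write out instead of returning the big frame
    return out_path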

mdurant