
I'm using dask.delayed to read many large CSV files:

import pandas as pd

def function_1(x1, x2):
    df_d1 = pd.read_csv(x1)
    # Some calculations on df_d1 using x2.
    return df_d1

def function_2(x3):
    df_d2 = pd.read_csv(x3)
    return df_d2

def function_3(df_d1, df_d2):
    # Some calculations and merging of the datasets (output is "merged_ds").
    return merged_ds
  • function_1: imports dataset 1 and does some calculations.
  • function_2: imports dataset 2.
  • function_3: merges the two datasets and does some calculations.

Next, I use a loop to call these functions with dask.delayed, roughly like the sketch below. I have many CSV files, and each file is more than 500 MB. Is this a suitable way to do these tasks with Dask (delayed)?
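The loop looks roughly like this; csv_list_1, csv_list_2 and x2 are placeholders for my real file lists and parameter:

import dask
from dask import delayed

results = []
for path_1, path_2 in zip(csv_list_1, csv_list_2):
    df_d1 = delayed(function_1)(path_1, x2)
    df_d2 = delayed(function_2)(path_2)
    merged_ds = delayed(function_3)(df_d1, df_d2)
    results.append(merged_ds)

# Trigger all the lazy tasks at the end.
outputs = dask.compute(*results)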

Eghbal

1 Answer


Yes, please go ahead and delay your functions and submit them to Dask. The most memory-heavy step is likely to be function_3, so consider how many of those merged results you can hold in memory at a time. Use the distributed scheduler to control how many workers and threads you have and their respective memory limits: https://distributed.readthedocs.io/en/latest/local-cluster.html
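For example, a local cluster could be set up like this; the worker count, thread count and memory limit are illustrative values to tune for your machine:

from dask.distributed import Client, LocalCluster

# Illustrative limits - adjust to the cores and RAM you actually have.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)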

Finally, I'm sure you do not want to return the final merged dataframes - they surely will not all fit in memory. You probably mean to aggregate over them or write them out to other files.
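One way to do that, sketched here with an assumed default merge and Parquet output, is to have function_3 write each result to disk and return only the path:

def function_3(df_d1, df_d2, out_path):
    merged_ds = df_d1.merge(df_d2)   # placeholder for the real merge/calculations
    merged_ds.to_parquet(out_path)   # write out instead of returning the big frame
    return out_path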

mdurant