0

I have three sources and a Dask DataFrame for each of them. I need to apply a function that computes an operation combining data from the three sources. The operation requires an internal state to be maintained during the computation (I can't change that).

The three sources are in Parquet format, and I read the data using the Dask DataFrame `read_parquet` function:

    import dask
    import dask.dataframe as dd

    @dask.delayed
    def load_data(data_path):
        # read one Parquet source as a Dask DataFrame
        ddf = dd.read_parquet(data_path, engine="pyarrow")
        return ddf

    results = []
    sources_path = ["/source1", "/source2", "/source3"]
    for source_path in sources_path:
        data = load_data(source_path)
        results.append(data)

I create another delayed function that executes the operation:

    @dask.delayed
    def process(sources):
        operation(sources[0][<list of columns>],
                  sources[1][<list of columns>],
                  sources[2][<list of columns>])

The `operation` function comes from a custom library. It cannot actually be parallelized because it has an internal state.

Reading the Dask documentation, it seems that calling `dd.read_parquet` inside a delayed function like this is not a best practice.

Is there a way to apply a custom function on multiple Dask DataFrames without using delayed functions?

Stefano Castoldi
  • You can apply your function to each partition using dask.dataframe.DataFrame.map_partitions – Michael Delgado Feb 15 '23 at 05:16
  • Unfortunately, I can't use `map_partitions`. The `operation` function requires access to all the data and cannot be parallelized. – Stefano Castoldi Feb 15 '23 at 09:23
  • Well then you just need more memory and shouldn’t use dask. Dask can’t just magically make the function run on a bigger dataset… you need to decide how you want to split up the work or else you have a hardware problem – Michael Delgado Feb 15 '23 at 09:50
  • If you’re running this on multiple dask dataframes and each one is small enough to fit into memory then you should use dask.delayed and pandas, *not dask.dataframe*. So that might be the confusion? – Michael Delgado Feb 15 '23 at 09:53
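
A minimal sketch of the delayed-plus-pandas approach suggested in the last comment, assuming each source fits in a worker's memory; the paths and the custom `operation` come from the question, and the import path for `operation` is hypothetical:

    import dask
    import pandas as pd

    # `operation` is the stateful function from the custom library mentioned in the question
    from my_custom_library import operation  # hypothetical import path

    @dask.delayed
    def load_source(path):
        # each source is assumed to fit in memory, so read it eagerly with pandas
        return pd.read_parquet(path, engine="pyarrow")

    @dask.delayed
    def process(df1, df2, df3):
        # the stateful operation sees the three full pandas DataFrames at once
        return operation(df1, df2, df3)

    sources_path = ["/source1", "/source2", "/source3"]
    frames = [load_source(p) for p in sources_path]
    result = process(*frames).compute()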

1 Answer

0

Is there a way to apply a custom function on multiple dask dataframe without using delayed function?

As indicated in the comments, there is no need to use Delayed at all when working with Dask DataFrames. Among other options, you have the `map_partitions` function, and plenty of other choices.
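
For example, a minimal sketch of the Delayed-free pattern (the per-partition function is a placeholder; the path is one of the question's sources):

    import dask.dataframe as dd

    # dd.read_parquet already returns a lazy Dask DataFrame, so no dask.delayed wrapper is needed
    ddf = dd.read_parquet("/source1", engine="pyarrow")

    def per_partition(part):
        # placeholder: any logic that can work on one pandas partition at a time
        return part

    result = ddf.map_partitions(per_partition).compute()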

However, since it seems your `operation` function needs the entire DataFrames in memory, there is no point in using Dask at all.
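
If everything does fit in memory, the plain-pandas version is simply (a sketch; the import of `operation` is hypothetical):

    import pandas as pd

    # `operation` is the stateful function from the question's custom library
    from my_custom_library import operation  # hypothetical import path

    # no Dask at all: load the three sources eagerly and call the operation once
    df1 = pd.read_parquet("/source1", engine="pyarrow")
    df2 = pd.read_parquet("/source2", engine="pyarrow")
    df3 = pd.read_parquet("/source3", engine="pyarrow")

    result = operation(df1, df2, df3)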

Guillaume EB
  • the `operation` function doesn't require that all data is in memory at the same time. It has an internal state, and data could be provided by a generator. `map_partitions` operates only on one dataset at a time. My function should combine data from different sources (datasets). – Stefano Castoldi Feb 24 '23 at 09:20
  • `map_partitions` can take several Dask DataFrames as input, as long as the partitions are aligned. But I'm not sure how you can do something about the internal state. – Guillaume EB Mar 03 '23 at 12:13
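
For reference, a minimal sketch of that multi-DataFrame form of `map_partitions` (the combining logic is a placeholder, and it still calls the function once per partition, so it does not address the internal-state issue):

    import pandas as pd
    import dask.dataframe as dd

    ddf1 = dd.read_parquet("/source1", engine="pyarrow")
    ddf2 = dd.read_parquet("/source2", engine="pyarrow")
    ddf3 = dd.read_parquet("/source3", engine="pyarrow")

    def combine(part1, part2, part3):
        # receives one aligned pandas partition from each DataFrame
        return pd.concat([part1, part2, part3], axis=1)

    # extra Dask DataFrames passed as positional arguments are matched partition by partition;
    # their divisions must be aligned for this to make sense
    result = ddf1.map_partitions(combine, ddf2, ddf3).compute()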