
I'm creating a function that reads an entire folder, creates a Dask dataframe, then processes the partitions of this dataframe and sums the results, like this:

import dask.dataframe as dd
from dask import delayed, compute

def partitions_func(folder):
    df = dd.read_csv(f'{folder}/*.csv')
    partial_results = []
    # each element of df.partitions is itself a lazy Dask dataframe
    for partition in df.partitions:
        partial = another_function(partition)
        partial_results.append(partial)
    # sum the delayed partial results, also lazily
    total = delayed(sum)(partial_results)
    return total

The function being called in partitions_func (another_function) is also delayed.

@delayed
def another_function(partition):
    # Partition processing
    return result

I checked and the variables created during the processing are all small, so they shouldn't cause any issues. The partitions can be quite large but not larger than the available RAM.

When I execute partitions_func(folder), the process gets killed. At first, I thought the problem had to do with having two levels of delayed: one on another_function and one on delayed(sum).

Removing the delayed decorator from another_function causes issues, because its argument is then a Dask dataframe and operations like tolist() fail on it. I also tried removing delayed from sum, because I thought it could be a problem with parallelisation and the available resources, but the process still gets killed.

However, I know there are 5 partitions. If I remove the statement total = delayed(sum)(partial_results) from partitions_func and compute the sum "manually" instead, everything works as expected:

total = partial_results[0].compute() + partial_results[1].compute() + partial_results[2].compute() \
        + partial_results[3].compute() + partial_results[4].compute()
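
(For reference, this manual version can also be written without hard-coding the five partitions, using the compute already imported from dask; a sketch, assuming each partial result is a number:)

# compute all partials in a single call, then sum the concrete results
total = sum(compute(*partial_results))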

Thanks!

– 6659081

1 Answer


A Dask dataframe is itself built from a series of delayed objects, so when you call a delayed function such as another_function on a partition, you get nested delayed objects, which dask.compute will not be able to handle. One option is to use .map_partitions(); the typical example is df.map_partitions(len).compute(), which will compute the length of each partition. So if you can rewrite another_function to accept a pandas dataframe and remove the delayed decorator, then your code will roughly look like this:

df = dd.read_csv(f'{folder}/*.csv')
total = df.map_partitions(another_function)

Now total is a lazy Dask object which you can pass to dask.compute (or simply run total = df.map_partitions(another_function).compute()).
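
Putting the pieces together, a minimal end-to-end sketch might look like the following. Here the body of another_function and the 'value' column are hypothetical stand-ins for your real processing; the point is that the function returns one number per partition, and the per-partition results are reduced with a single .sum().compute():

import dask.dataframe as dd

def another_function(partition):
    # partition arrives here as a concrete pandas DataFrame
    # 'value' is a hypothetical column name standing in for your data
    return partition['value'].sum()

df = dd.read_csv(f'{folder}/*.csv')
# one scalar per partition, still lazy; dask may ask for an explicit
# meta= argument if it cannot infer the output type
partials = df.map_partitions(another_function)
total = partials.sum().compute()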

– SultanOrazbayev
  • Hi @SultanOrazbayev, what you say is right. The code now runs, but very slowly. I can split the data and run the code with Pandas in less than an hour (sequentially), but with `dask` it never finishes (my real data is even larger, so I need `dask`). I tried a simpler version of `another_function` (random operations on the dataframes) with `map_partitions`, and it ran quickly. However, when I add the real method (a weighted binning where the rows are points), it gets slow. I checked the UI and `top`, and saw that the workers use very little of the resources and the processes are sleeping most of the time. – 6659081 Feb 05 '21 at 18:29
  • Are you using any libraries inside `another_function`? If so, try importing those libraries within the function, as sketched below the comments (I had a similar experience with some libraries). – SultanOrazbayev Feb 05 '21 at 18:31
  • Hi, yes, I was using a function from a library which was calling Cython, so I ended up coding the method myself. Not what I'd have wanted, but it works now. Thanks! – 6659081 Feb 18 '21 at 17:09
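
For reference, a minimal sketch of the import-inside-the-function pattern suggested in the comments (numpy here is just an illustrative stand-in for whichever library another_function actually uses):

def another_function(partition):
    # importing inside the function means the import happens on the
    # worker process that executes it, which avoids serialisation
    # problems with some libraries
    import numpy as np
    # hypothetical processing: sum a 'value' column with numpy
    return float(np.sum(partition['value'].to_numpy()))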