I have been working on a Python script that uses Dask to speed up processing time. At a high level, the script calls a Dask delayed function a number of times to perform new computations, and each call is independent of the previous ones. A simplified example of what I have is shown below.

import dask

def func1(list1):
    output = []  # collect the delayed tasks before computing them
    for subList in list1:
        t2 = dask.delayed(func2)(subList)
        output.append(t2)
    output = dask.compute(*output)
    return output

def func2(subList):
    # Some operations
    return tuple2  # large tuple containing a mix of lists and numpy arrays

if __name__ == '__main__':

    largeList = [..]

    for list1 in largeList:
        output = func1(list1)
        print(output)  

I noticed that as the program executed, each call to func1 took gradually longer to complete. At first I believed this was a memory issue, because the output variable is typically a large tuple containing many arrays and lists. However, watching the Dask dashboard while the program was running, the 'Bytes Stored' plot didn't seem to max out. I know this is a very vague example, but does anybody have any ideas about why func1 slows down the more times it is called? My hunch is still that the large tuple output has something to do with the issue. If so, how can I fix this problem? Any feedback would be greatly appreciated.
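In case it helps anyone reproduce the behaviour, below is a minimal sketch of how I have been testing my hypothesis, assuming the work runs under an explicit dask.distributed Client (which is how I get the dashboard). The timing loop, the dummy data, and the client.restart() call are my additions for diagnosis, not part of the real script: restart() wipes worker state between outer iterations, so if the per-call times stop growing with it enabled, the slowdown presumably comes from state accumulating on the scheduler/workers rather than from func2 itself.

import time
import dask
from dask.distributed import Client

def func2(subList):
    # Placeholder for the real work; the real version returns a
    # large tuple mixing lists and numpy arrays
    return (subList, [x * 2 for x in subList])

def func1(list1):
    output = []
    for subList in list1:
        output.append(dask.delayed(func2)(subList))
    return dask.compute(*output)

if __name__ == '__main__':
    client = Client()  # local cluster; dashboard at the printed address

    # Dummy stand-in for my real largeList
    largeList = [[[i, i + 1] for i in range(100)] for _ in range(20)]

    for n, list1 in enumerate(largeList):
        start = time.perf_counter()
        output = func1(list1)
        print(f"call {n}: {time.perf_counter() - start:.2f}s")

        # Diagnostic only: clear all worker state between calls.
        # If the timings stay flat with this line, accumulated state
        # is likely the culprit.
        client.restart()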

rmsrms1987
