I have a Dask dataframe and want to run several independent tasks on it. Some tasks are faster than others, but I only get the result of a fast task after the longer tasks have completed.

I created a local Client and use client.compute() to submit the tasks. Then I call future.result() to get the result of each task.

I'm using threads to request the results concurrently and measure how long each result takes to arrive, like this:

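import threading
import time

import dask.dataframe as dd
from dask.distributed import Client

def get_result(future, i):
    # Block until the future finishes and report how long the wait took
    t0 = time.time()
    print("calculating result", i)
    result = future.result()
    print("result {} took {}".format(i, time.time() - t0))

client = Client()
df = dd.read_csv(path_to_csv)

# Submit both filters; each call returns a future immediately
future1 = client.compute(df[df.x > 200])
future2 = client.compute(df[df.x > 500])

# Wait on each future in its own thread
threading.Thread(target=get_result, args=[future1, 1]).start()
threading.Thread(target=get_result, args=[future2, 2]).start()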

I expected the output of the above code to be something like:

calculating result 1
calculating result 2
result 2 took 10
result 1 took 46

Since the first task is larger.

But instead I got both results at the same time:

calculating result 1
calculating result 2
result 2 took 46.3046760559082
result 1 took 46.477620363235474

I assume that is because future2 actually computes in the background and finishes before future1, but waits until future1 is completed before returning.

Is there a way I can get the result of future2 at the moment it finishes?

1 Answer

You do not need to create threads to use futures asynchronously: they are already inherently async, and their status is monitored in the background. If you want to get results in the order they become ready, you should use as_completed.
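As a rough illustration of as_completed, here is a minimal sketch. It assumes an in-process, thread-based client, and slow_task is a hypothetical stand-in for real work (not something from the question), so the timings are artificial.

```python
import time

from dask.distributed import Client, as_completed


def slow_task(label, seconds):
    # Hypothetical stand-in for a real computation: sleep, then return a label.
    time.sleep(seconds)
    return label


def results_in_completion_order():
    # Assumption: a small in-process cluster with two threads, so both tasks
    # can run at the same time without spawning worker processes.
    with Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None) as client:
        futures = [
            client.submit(slow_task, "slow", 1.0),
            client.submit(slow_task, "fast", 0.1),
        ]
        order = []
        # as_completed yields each future as soon as it has finished,
        # regardless of the order in which the futures were submitted.
        for future in as_completed(futures):
            order.append(future.result())
        return order


if __name__ == "__main__":
    print(results_in_completion_order())
```

With two worker threads available, the fast task should come back first even though it was submitted second. No extra threads are needed on the client side.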

However, for your specific situation, you may want to simply view the dashboard (or use df.visualize()) to understand the computation that is happening. Both futures depend on reading the CSV, and this one task must finish before either filter can run; it probably takes the vast majority of the time. Dask does not know, without scanning all of the data, which rows have what value of x.
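The effect of a shared upstream task can be sketched without a CSV at all. In the example below, load and keep_above are hypothetical stand-ins (built with dask.delayed) for the CSV read and the two filters, and the one-second sleep is an arbitrary assumption chosen to make the read dominate.

```python
import time

import dask
from dask.distributed import Client


@dask.delayed
def load():
    # Hypothetical stand-in for the expensive CSV read both filters need.
    time.sleep(1.0)
    return list(range(1000))


@dask.delayed
def keep_above(rows, threshold):
    # Cheap filter that cannot start until `rows` has been loaded.
    return [r for r in rows if r > threshold]


def time_both_filters():
    # Assumption: a small in-process cluster with two threads.
    with Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None) as client:
        rows = load()  # both filters are built on this same delayed object
        start = time.time()
        first = client.compute(keep_above(rows, 200))
        second = client.compute(keep_above(rows, 500))
        second.result()
        t_second = time.time() - start
        first.result()
        t_first = time.time() - start
        return t_first, t_second


if __name__ == "__main__":
    t_first, t_second = time_both_filters()
    print("x > 200 took {:.2f}s, x > 500 took {:.2f}s".format(t_first, t_second))
```

Both timings should land close together, just over the duration of the load step, because neither filter can begin before the data it depends on exists. That mirrors the near-identical times in the question.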

mdurant