I have a dask dataframe and want to run several independent tasks on it. Some tasks are faster than others, but I only get the result of each task after the longer tasks have completed.
I created a local Client and use client.compute() to send tasks. Then I use future.result() to get the result of each task.
I'm using threads to request the results at the same time and measure how long each result takes to compute, like this:
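import threading
import time

import dask.dataframe as dd
from dask.distributed import Client

def get_result(future, i):
    # Block until the future finishes and report how long the wait took.
    t0 = time.time()
    print("calculating result", i)
    result = future.result()
    print("result {} took {}".format(i, time.time() - t0))

client = Client()
df = dd.read_csv(path_to_csv)
# Submit both filters before waiting on either result.
future1 = client.compute(df[df.x > 200])
future2 = client.compute(df[df.x > 500])
threading.Thread(target=get_result, args=[future1, 1]).start()
threading.Thread(target=get_result, args=[future2, 2]).start()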
I expect the output of the above code to be something like:
calculating result 1
calculating result 2
result 2 took 10
result 1 took 46
I expect this because the first task is larger.
But instead I got both results at the same time:
calculating result 1
calculating result 2
result 2 took 46.3046760559082
result 1 took 46.477620363235474
I assume that is because future2 actually computes in the background and finishes before future1, but its result is not returned until future1 has completed.
Is there a way I can get the result of future2 at the moment it finishes?
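
One pattern that may be relevant here is to consume futures in completion order instead of blocking on each one separately. The sketch below uses only the standard library's concurrent.futures as a stand-in, so it runs without a cluster; dask.distributed also exports an as_completed function that is iterated in the same way over the futures returned by client.compute(). The names slow_task and results_in_completion_order and the sleep durations are hypothetical, chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_task(label, seconds):
    """Hypothetical stand-in for a computation: sleep, then return the label."""
    time.sleep(seconds)
    return label


def results_in_completion_order(jobs):
    """Submit (label, seconds) jobs and collect each result as it finishes."""
    finished = []
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(slow_task, label, seconds) for label, seconds in jobs]
        # as_completed yields each future once it is done, regardless of
        # the order in which the futures were submitted.
        for future in as_completed(futures):
            label = future.result()  # does not block: this future is already done
            print("result {} took {:.2f}".format(label, time.time() - t0))
            finished.append(label)
    return finished


# Job 1 is the long one, job 2 the short one, mirroring the question.
order = results_in_completion_order([(1, 0.5), (2, 0.1)])
```

This only changes how results are collected. Both filters in the question have to read the whole CSV, so if that shared work dominates the run time, the two dask futures may genuinely finish at nearly the same moment, and no retrieval pattern would separate them.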