In the following code, I would expect the first computation to take a little over 3 seconds and the second one to be much faster. What should I do to get dask to avoid redoing a computation that has already been run through the client? (I searched for an answer to this question, including on pure=True, and did not find anything.)

import time

from dask import delayed, compute
from dask.distributed import Client

@delayed(pure=True)
def foo(a):
    time.sleep(3)  # simulate an expensive computation
    return 1

foo_res = foo(1)

client = Client()

t1 = time.time()
results = compute(foo_res, get=client.get)
t2 = time.time()
print("Time : {}".format(t2-t1))


t1 = time.time()
results = compute(foo_res, get=client.get)
t2 = time.time()
print("Time : {}".format(t2-t1))

output:

Time : 3.01729154586792
Time : 3.0170397758483887
julienl

1 Answer

You need to use the persist method on the Client:

foo_res = client.persist(foo_res)

This will start the computation in the background and keep the results in memory for as long as some reference to foo_res exists in your Python session.

Relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html
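
To make the effect concrete, here is a minimal sketch of the question's example rewritten around client.persist. It makes a few assumptions that are not in the original: the sleep is shortened to 1 second, the client is created with processes=False so everything runs inside one process, and wait is used to block until the background computation has finished before timing starts.

```python
import time

from dask import delayed
from dask.distributed import Client, wait


@delayed(pure=True)
def foo(a):
    time.sleep(1)  # shortened from the question's 3 seconds (assumption)
    return 1


# processes=False keeps the scheduler and workers in this process;
# this is an assumption made to keep the demo compact.
client = Client(processes=False)

# persist starts the computation in the background and returns a new
# Delayed object that points at the result held by the cluster.
foo_res = client.persist(foo(1))
wait(foo_res)  # block until the background computation has finished

t1 = time.time()
first = foo_res.compute()  # fetches the stored result, does not call foo again
elapsed_first = time.time() - t1

t1 = time.time()
second = foo_res.compute()
elapsed_second = time.time() - t1

print("Time : {}".format(elapsed_first))
print("Time : {}".format(elapsed_second))

client.close()
```

Both timed calls should return well under the sleep duration, because the result stays in memory for as long as foo_res is referenced.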

MRocklin
  • Ah, I went through the docs but didn't find that, thanks! I really like how you hash the args, by the way: it is independent of object references but specific to the data. Very useful library, thanks! – julienl Jan 27 '17 at 14:33
  • The document you sent suggests that any of `persist`, `compute`, etc. would work so long as the reference remains. For example, even `foo_res_future = client.compute(foo_res)` seems to work for me. Is that correct? – julienl Jan 27 '17 at 17:06
  • Correct, but it should be `client.compute`, not `dask.compute`. Those are different for historical reasons. – MRocklin Jan 27 '17 at 17:52
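
The client.compute variant discussed in these comments could look roughly like the sketch below. As assumptions for illustration only, the sleep is shortened to 1 second and the client uses processes=False. The point it tries to show is that client.compute returns a Future, and that resubmitting the same delayed object is fast while that Future is still referenced.

```python
import time

from dask import delayed
from dask.distributed import Client


@delayed(pure=True)
def foo(a):
    time.sleep(1)  # shortened from the question's 3 seconds (assumption)
    return 1


# In-process cluster, chosen only to keep the demo compact (assumption).
client = Client(processes=False)

foo_res = foo(1)

# client.compute (not dask.compute) returns a Future immediately.
foo_res_future = client.compute(foo_res)
value = foo_res_future.result()  # blocks until foo has run once

# While foo_res_future is alive, the result stays in cluster memory,
# so submitting the same task again should not rerun foo.
t1 = time.time()
again = client.compute(foo_res).result()
elapsed = time.time() - t1

print("Time : {}".format(elapsed))

client.close()
```

If foo_res_future were deleted before the second call, the scheduler would be free to release the result, and foo would be expected to run again.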