
I'm trying to understand how df.persist() works in Dask. If I build the same expression again, will it be recalculated or loaded from cache?

E.g. what happens when I do:

import dask.dataframe

ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.sum().compute())
del ddf
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())

Does Dask read the .csv and shift by one twice, or does the result come from cache the second time? Do I need the second .persist()? If it is kept in cache, how do I force the cache to be cleared?

Mark Horvath

1 Answer


When you call persist, Dask keeps the data in distributed memory so that you don't need to recompute that part of the computation a second time.

You can release the memory by deleting the collection, as you do on the third line of your example (del ddf).

If you delete the collection, then yes, you will need to re-persist the intermediate results.
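
A minimal sketch of that lifecycle, assuming a local distributed Client and a file named my.csv as in the question:

# Minimal sketch, assuming a local cluster and the 'my.csv' file from the question.
import dask.dataframe
from dask.distributed import Client

client = Client()  # local cluster; persisted results live in the workers' memory

# read_csv + shift run once; the resulting partitions are kept in distributed memory
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()

print(ddf.sum().compute())   # computed from the persisted partitions
print(ddf.mean().compute())  # reuses the same in-memory data, no second read of the CSV

del ddf  # last reference dropped, so the workers release that memory

# after deletion the results are gone, so the graph runs again here
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())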

https://distributed.dask.org/en/latest/memory.html

MRocklin
  • Thanks! It's now clear to me that persist only keeps the data in RAM and holds it until the futures returned by persist() are released. – Mark Horvath Sep 01 '19 at 14:45