
I'm trying to understand how df.persist() works in Dask. If I build the same expression again, will it be recalculated or loaded from cache?

E.g. what happens when I do:

import dask.dataframe

ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.sum().compute())
del ddf
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())

Does Dask read the .csv and shift by one twice, or does the result come from cache the second time? Do I need the second .persist()? If it is kept in cache, how do I force the cache to be cleared?

Mark Horvath

1 Answer


When you call persist, Dask keeps the data in distributed memory so that you don't need to recompute that part of the computation a second time.

You can release the memory by deleting the collection, as you do on the third line of your example (del ddf).

If you delete the collection, then yes, you will need to re-persist the intermediate results.
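
A minimal sketch of that lifecycle, assuming a local distributed Client and a file named my.csv as in the question:

# Minimal sketch, assuming a local cluster and the 'my.csv' file from the question.
import dask.dataframe
from dask.distributed import Client

client = Client()  # local cluster; persisted results live in the workers' memory

# read_csv + shift run once; the resulting partitions are kept in distributed memory
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()

print(ddf.sum().compute())   # computed from the persisted partitions
print(ddf.mean().compute())  # reuses the same in-memory data, no second read of the CSV

del ddf  # last reference dropped, so the workers release that memory

# after deletion the results are gone, so the graph runs again here
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())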

https://distributed.dask.org/en/latest/memory.html

MRocklin
  • Thanks! It's now clear to me that persist only keeps the data in RAM and holds it until the futures returned by persist() are released. – Mark Horvath Sep 01 '19 at 14:45