I'm trying to understand how df.persist()
works in dask
. Would I build the same expression again, would it recalculate it or load it from cache?
E.g. what happens when I do:
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.sum().compute())
del ddf
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())
Does dask
read the .csv
and shift by one twice, or the second time it comes from cache? Do I need the second .persist()
? If it keeps it in cache, how do I force cleaning the cache?