This is now a GitHub issue.
What does the parameter compute in Dask dataframe's set_index do?
df.set_index(col, compute=True)
The documentation says:
compute: bool, default False
- Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set compute=False, an immediate computation will still be triggered if divisions is None.
This would suggest that if I provide divisions and set compute=True, immediate computation will be triggered. This does not seem to be true, however.
import dask.datasets

df = dask.datasets.timeseries()

# Nothing gets submitted to the scheduler
df.set_index(
    'name',
    divisions=('Alice', 'Michael', 'Zelda'),
    compute=True
)
Going down the stack of functions that set_index actually calls, it appears that the only place where compute is actually used is in rearrange_by_column_disk. And indeed:
# Still, nothing gets submitted
df.set_index(
    'name',
    divisions=('Alice', 'Michael', 'Zelda'),
    shuffle='tasks',
    compute=True
)

# Something is computed here
df.set_index(
    'name',
    divisions=('Alice', 'Michael', 'Zelda'),
    shuffle='disk',
    compute=True
)
So what happens, exactly?
I suspect that the actual resulting partitions might be computed and saved to disk. If that's the case, then how could I tell this has happened?
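One way I could imagine checking this, sketched under the assumption that a local scheduler (threaded or synchronous) is in use rather than a distributed cluster, is to count executed tasks with a dask.callbacks.Callback subclass. The TaskCounter class and the inc helper below are hypothetical names for illustration only, and the demo uses a small dask.delayed graph instead of set_index so that it stays self-contained:

```python
import dask
from dask.callbacks import Callback


class TaskCounter(Callback):
    """Hypothetical helper: counts tasks finished by a local Dask scheduler."""

    def __init__(self):
        super().__init__()
        self.n_tasks = 0

    def _posttask(self, key, result, dsk, state, worker_id):
        # Called by the local scheduler after each task completes.
        self.n_tasks += 1


@dask.delayed
def inc(x):
    return x + 1


with TaskCounter() as counter:
    # Building the graph is lazy, so no task should have run yet.
    total = dask.delayed(sum)([inc(i) for i in range(3)])
    built = counter.n_tasks

    # Explicitly computing runs the graph and triggers the callback.
    result = total.compute(scheduler="synchronous")
    ran = counter.n_tasks

print(f"tasks after building: {built}, tasks after compute: {ran}")
```

If the same context manager were wrapped around a set_index call, a nonzero count right after the call returns would suggest that work was done eagerly. This is only a sketch: local callbacks are not invoked by the distributed scheduler, where the dashboard or task stream would have to be consulted instead.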