I have a large (~180K row) dataframe for which
df.compute()
hangs when running Dask with the distributed scheduler in local mode on an AWS m5.12xlarge (98 cores). All the workers remain nearly idle. However,
df.head(df.shape[0].compute(), -1)
completes quickly, with good utilization of the available cores.
Logically, the two calls should be equivalent, since both return every row as a pandas dataframe. What causes the difference?
Is there some parameter I should pass to compute
in the first version to speed it up?