
I have a large (~180K row) dataframe for which

```python
df.compute()
```

hangs when running Dask with the distributed scheduler in local mode on an AWS m5.12xlarge (48 vCPUs). All the workers remain nearly idle. However,

```python
df.head(df.shape[0].compute(), -1)
```

completes quickly, with good utilization of the available cores.

Logically the two should be equivalent. What causes the difference? Is there a parameter I should pass to `compute` in the first version to speed it up?

Daniel Mahler

1 Answer


When you call .compute() you are asking for the entire result in your local process as a pandas dataframe. If that result is large then it might not fit. Do you need the entire result locally? If not then perhaps you wanted .persist() instead?

MRocklin
  • I do want the whole dataframe collected. Although large, it easily fits on my machine. The second snippet `df.head(df.shape[0].compute(), -1)` does what I need quite quickly. I am really asking why `compute` stalls. – Daniel Mahler Jun 18 '19 at 19:11
  • No clue. That depends on everything in your computation. It could be anything really. – MRocklin Jun 19 '19 at 14:14
  • ok, but why would `df.head(df.shape[0].compute(), -1)` not have the same problem as `compute()`? Don't they do the same thing? – Daniel Mahler Jun 20 '19 at 03:44