
I have a large (~180K row) dataframe for which

```python
df.compute()
```

hangs when running Dask with the distributed scheduler in local mode on an AWS m5.12xlarge (48 vCPUs). All the workers remain nearly idle. However,

```python
df.head(df.shape[0].compute(), -1)
```

completes quickly, with good utilization of the available cores.

Logically the two should be equivalent. What causes the difference? Is there a parameter I should pass to `compute` in the first version to speed it up?

Daniel Mahler

1 Answer


When you call .compute() you are asking for the entire result in your local process as a pandas dataframe. If that result is large then it might not fit. Do you need the entire result locally? If not then perhaps you wanted .persist() instead?

MRocklin
  • I do want the whole dataframe collected. Although large, it easily fits on my machine. The second snippet `df.head(df.shape[0].compute(), -1)` does what I need quite quickly. I am really asking why `compute` stalls. – Daniel Mahler Jun 18 '19 at 19:11
  • No clue. That depends on everything in your computation. It could be anything really. – MRocklin Jun 19 '19 at 14:14
  • ok, but why would `df.head(df.shape[0].compute(), -1)` not have the same problem as `compute()`? Don't they do the same thing? – Daniel Mahler Jun 20 '19 at 03:44