
I am trying to create an index on a large dask dataframe. No matter which scheduler I use, I am unable to utilize more than the equivalent of one core for the operation. The code is:

import dask.dataframe as ddf  # assuming ddf is an alias for the dask.dataframe module

(ddf
 .read_parquet(pq_in)
 .set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
 .to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
 .compute(scheduler=my_scheduler)
)

I am running this on a single 64-core machine. What can I do to utilize more cores? Or is set_index inherently sequential?

Daniel Mahler

1 Answer


That should use multiple cores, though using disk for shuffling can introduce other bottlenecks, such as your local hard drive. In that case the computation often isn't bound by CPU at all, so additional cores don't help.
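
If the disk shuffle does turn out to be the bottleneck, one option (a sketch of an alternative, not something prescribed above) is dask's task-based shuffle, which routes the shuffle through the task graph instead of staging intermediate partitions on local disk. It reuses pq_in and pq_out from the question:

import dask.dataframe as dd

# Same pipeline as in the question, but with shuffle='tasks'.
# The task-based shuffle avoids the temporary on-disk partitions of
# shuffle='disk' and pairs best with the distributed scheduler.
ddf = dd.read_parquet(pq_in)
ddf = ddf.set_index('title', drop=True, npartitions='auto', shuffle='tasks')
ddf.to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True)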

In your situation I would use the distributed scheduler on a single machine so that you can use the diagnostic dashboard to get more insight into your computation.
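
A minimal single-machine setup might look like the sketch below; the worker and thread counts are only illustrative, so tune them for your 64-core box (pq_in and pq_out are the paths from the question):

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Start several worker processes on this one machine and register them
# as the default scheduler for subsequent dask computations.
cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
# The diagnostic dashboard is served at http://localhost:8787 by default.

ddf = dd.read_parquet(pq_in)
(ddf
 .set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
 .to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
 .compute()  # with a Client registered, this runs on the distributed scheduler
)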

MRocklin
  • Switching to the distributed scheduler and setting `shuffle='disk'` improves parallelism, but it seems to make dask try to load all the data into memory. Is it possible to do a parallel shuffle with larger-than-memory data? – Daniel Mahler Nov 23 '18 at 03:41
  • Actually my data does fit into memory. The problem is that the distributed scheduler seems to be loading the whole dataset into each worker process. – Daniel Mahler Nov 23 '18 at 04:39