
I am trying to create an index on a large dask dataframe. No matter which scheduler I use, I am unable to utilize more than the equivalent of one core for the operation. The code is:

import dask.dataframe as ddf  # assuming ddf is an alias for the dask.dataframe module

(ddf
 .read_parquet(pq_in)
 .set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
 .to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
 .compute(scheduler=my_scheduler)
)

I am running this on a single 64-core machine. What can I do to utilize more cores? Or is set_index inherently sequential?

Daniel Mahler

1 Answer


That should use multiple cores, though using disk for shuffling can introduce other bottlenecks, such as your local hard drive. In that case the computation often isn't bound by CPU at all, so additional cores don't help.
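
If the disk shuffle does turn out to be the bottleneck, one option (a sketch of an alternative, not something prescribed above) is dask's task-based shuffle, which routes the shuffle through the task graph instead of staging intermediate partitions on local disk. It reuses pq_in and pq_out from the question:

import dask.dataframe as dd

# Same pipeline as in the question, but with shuffle='tasks'.
# The task-based shuffle avoids the temporary on-disk partitions of
# shuffle='disk' and pairs best with the distributed scheduler.
ddf = dd.read_parquet(pq_in)
ddf = ddf.set_index('title', drop=True, npartitions='auto', shuffle='tasks')
ddf.to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True)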

In your situation I would use the distributed scheduler on a single machine so that you can use the diagnostic dashboard to get more insight into your computation.
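
A minimal single-machine setup might look like the sketch below; the worker and thread counts are only illustrative, so tune them for your 64-core box (pq_in and pq_out are the paths from the question):

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Start several worker processes on this one machine and register them
# as the default scheduler for subsequent dask computations.
cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
# The diagnostic dashboard is served at http://localhost:8787 by default.

ddf = dd.read_parquet(pq_in)
(ddf
 .set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
 .to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
 .compute()  # with a Client registered, this runs on the distributed scheduler
)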

MRocklin
  • Switching to the distributed scheduler and setting `shuffle='disk'` improves parallelism, but it seems to make dask try to load all the data into memory. Is it possible to do a parallel shuffle with larger-than-memory data? – Daniel Mahler Nov 23 '18 at 03:41
  • Actually my data does fit into memory. The problem is that the distributed scheduler seems to be loading the whole dataset into each worker process. – Daniel Mahler Nov 23 '18 at 04:39