I am trying to set an index on a large dask dataframe. No matter which scheduler I use, the operation never utilizes more than the equivalent of one core. The code is:
import dask.dataframe as ddf

# pq_in, pq_out and my_scheduler are defined elsewhere
(ddf
    .read_parquet(pq_in)
    .set_index('title', drop=True, npartitions='auto', shuffle='disk', compute=False)
    .to_parquet(pq_out, engine='fastparquet', object_encoding='json', write_index=True, compute=False)
    .compute(scheduler=my_scheduler)
)
I am running this on a single 64-core machine. What can I do to utilize more cores? Or is set_index inherently sequential?