
I understand that set_index is a really compute-intensive operation.

The documentation says it's the kind of operation to avoid, and, if needed, to perform right after ingesting the data.

I currently have parquet files I need to index on one column.

The code is straightforward:

import dask.dataframe as dd

df = dd.read_parquet(input_path)
df = df.set_index(index_column)   # triggers a full shuffle on index_column
df.to_parquet(output_path)

I repartitioned the input files to get about 300 partitions of roughly 1 GB each in memory (as snappy-compressed parquet, a partition is ~70-80 MB on disk); a rough sketch of that step follows.
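For reference, a minimal sketch of that repartition step (the paths and the exact partition count here are illustrative placeholders, not my real job):

import dask.dataframe as dd

df = dd.read_parquet("raw/")          # placeholder input path
df = df.repartition(npartitions=300)  # ~1 GB per partition in memory
df.to_parquet("repartitioned/")       # snappy parquet, so ~70-80 MB per file on disk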

I'm using a Dask Kubernetes cluster (workers: 2 CPUs, memory: 16 or 32 GB) and I was expecting this kind of operation to work even with a low number of workers (more workers only to speed up the process); the cluster setup is sketched below.
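For context, the cluster is created roughly like this (a sketch using the classic dask-kubernetes API; the image name and worker count are placeholders, not my actual spec):

from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",          # placeholder worker image
    cpu_limit=2, cpu_request=2,
    memory_limit="16G", memory_request="16G",
)
cluster = KubeCluster(pod_spec)
cluster.scale(4)                          # a small number of workers to start
client = Client(cluster)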

I see memory usage climb toward the limit and workers get killed. Does the cluster need to be able to hold the full dataframe in memory across the workers?

What can I do to make this work even with very large dataframes? And how can I debug it if needed (local storage per worker, ...)? A sketch of the worker memory/spill settings that seem relevant is below.
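For reference, a sketch of the worker-memory / spill-to-disk knobs I assume are relevant here (the threshold values shown are the distributed defaults; the scratch path is a placeholder):

import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})

# Workers can also be pointed at a scratch directory for spilled data, e.g.:
#   dask-worker tcp://scheduler:8786 --local-directory /scratch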

DavidK
  • No, you don't need to be able to load the entire dataframe in memory. Do you have logs for why the worker was killed? Is it possible to run outside of k8s? – quasiben May 27 '20 at 21:06
  • I did not try outside of k8s. I do not see anything in the logs explaining why a worker is killed. – DavidK May 28 '20 at 09:20
  • I'm getting distributed.comm.core.CommClosedError: in : Stream is closed errors, but it's hard to reproduce with a simple example ... – DavidK May 28 '20 at 14:56
  • Can you post the full message? – quasiben May 28 '20 at 17:20
  • This error for example : OSError: Timed out trying to connect to 'tcp://...:39661' after 60 s: Timed out trying to connect to 'tcp://...:39661' after 60 s: connect() didn't finish in time – DavidK Jun 01 '20 at 08:30
  • Any errors in the scheduler? – quasiben Jun 01 '20 at 14:07
  • Not really; sometimes this one (File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 3215, in replicate assert count > 0 AssertionError), but it does not seem to impact the processing. – DavidK Jun 02 '20 at 10:35
  • Any luck on this? I'm running into the same issue with a huge dataframe. I'm able to successfully set the new index, but when I write back to parquet, the memory consumption grows and grows. When I write to parquet without having set the index, the memory reaches a steady state as the individual parquet files are written, which is what I would expect. It feels like a bug that at the _parquet writing_ phase the memory isn't reclaimed as the writes to disk complete. @DavidK – schuess Dec 09 '20 at 04:31

0 Answers