I understand that set_index is a really compute-intensive operation in Dask.
The documentation says it is the kind of operation to avoid, and if it is needed, to perform it right after ingesting the data.
I currently have Parquet files that I need to index on one column.
The code is straightforward:
import dask.dataframe as dd

df = dd.read_parquet(input_path)
df = df.set_index(index_column)
df.to_parquet(output_path)
I repartitioned the input files to get about 300 partitions of roughly 1 GB each in memory (snappy-compressed Parquet, so each file is about 70-80 MB on disk).
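For reference, the repartition step looked roughly like this (a minimal sketch; raw_input_path and the exact partition_size value are placeholders, not my real paths or settings):

import dask.dataframe as dd

# Read the raw files and rewrite them with ~1 GB in-memory partitions.
# partition_size targets the in-memory size, not the compressed size on disk.
raw = dd.read_parquet(raw_input_path)
raw = raw.repartition(partition_size="1GB")
raw.to_parquet(input_path)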
I'm using a Dask Kubernetes cluster (workers: 2 CPUs, 16 or 32 GB of memory each), and I was expecting this kind of operation to work even with a low number of workers (adding workers only to speed up the process).
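The cluster is created roughly like this (a minimal sketch assuming the dask_kubernetes operator API; the cluster name, image, worker count, and resource values here are placeholders, and the exact constructor arguments depend on the dask_kubernetes version):

from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

# Placeholder name/image/resources; the real deployment differs.
cluster = KubeCluster(
    name="set-index-job",
    image="ghcr.io/dask/dask:latest",
    n_workers=4,
    resources={
        "requests": {"cpu": "2", "memory": "16Gi"},
        "limits": {"cpu": "2", "memory": "16Gi"},
    },
)
client = Client(cluster)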
I see that memory usage tends towards the high limit and workers get killed. Does the cluster need to be able to hold the full DataFrame in memory across its workers?
What can I do to make this work even with very large DataFrames? How can I debug it if needed (local storage per worker, ...)?