I have a Dask dataframe of around 100 GB and 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64 GB of RAM, running a local Dask cluster.
I repartitioned the dataframe into 150 partitions (roughly 700 MB each). However, my simple set_index() operation fails with the error "95% memory reached":
import dask.dataframe as dd

g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])  # geodata: path to the Parquet dataset
g.set_index('lng').to_parquet('output/geodata_vec_lng.parq', write_index=True)
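One variant I have been considering but have not verified (a sketch, assuming the rough min/max of lng is known): passing explicit divisions to set_index so that Dask does not have to sample quantiles to pick partition boundaries. The boundary values below are made up for illustration.

import numpy as np
import dask.dataframe as dd

# Hypothetical partition boundaries for lng -- the -180..180 range and the
# 150-partition count are assumptions, not measured from my data.
divisions = list(np.linspace(-180, 180, num=151))  # 151 boundaries -> 150 partitions

g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])
g = g.set_index('lng', divisions=divisions)  # skips the quantile-sampling pass
g.to_parquet('output/geodata_vec_lng.parq', write_index=True)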
I tried the following worker configurations (see the cluster setup sketch after this list):
- 1 worker, 4 threads, 55 GB RAM assigned
- 1 worker, 2 threads, 55 GB RAM assigned
- 1 worker, 1 thread, 55 GB RAM assigned
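For reference, this is roughly how I start the cluster for the first configuration; it is a sketch, and only n_workers / threads_per_worker change between the runs listed above.

from dask.distributed import Client, LocalCluster

# 1 worker, 4 threads, 55 GB memory limit (first configuration above)
cluster = LocalCluster(n_workers=1, threads_per_worker=4, memory_limit='55GB')
client = Client(cluster)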
If I make the partitions smaller, I get far more shuffling. 100 GB is not that large. What am I doing wrong?
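For what it's worth, my understanding of where the 95% figure comes from (a sketch of the distributed worker memory thresholds; the fractions shown are what I believe the defaults to be, expressed relative to the worker's memory_limit):

import dask

# Worker memory thresholds as fractions of memory_limit
# (to my understanding, these are the library defaults).
dask.config.set({
    'distributed.worker.memory.target': 0.60,     # start spilling to disk
    'distributed.worker.memory.spill': 0.70,      # spill more aggressively
    'distributed.worker.memory.pause': 0.80,      # pause new task execution
    'distributed.worker.memory.terminate': 0.95,  # kill the worker -- the 95% error
})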