I have a Dask dataframe of around 100 GB and 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64 GB of RAM, running a local Dask cluster.
I repartitioned the dataframe into 150 partitions (roughly 700 MB each). However, my simple set_index() operation fails with the error "95% memory reached":
import dask.dataframe as dd

g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])  # geodata: path to the Parquet dataset
g.set_index('lng').to_parquet('output/geodata_vec_lng.parq', write_index=True)
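One variant I have been considering but have not verified (a sketch, assuming the rough min/max of lng is known): passing explicit divisions to set_index so that Dask does not have to sample quantiles to pick partition boundaries. The boundary values below are made up for illustration.

import numpy as np
import dask.dataframe as dd

# Hypothetical partition boundaries for lng -- the -180..180 range and the
# 150-partition count are assumptions, not measured from my data.
divisions = list(np.linspace(-180, 180, num=151))  # 151 boundaries -> 150 partitions

g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])
g = g.set_index('lng', divisions=divisions)  # skips the quantile-sampling pass
g.to_parquet('output/geodata_vec_lng.parq', write_index=True)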
I tried the following worker configurations (see the cluster setup sketch after this list):
- 1 worker, 4 threads, 55 GB RAM assigned
- 1 worker, 2 threads, 55 GB RAM assigned
- 1 worker, 1 thread, 55 GB RAM assigned
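For reference, this is roughly how I start the cluster for the first configuration; it is a sketch, and only n_workers / threads_per_worker change between the runs listed above.

from dask.distributed import Client, LocalCluster

# 1 worker, 4 threads, 55 GB memory limit (first configuration above)
cluster = LocalCluster(n_workers=1, threads_per_worker=4, memory_limit='55GB')
client = Client(cluster)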
If I make the partitions smaller, I get far more shuffling. 100 GB is not that large. What am I doing wrong?
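For what it's worth, my understanding of where the 95% figure comes from (a sketch of the distributed worker memory thresholds; the fractions shown are what I believe the defaults to be, expressed relative to the worker's memory_limit):

import dask

# Worker memory thresholds as fractions of memory_limit
# (to my understanding, these are the library defaults).
dask.config.set({
    'distributed.worker.memory.target': 0.60,     # start spilling to disk
    'distributed.worker.memory.spill': 0.70,      # spill more aggressively
    'distributed.worker.memory.pause': 0.80,      # pause new task execution
    'distributed.worker.memory.terminate': 0.95,  # kill the worker -- the 95% error
})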