I have a dask dataframe of around 100GB with 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM, running a local Dask cluster.

I converted the dataframe to 150 partitions (700MB each). However, my simple set_index() operation fails with the error "95% memory reached":

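    import dask.dataframe as dd
    # Read only 2 of the 4 columns, then index by 'lng' and write the result out.
    g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])
    g.set_index('lng').to_parquet('output/geodata_vec_lng.parq', write_index=True)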

I tried:

  • 1 worker 4 threads. 55 GB assigned RAM
  • 1 worker 2 threads. 55 GB assigned RAM
  • 1 worker 1 thread. 55 GB assigned RAM
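
For reference, the three configurations above could be written down roughly like this. It is only a sketch: the question does not show how the cluster was started, so the use of `LocalCluster`, the helper names `cluster_kwargs` and `start_client`, and the `"55GB"` string are all assumptions.

```python
def cluster_kwargs(threads, memory_limit="55GB"):
    """Keyword arguments for one worker with `threads` threads.

    Assumption: the "55 GB assigned RAM" maps to LocalCluster's
    per-worker `memory_limit` argument.
    """
    return {
        "n_workers": 1,
        "threads_per_worker": threads,
        "memory_limit": memory_limit,
    }


def start_client(threads, memory_limit="55GB"):
    """Start a local cluster and return a client connected to it."""
    # Imported lazily so the helpers above work without distributed installed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(**cluster_kwargs(threads, memory_limit))
    return Client(cluster)


# The three setups tried in the question: 4, 2 and 1 threads.
CONFIGS = [cluster_kwargs(t) for t in (4, 2, 1)]
```

Calling `start_client(4)` before running the `set_index` line would correspond to the first bullet.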

If I make the partitions smaller I get exponentially more shuffling. 100GB is not large. What am I doing wrong?
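
One thing worth checking is whether the 700MB per partition is the on-disk size or the in-memory size, since parquet files are usually compressed and the two can differ. A minimal sketch, assuming `g` is the dask dataframe from the snippet above; the helper names are made up for illustration, and the extrapolation assumes all partitions are roughly the same size.

```python
def partition_nbytes(ddf, i=0):
    """Load only partition `i` and return its in-memory size in bytes.

    `ddf` is assumed to be a dask DataFrame. deep=True also counts the
    Python objects behind object (e.g. str) columns.
    """
    part = ddf.get_partition(i)
    return int(part.memory_usage(deep=True).sum().compute())


def extrapolate_total_nbytes(sample_nbytes, npartitions):
    """Rough total size, assuming every partition is about as big as the sample."""
    return sample_nbytes * npartitions


def to_gib(nbytes):
    """Convert bytes to GiB."""
    return nbytes / 2**30


# Hypothetical usage with the dataframe from the question:
# one = partition_nbytes(g)
# print(to_gib(one), to_gib(extrapolate_total_nbytes(one, g.npartitions)))
```

If the in-memory figure turns out to be much larger than 700MB, that would at least explain why even a single thread runs into the memory limit.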

user670186
  • Am I missing something, or do I need to decrease partition sizes until it works? Very frustrating experience :( It could also be that I am reading from a bigger dataframe but selecting just 2 columns out of 4. Would I have to export the dataframe with 2 columns first? – user670186 Dec 02 '19 at 14:33
  • What is the partition size of the original data? Is 100GB the on-disk or in-memory size? What are the types of your four columns? – mdurant Dec 02 '19 at 16:51
  • The columns are int32, int32, float64, str. Each partition file on disk is 700-900MB. How can I see how big they are in memory? Also, I am not loading all 4 columns, only 2 of them via parquet. – user670186 Dec 02 '19 at 22:28
  • I've encountered a similar problem when using `.set_index`: it terminates the worker several times and ends with a `MemoryError`. What could be causing the problem? @mdurant – Matthew Son Mar 11 '20 at 11:42
  • I have the same issue with a 50 GB on-disk dataset and ~500GB of available RAM. Still not enough. Let me see if I can create a minimal example and then perhaps reopen [this related gh issue](https://github.com/dask/dask/issues/2983). – kuropan Apr 17 '21 at 09:24
  • You should definitely make sure you have `dask > 2021.3.0`, see this [issue](https://github.com/dask/dask/issues/7259) and [SO](https://stackoverflow.com/questions/64903267/) – kuropan Apr 17 '21 at 10:07
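
A quick way to check the version condition from the last comment might look like the sketch below. The helper names are hypothetical, and the parsing only looks at the first three numeric components of the version string, which is an assumption about how the version is formatted.

```python
import re


def version_tuple(version):
    """Turn '2021.04.0' into (2021, 4, 0), ignoring suffixes like '+12.gabc'."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer(installed, minimum="2021.3.0"):
    """True if `installed` is strictly newer than `minimum`."""
    return version_tuple(installed) > version_tuple(minimum)


def installed_dask_is_newer(minimum="2021.3.0"):
    """Check the dask version installed in the current environment."""
    import dask  # imported lazily so the helpers above work without dask

    return is_newer(dask.__version__, minimum)
```

Running `installed_dask_is_newer()` in the environment that executes the `set_index` call shows whether an upgrade is needed.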

0 Answers