
I have a PC with an NVIDIA 3090 and 32 GB of RAM.

I am loading a 9 GB CSV dataset with millions of rows and 5 columns.

Any time I run `compute()`, it fails and throws `std::bad_alloc: out_of_memory: CUDA error`.

How can I handle this data on my PC so I can perform all the statistical operations, plots, ML, etc.?

jack
  • Does this happen during the load/read time (unlikely) or after some processing? If the latter, it would help to know which operations you are performing. – SultanOrazbayev May 15 '22 at 13:33
  • Note that `compute()` loads the result fully into memory, so the out-of-memory issue could occur at a bottleneck during the workflow or just in computing the final result. 32 GB isn't a ton of room for a 9 GB dataset in an ML pipeline - all you need is a dimensionality expansion or a couple of copies and you're done, so the diagnosis is very dependent on your chunking scheme and your workflow. Not much else we can do without seeing your code. – Michael Delgado May 15 '22 at 23:37

1 Answer


It sounds like you're using a single GPU and relying on dask_cudf to run larger-than-GPU manipulations. As Michael said, `compute()` returns the result as a cudf DataFrame, which must fit on the GPU alongside dask_cudf's working space. You can use `.persist()` instead. Coiled has a great blog post on this: https://coiled.io/blog/dask-persist-dataframe/
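
As a minimal sketch of that pattern (the file name `data.csv`, the column name `col_a`, and the spill limit are assumptions for illustration, not from your setup):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One worker per visible GPU; device_memory_limit lets Dask spill
# device memory to host RAM instead of raising std::bad_alloc.
cluster = LocalCUDACluster(device_memory_limit="20GB")
client = Client(cluster)

# Lazy, partitioned read - nothing is loaded yet
ddf = dask_cudf.read_csv("data.csv")

# persist() executes the graph but keeps the result distributed
# across the worker(s), instead of pulling one big cudf DataFrame
# back the way compute() does.
ddf = ddf.persist()

# Reductions produce small results, so compute() on them is safe.
print(ddf["col_a"].mean().compute())
```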

Another option is to use dask-sql with RAPIDS and convert your data from CSV to Parquet. That lets you quickly and easily do chunked, out-of-core processing of your data.
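
A rough sketch of that workflow (the file paths, table name, and column names below are hypothetical):

```python
import dask_cudf
from dask_sql import Context

# One-time conversion: CSV -> partitioned, columnar Parquet
dask_cudf.read_csv("data.csv").to_parquet("data_parquet/")

# Register the Parquet data as a SQL table backed by dask_cudf
c = Context()
c.create_table("my_table", dask_cudf.read_parquet("data_parquet/"))

# dask-sql builds a lazy dask_cudf DataFrame; only the partitions and
# columns the query needs are read, so it works out of core.
result = c.sql("""
    SELECT col_a, AVG(col_b) AS avg_b
    FROM my_table
    GROUP BY col_a
""")

# The aggregate is small, so materializing it is safe.
print(result.compute())
```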

TaureanDyerNV