
I am trying to import a 30 GB CSV file and convert it to HDF5 with vaex, using the following code. I read that setting `convert=True` should prevent an out-of-memory error, but I still get the error after nearly 30 minutes of trying to load the data.

import vaex

vaex.from_csv("combined.csv", convert=True, chunk_size=5_000_000)

I get the following error:

MemoryError: Unable to allocate 26.1 GiB for an array with shape (701, 5000000) and data type object

Looking at the vaex FAQ and documentation, this seems like the best (and only) way to deal with such a large CSV file. Am I missing something, or is there a better way to do this?

    I assume you don't have 26 GiB of available system RAM. Have you tried reducing `chunk_size` to a smaller number? Using 1_000_000 _should_ reduce the chunk memory footprint to 2.61 GiB. Adjust according to your system RAM. The `object dtype` is interesting. Any idea what that is? If changing chunk_size doesn't work, you could use `np.genfromtxt()` to read the CSV incrementally and `h5py` to create the HDF5 file. There are parameters to control # of rows read. So, it is slightly more complicated b/c you have to manage that. – kcw78 Jun 22 '22 at 15:30
  • You seem to have a large number of columns. A column takes up much less space in CSV than in HDF5 or Arrow, because a value such as 5.2 (a random float example) is by default converted to a float64 in the HDF5 file. So do try a much lower number for `chunk_size`, or load only the columns you need (see the second sketch below). – Joco Jun 22 '22 at 22:43
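
A minimal sketch of the `np.genfromtxt()`/`h5py` fallback kcw78 describes. It assumes all 701 columns are numeric (the `object` dtype in the error suggests some may not be) and that the CSV has a single header row; the output file name, dataset name, and block size are placeholders to adjust.

import itertools
import numpy as np
import h5py

csv_path = "combined.csv"
n_cols = 701              # number of columns reported in the error
rows_per_block = 100_000  # tune to your available RAM

with open(csv_path) as f, h5py.File("combined.h5", "w") as h5f:
    next(f)  # skip the header row
    # Resizable dataset so blocks can be appended one at a time.
    dset = h5f.create_dataset(
        "data",
        shape=(0, n_cols),
        maxshape=(None, n_cols),
        chunks=(1_000, n_cols),
        dtype="float64",
    )
    while True:
        # Pull the next block of raw lines; genfromtxt accepts a list of strings.
        lines = list(itertools.islice(f, rows_per_block))
        if not lines:
            break
        block = np.atleast_2d(np.genfromtxt(lines, delimiter=","))
        dset.resize(dset.shape[0] + block.shape[0], axis=0)
        dset[-block.shape[0]:] = block

This keeps peak memory bounded by the block size, at the cost of speed: `np.genfromtxt()` is slow on a 30 GB file, but it never needs the whole table in RAM.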
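
And a sketch of both comments' suggestions applied to the original `vaex.from_csv()` call: a much smaller `chunk_size`, plus loading only the columns that are needed. Per the vaex docs, extra keyword arguments are passed through to `pandas.read_csv`, so `usecols` should work; the column names below are placeholders.

import vaex

df = vaex.from_csv(
    "combined.csv",
    convert=True,        # write the HDF5 conversion to disk chunk by chunk
    chunk_size=100_000,  # smaller chunks -> smaller peak memory
    usecols=["col_a", "col_b", "col_c"],  # hypothetical subset of the 701 columns
)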

0 Answers