Overall goal: I want to train a PyTorch model on a data set that does not fit into memory.
Now forget that I mentioned PyTorch; what it boils down to is reading and writing a large file out of core or memory-mapped.
I found a lot of libraries, but not a single one that lets me do multi-threaded sequential reads and writes. What I want is multiple threads that append to the file/dataframe (order does not matter; it will be shuffled for the downstream application anyway). When reading, I only need sequential access (no slicing, no indexing), but again it should be possible to feed multiple threads.
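To make the writing side concrete, here is a minimal sketch of the append pattern I mean, using nothing but a plain binary file of fixed-size float32 records (the file name, record width and thread count are made up):

```python
import threading
import numpy as np

PATH = "samples.bin"   # made-up file name
N_FEATURES = 16        # made-up record width: each row is 16 float32 values

write_lock = threading.Lock()

def append_rows(rows: np.ndarray) -> None:
    """Called from several worker threads; row order does not matter."""
    data = np.ascontiguousarray(rows, dtype=np.float32).tobytes()
    with write_lock:                 # serialize the actual file append
        with open(PATH, "ab") as f:
            f.write(data)

# A few writer threads appending random batches.
threads = [
    threading.Thread(target=append_rows, args=(np.random.rand(100, N_FEATURES),))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This works, but it is exactly the kind of bookkeeping (fixed record size, manual locking, no compression, no metadata) that I hoped a library would take off my hands.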
I found/came up with the following solutions:
csv
: Not an option, because storing floats leads to precision loss (it is also horrible to handle encoding and escaping).

numpy.memmap
: You need to know the size of the array in advance, both for reading and writing; appending seems non-trivial (see the memmap sketch after this list).

dask
: I can't find a way to append to a dataframe; appending always creates a new one, and the new dataframe does not seem to be file-backed. It looks good for reading, but how to create a new out-of-core dataframe is not documented.

xarray
: Again no documentation on how to write to a file-backed dataframe; instead the documentation states: "It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched." So it seems not possible?

joblib
: Same story – reading yes, iterative writing no.

blaze
: Also no row appending.

vaex
: No row appending. Why‽
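For reference, this is what I mean about numpy.memmap needing the shape up front (file name and sizes are made up); as far as I can tell, "appending" boils down to mapping the same file again with a larger shape and tracking the valid row count yourself:

```python
import numpy as np

# Writing: the full shape has to be known when the memmap is created.
mm = np.memmap("train.dat", dtype=np.float32, mode="w+", shape=(1000, 16))
mm[:100] = np.random.rand(100, 16)   # fill the rows you already have
mm.flush()
del mm

# "Appending" apparently means mapping the same file again with a larger
# shape (numpy grows the file in "r+" mode) and remembering yourself how
# many rows are actually valid -- nothing like a simple append().
bigger = np.memmap("train.dat", dtype=np.float32, mode="r+", shape=(2000, 16))
bigger[1000:1100] = np.random.rand(100, 16)
bigger.flush()
```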
It's great that they all support out-of-core reading, but I need to get the data into the specific file format first (writing) – what am I missing here?
Multi-threaded writing looks like a hard problem. But even single-threaded incremental writing combined with multi-threaded reading would already be good enough, yet there seems to be no library that supports that?
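For completeness, this is the fallback pattern I would settle for – a single strictly sequential reader feeding a bounded queue, with several consumer threads pulling chunks (the file name and record width are the same made-up values as in the writer sketch above; the queue size and thread count are arbitrary):

```python
import queue
import threading
import numpy as np

PATH = "samples.bin"     # made-up file name, matching the writer sketch above
N_FEATURES = 16          # made-up record width

def reader(chunk_queue: queue.Queue, chunk_rows: int = 1024) -> None:
    """A single thread reads the file strictly sequentially and feeds a queue."""
    row_bytes = N_FEATURES * 4
    with open(PATH, "rb") as f:
        while True:
            buf = f.read(chunk_rows * row_bytes)
            if not buf:
                break
            chunk_queue.put(np.frombuffer(buf, dtype=np.float32).reshape(-1, N_FEATURES))
    chunk_queue.put(None)  # sentinel: end of file

def consumer(chunk_queue: queue.Queue) -> None:
    """Several of these run in parallel, e.g. to preprocess batches."""
    while True:
        chunk = chunk_queue.get()
        if chunk is None:
            chunk_queue.put(None)  # let the other consumers see the sentinel too
            break
        # ... hand the chunk to the training loop ...

q = queue.Queue(maxsize=8)   # bounded, so reading stays out of core
workers = [threading.Thread(target=consumer, args=(q,)) for _ in range(4)]
for w in workers:
    w.start()
reader(q)
for w in workers:
    w.join()
```

If any of the libraries above (or another one I missed) already provides this writer/reader combination out of core, that is exactly what I am looking for.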