5

Overall goal: I want to train a PyTorch model on a data set that does not fit into memory.

Now forget that I mentioned PyTorch; what it boils down to is reading and writing a large file out of core or memory-mapped.

I found a lot of libraries, but I couldn't find a single one that allows me to do multi-threaded sequential reads and writes. What I want is to have multiple threads append to the file/dataframe (the order does not matter, since it will be shuffled for the downstream application anyway). When reading, I only need sequential access (no slicing, no indexing), but again it should be possible to feed multiple threads.

I found/came up with the following solutions:

  • csv: Not an option, because storing floats leads to precision loss (and handling encoding and escaping is horrible).
  • numpy.memmap: You need to know the size of the array in advance, both for reading and writing; appending seems non-trivial (see the sketch after this list).
  • dask: I can't find a way to append to a dataframe; it always creates a new one when appending, and the new dataframe does not seem to be file-backed. This looks good for reading, but creating a new out of core dataframe is not documented.
  • xarray: Again no documentation on how to write to a file-backed dataframe; instead the documentation states: "It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched." So it seems not possible?
  • joblib: Same story, reading yes, iterative writing no.
  • blaze: Also no row appending
  • vaex: No row appending. Why‽
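
To illustrate the numpy.memmap point above: the full array shape has to be declared when the map is created, so there is no natural append operation. A minimal sketch (the file name, dtype and shape are made up for illustration):

import numpy as np

# The full shape must be known up front; mode="w+" creates/overwrites the file.
mm = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(1_000_000, 128))
mm[0] = 0.0      # writes go straight through to the mapped file
mm.flush()

# Reading it back later requires passing the same dtype and shape again.
mm_read = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(1_000_000, 128))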

It's great that they all support out of core reading, but I first need to get the data into the specific file format (writing) – what am I missing here?

It looks like multi-threaded writing is a hard problem. But even single-threaded incremental writing combined with multi-threaded reading would already be good, yet there seems to be no library that supports that?

dreamflasher

2 Answers

1

Multi-threaded sequential writes can be error prone. Most systems prefer formats like Parquet, which let them write each chunk of data to a different file.

If you want to do actual parallel sequential writes you'll have to do some sort of locking, and you're probably on your own as far as the larger all-in-one systems are concerned.
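
Not an answer-specific API, just a minimal sketch of the one-file-per-chunk idea, assuming pandas with a pyarrow backend; the out/ directory, the part-* naming and the chunks iterable are made up for illustration:

import concurrent.futures
import os

import pandas as pd

os.makedirs("out", exist_ok=True)

def write_chunk(i, chunk):
  # Each thread writes its own Parquet file, so no locking is needed.
  pd.DataFrame(chunk).to_parquet(f"out/part-{i:05d}.parquet")

# chunks stands for whatever iterable of record blocks the producers generate.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
  futures = [pool.submit(write_chunk, i, c) for i, c in enumerate(chunks)]
  for f in concurrent.futures.as_completed(futures):
    f.result()  # re-raise any exception from a worker

An out-of-core reader such as dask.dataframe.read_parquet("out/") can then consume the whole directory in chunks.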

MRocklin
  • Well, multi-threaded write would be nice to have. But there doesn't even seem to be a simple out of core multi-threaded read. – dreamflasher Aug 15 '19 at 21:28
  • 1
    Is there anything wrong with read/to_parquet from a library like Dask or Spark? They both are happy to read and write parquet data in chunks in parallel. – MRocklin Aug 16 '19 at 03:07
  • Well, see my comment to dask above "I can't find a way to append to a dataframe, it always creates a new one when appending, also a new dataframe seems not to be file-backed. This looks good for reading, but creating a new out of core dataframe is not documented." – dreamflasher Aug 16 '19 at 10:58
  • I can't find a way to write sequentially to a dask dataframe. – dreamflasher Aug 16 '19 at 10:59
1

I finally found a working solution with pyarrow.

Incremental writing:

import pandas as pd
import pyarrow as pa

result = []
writer = None
# df is the source dataframe and process_row the per-row transformation
# from the surrounding pipeline.
for _, row in df.iterrows():
  result.append(process_row(row))
  if len(result) >= 10000:
    batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
    if writer is None:
      # Create the writer lazily, once the schema of the first batch is known.
      writer = pa.RecordBatchFileWriter('filename.arrow', batch.schema)
    writer.write(batch)
    result = []
# Flush the remaining rows that did not fill a complete batch.
if result:
  batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
  writer.write(batch)
writer.close()

Read all into one dataframe:

pa.RecordBatchFileReader("filename.arrow").read_pandas()

Incremental reading:

rb = pa.RecordBatchFileReader("filename.arrow")
for i in range(rb.num_record_batches):
  b = rb.get_batch(i)  # each b is a pyarrow.RecordBatch; b.to_pandas() gives a DataFrame
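
A hedged sketch of feeding several reader threads from such a file (the worker count and the pass-through consumer are placeholders; each worker opens its own reader and pulls batches by index, so no shared cursor is needed):

import concurrent.futures
import pyarrow as pa

def load_batch(i):
  # Each worker opens its own reader and fetches one batch by index.
  reader = pa.RecordBatchFileReader("filename.arrow")
  return reader.get_batch(i).to_pandas()

num_batches = pa.RecordBatchFileReader("filename.arrow").num_record_batches
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
  for chunk in pool.map(load_batch, range(num_batches)):
    pass  # hand each chunk to the downstream consumer here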
dreamflasher
  • the incremental writing example you posted does not appear to be multi-threaded writing (unless I'm missing something?). Were you able to adapt it to be multi-threaded? – Joe Jun 18 '20 at 13:21
  • Although I believe in 2020 it should be possible to write multi-threaded (we had databases with that feature in the 90s?), it looks like multi-threaded write is not possible with any library. At least we get multi-threaded read – but yeah, there's a lot of room for improvement in all of these libraries. – dreamflasher Jun 18 '20 at 17:12