
I have a data set that's larger than memory and I need to process it. I am not experienced in this area, so any direction would help.

I mostly figured out how to load the raw data in chunks, but I need to process it and save the results, which are likely to also be larger than memory. I have seen that pandas, numpy, and Python all support some form of memmap, but I don't exactly understand how to go about handling it. I expected an abstraction that lets me use my disk the way I use my RAM, interfacing with the object saved on disk like a normal python/numpy/etc. object when using memmap... but that isn't working for me at all.

>>> import numpy as np
>>> # Create a file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it and hopefully these changes will be reflected in the file on disk?
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None

So the resize isn't being saved to disk.

Any suggestions?

Yorai Levi
    You might want to look into dask, which was designed for data that does not fit into memory. It has a pandas-like interface. https://dask.org/ – jkr May 04 '22 at 02:43
  • Thanks @jakub! I looked at dask and implemented some solution! However I am struggling to use my compressed files and parallelize it. I considering migrating my entire dataset into a different format or maybe even a database as dask has been proving difficult in that area. – Yorai Levi May 13 '22 at 13:39
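
For reference, a minimal sketch of the chunked-processing approach dask suggests; the file paths and the column name value are placeholders for whatever your data actually looks like:

import dask.dataframe as dd

# Read many CSV chunks as one logical (lazy) dataframe; nothing is loaded yet.
df = dd.read_csv("data/part-*.csv")

# Example transformation; dask applies it chunk by chunk.
df["value_squared"] = df["value"] ** 2

# Write the results partition by partition, so they never need to fit in memory.
df.to_parquet("results/", write_index=False)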

1 Answer


np.require(..., requirements=['O']) makes a copy of the memmap array, since a memmap never "owns" its data. That copy is no longer backed by the file, so writing to it and calling .flush() doesn't touch anything on disk. Also, according to the open_memmap() docs, you have to specify the shape when you open a file for writing. Otherwise it writes "None" as the shape in the header, which is what makes the open_memmap() call for y fail.
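
Dropping np.require() and passing the shape up front should behave the way you expected; a minimal sketch, assuming you know the final size (here 10 float64 values) in advance:

import numpy as np

# Create the file with an explicit shape so a valid .npy header is written.
x = np.lib.format.open_memmap('bla.npy', mode='w+', dtype=np.float64, shape=(10,))
x[:] = np.arange(10)   # writes go straight to the mapped file
x.flush()              # make sure everything is on disk
del x                  # close the memmap

# Reopening works now, because the header contains a real shape.
y = np.lib.format.open_memmap('bla.npy', mode='r+')
print(y)   # memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])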

It looks like memmap arrays don't support resizing with .resize() (see numpy issue), but there's a workaround in this SO answer if you need that.
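
If you don't know the final size up front, a simple (if not the cheapest) pattern is to write a new, larger file and copy the existing data over instead of resizing in place. A rough sketch, assuming a 1-D array and a new size at least as large as the old one (the helper name is mine):

import os
import numpy as np

def grow_memmap(filename, new_shape):
    # Open the existing array read-only and a larger temporary file for writing.
    old = np.lib.format.open_memmap(filename, mode='r')
    new = np.lib.format.open_memmap(filename + '.tmp', mode='w+',
                                    dtype=old.dtype, shape=new_shape)
    new[:old.shape[0]] = old[:]   # copy existing values; the rest stays zero
    new.flush()
    del old, new                  # close both maps before swapping the files
    os.replace(filename + '.tmp', filename)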

yut23