
I have a data set that's larger than memory and I need to process it. I am not experienced in this area, so any direction would help.

I mostly figured out how to load the raw data in chunks, but I need to process it and save the results, which are likely to also be larger than memory. I have seen that pandas, numpy, and Python all support some form of memmap, but I don't exactly understand how to go about handling it. I expected an abstraction that lets me use my disk the way I use my RAM, interfacing with the object saved on disk like a normal python/numpy/etc. object when using memmap... but that isn't working for me at all.

>>> import numpy as np
>>> # Create a file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it and hopefully these changes will be reflected in the file on disk?
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None

So the resize isn't being saved to disk.

Any suggestions?

Yorai Levi
    You might want to look into dask, which was designed for data that does not fit into memory. It has a pandas-like interface. https://dask.org/ – jkr May 04 '22 at 02:43
  • Thanks @jakub! I looked at dask and implemented some solution! However I am struggling to use my compressed files and parallelize it. I considering migrating my entire dataset into a different format or maybe even a database as dask has been proving difficult in that area. – Yorai Levi May 13 '22 at 13:39
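
For reference, a minimal sketch of the chunked-processing approach dask suggests; the file paths and the column name value are placeholders for whatever your data actually looks like:

import dask.dataframe as dd

# Read many CSV chunks as one logical (lazy) dataframe; nothing is loaded yet.
df = dd.read_csv("data/part-*.csv")

# Example transformation; dask applies it chunk by chunk.
df["value_squared"] = df["value"] ** 2

# Write the results partition by partition, so they never need to fit in memory.
df.to_parquet("results/", write_index=False)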

1 Answer


np.require(..., requirements=['O']) makes a copy of the memmap array, since a memmap never "owns" its data. That copy is no longer backed by the file, so writing to it and calling .flush() doesn't touch anything on disk. Also, according to the open_memmap() docs, you have to specify the shape when you open a file for writing. Otherwise it writes "None" as the shape in the header, which is what makes the open_memmap() call for y fail.
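
Dropping np.require() and passing the shape up front should behave the way you expected; a minimal sketch, assuming you know the final size (here 10 float64 values) in advance:

import numpy as np

# Create the file with an explicit shape so a valid .npy header is written.
x = np.lib.format.open_memmap('bla.npy', mode='w+', dtype=np.float64, shape=(10,))
x[:] = np.arange(10)   # writes go straight to the mapped file
x.flush()              # make sure everything is on disk
del x                  # close the memmap

# Reopening works now, because the header contains a real shape.
y = np.lib.format.open_memmap('bla.npy', mode='r+')
print(y)   # memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])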

It looks like memmap arrays don't support resizing with .resize() (see numpy issue), but there's a workaround in this SO answer if you need that.
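
If you don't know the final size up front, a simple (if not the cheapest) pattern is to write a new, larger file and copy the existing data over instead of resizing in place. A rough sketch, assuming a 1-D array and a new size at least as large as the old one (the helper name is mine):

import os
import numpy as np

def grow_memmap(filename, new_shape):
    # Open the existing array read-only and a larger temporary file for writing.
    old = np.lib.format.open_memmap(filename, mode='r')
    new = np.lib.format.open_memmap(filename + '.tmp', mode='w+',
                                    dtype=old.dtype, shape=new_shape)
    new[:old.shape[0]] = old[:]   # copy existing values; the rest stays zero
    new.flush()
    del old, new                  # close both maps before swapping the files
    os.replace(filename + '.tmp', filename)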

yut23