I have a dataset that is larger than memory and I need to process it. I am not experienced with this topic, so any pointers would help.
I have mostly figured out how to load the raw data in chunks, but I also need to process it and save the results, which are likely to be larger than memory as well.
I have seen that pandas, NumPy and Python all support some form of memmap,
but I don't understand exactly how to use it.
I expected memmap to be an abstraction that lets me use my disk the way I use my RAM, so that I could work with an object saved on disk as if it were a normal Python/NumPy/etc. object. That isn't working for me at all:
>>> import numpy as np
>>> # Create file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it; hopefully these changes will be reflected in the file on disk?
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
So it seems the resize isn't being saved to disk, even after calling flush().
Any suggestions?
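For comparison, here is a minimal sketch of the workflow described above. It assumes the final length of the result is known before writing. The traceback suggests the header written at creation time recorded `shape: None`, so this sketch passes `shape` and `dtype` to `open_memmap` up front instead of resizing afterwards. The helper name `write_in_chunks`, the file name `results.npy` and the `np.arange` stand-in for real processing are all hypothetical.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def write_in_chunks(path, total_rows, chunk_rows):
    """Create a .npy file of a fixed, known length and fill it chunk by chunk."""
    # Passing shape and dtype here lets the .npy header describe the full array.
    out = open_memmap(path, mode="w+", dtype=np.float64, shape=(total_rows,))
    for start in range(0, total_rows, chunk_rows):
        stop = min(start + chunk_rows, total_rows)
        # Placeholder for real processing: only this slice is held in RAM.
        out[start:stop] = np.arange(start, stop, dtype=np.float64)
    out.flush()  # push pending writes to disk
    del out      # drop the reference to the mapping


# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "results.npy")
write_in_chunks(path, total_rows=10, chunk_rows=4)

# Reopen read-only; the header now carries a valid shape.
reopened = open_memmap(path, mode="r")
print(reopened.shape, reopened[:3])
```

If the final length is not known in advance, this approach does not apply as written; one would need to over-allocate or use a format that supports appending, which is outside what this sketch shows.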