
I'm getting a memory error when trying to pickle a large numpy array for a deep learning problem (shape: (7451, 1500, 1500, 1)). That said, I've seen a few posts on klepto and read the docs, but I'm not sure how to actually use klepto to save as a pickle file.

Can anyone break it down to a fifth grade level for me?

This is throwing the memory error:

import pickle

pickle_out = open("X.pickle", "wb")
pickle.dumps(X, pickle_out)
pickle_out.close()
Jordan
  • The memory error you're getting is probably because the object you're trying to pickle is too large to hold in memory. I've only encountered this when parsing XML files (loading the whole file into memory and then trying to parse it). I solved it by parsing iteratively. Pickle has `dump` and `dumps` methods... Can you use `dumps` and write that to the file iteratively? – ron_g May 22 '19 at 15:52
  • Hi @rong. I tried dumps and got the following error: `TypeError: an integer is required (got type _io.BufferedWriter)` I added the code I used above. – Jordan May 22 '19 at 16:36

2 Answers


I'm the klepto author. If you are indeed just trying to pickle a numpy array, the best approach is to just use the built-in dump method on the array (unless the array is too large to fit within memory constraints).

Almost any code that does serialization uses one of the serialization packages (dill, cloudpickle, or pickle), unless there's a serialization method built into the object itself, as there is in numpy. joblib uses cloudpickle, and both cloudpickle and dill utilize the internal serialization that a numpy array itself provides (pickle does not use it, so the serialization bloats and can cause memory failures).

>>> import numpy as np
>>> a = np.random.random((1500,1500,1500,1))
>>> a.dump('foo.pkl')
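The file written by `dump` is just a pickle of the array, so it should read back with `numpy.load` (with `allow_pickle=True`) or `pickle.load`; a quick sketch:

>>> b = np.load('foo.pkl', allow_pickle=True)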

If the above still gives you a memory error, then joblib, klepto, dill, or otherwise really can't help you unless you break up the array into smaller chunks -- or potentially use a dask array (which is designed for large array data). I think your array is large enough that it should cause a memory error (I tested it on my own system) even with the above optimally efficient method, so you'll either need to break the array into chunks, or store it as a dask array.
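As a rough sketch of the chunking route (assuming the array `X` from the question and a hypothetical `chunks/` output directory), you could save slices along the first axis with numpy's own format:

>>> import os
>>> import numpy as np
>>> os.makedirs('chunks', exist_ok=True)
>>> for start in range(0, X.shape[0], 500):  # 500 slices per file; tune to your memory
...     np.save('chunks/X_%05d.npy' % start, X[start:start+500])
...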

To be clear, klepto is intended for large non-array data (like tables or dicts), while dask is intended for large array data.
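To make that distinction concrete, here is a minimal sketch of the kind of usage each targets (the archive and directory names are made up, and the calls are from my reading of the klepto and dask docs, so treat the details as assumptions):

>>> # klepto: a dict-like archive for large non-array objects, stored on disk
>>> from klepto.archives import dir_archive
>>> arch = dir_archive('my_archive', cached=False)  # cached=False writes entries straight to disk
>>> arch['table'] = {'a': 1, 'b': 2}
>>>
>>> # dask: chunked arrays that are processed one block at a time
>>> import dask.array as da
>>> d = da.from_array(X, chunks=(100, 1500, 1500, 1))  # X is the array from the question
>>> da.to_npy_stack('X_stack/', d)  # writes one .npy file per chunk along axis 0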

Another option is to use a numpy.memmap array, which writes the array directly to a file, bypassing memory. These are a bit complex to use, and this is essentially what dask attempts to do for you behind a simpler interface.
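For example, a minimal memmap sketch (the file name `X.dat` and the `float32` dtype are assumptions; use whatever matches your data):

>>> import numpy as np
>>> m = np.memmap('X.dat', dtype='float32', mode='w+', shape=(7451, 1500, 1500, 1))
>>> for start in range(0, X.shape[0], 500):  # copy in pieces rather than all at once
...     m[start:start+500] = X[start:start+500]
...
>>> m.flush()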

Mike McKerns

When I faced a similar problem, I was able to solve it using joblib. You first need to have the sklearn library installed, which can be done, for example, with

pip install sklearn

That is just the basic idea; to better understand how to install it, go to https://scikit-learn.org/stable/install.html. After that, everything is pretty plain, as illustrated in the following code:

from sklearn.externals import joblib  # on newer scikit-learn versions, use `import joblib` instead
import numpy as np
array = np.array([0, 1, 2])  # just an example array; use your own instead

filename = 'array.sav'
joblib.dump(array, filename)

Then, to load your data back when you need to use it:

array = joblib.load(filename, mmap_mode='r')
Igor sharm
  • Hi @Igorsharm. Thank you. Where does this save it to? – Jordan May 22 '19 at 16:38
  • This should not work for the size of the stated array, unless you have a computer that has **a lot** of memory. – Mike McKerns May 22 '19 at 18:41
  • It would save to the current directory in a file named `array.sav` – Mike McKerns May 22 '19 at 20:56
  • @MikeMcKerns this works for some large objects which cannot be saved by pickle even though they can otherwise be worked with; recently I used it to save a 12GB sparse matrix, so it should work for the stated problem too. Anyway, thanks for the explanation of the topic in the other answer, it was useful for me too! – Igor sharm May 23 '19 at 07:24
  • It absolutely works for some large objects that can't be serialized with `pickle`, as it uses the `dump` that is included with `numpy`, and not `pickle`. However, my point was that it will hit a size limit, which (depending on your machine) may be reached by an array of the size the OP gives. – Mike McKerns May 23 '19 at 12:37