5

There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.

QUESTION:

I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each dimension in the array contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in RAM is not suitable, since it fills up the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.

I initially thought that numpy.memmap is the way to go, but when I issue a command like

mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))

the RAM fills up and the machine slows to a halt.

After poking around a bit, it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. It was also hard to find a simple example in the docs or elsewhere that illustrates this common use case.

If anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do it, please let me know! Any references to examples are appreciated!

Fred Foo
KartMan
  • It would be fair to state a few words on *"...do some processing..."*, as **that decides** the feasible approach more than the static sizes do. If *that* computation strategy allows, there might be ways to introduce a viable process pipelining / segmentation / vectorisation / MapReduce. Thanks for your kind re-consideration. – user3666197 Oct 06 '14 at 12:16
  • The processing is usually basic computational routines, e.g. low-pass filtering the numerical values in the smaller numpy arrays and then putting them in the larger array. The other operations will be of similar complexity. – KartMan Oct 06 '14 at 12:35
  • Good to hear you stay away from SumProd calculus, convolutions and other forms of backward-stepping / forward-stepping dependencies. This then reduces your issue to finding an appropriate "representation" of the matrix data that serves your needs fast and scales roughly linearly as it grows bigger. – user3666197 Oct 06 '14 at 13:07
  • I think the question you need to ask yourself is what you would like to do with the big array once you have it stored on disk. I do not understand what you mean by "each dimension in the array contains...", because from your example I conclude that you would like a big 2D array. From what I do understand, PyTables might be suited. For a quickstart to PyTables check out http://pytables.github.io/usersguide/tutorials.html – Ben K. Nov 04 '14 at 13:23

2 Answers

4

That's weird. np.memmap should work. I've been using it with 250 GB of data on a machine with 12 GB of RAM without problems.

Does the system really run out of memory at the very moment the memmap file is created? Or does it happen further along in the code? If it happens at file creation, I really don't know what the problem would be.

When I started using memmap I made some mistakes that led to running out of memory. For me, something like the code below should work:

import numpy as np

# mmapFile, smallarray_size, number_of_arrays and list_of_files are defined elsewhere
mmapData = np.memmap(mmapFile, mode='w+', shape=(smallarray_size, number_of_arrays), dtype='float64')

for k in range(number_of_arrays):
  smallarray = np.fromfile(list_of_files[k]) # list_of_files is the list of file names
  smallarray = do_something_with_array(smallarray)
  mmapData[:, k] = smallarray # only one column is held in RAM at a time

It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.

P.S.: Be aware that the default dtypes of memmap (uint8) and fromfile (float64) are different!
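
If memory use still creeps up while the loop runs, flushing the memmap explicitly (or simply letting it go out of scope when you're done) is a cheap thing to try. A minimal sketch, assuming the mmapData object from the snippet above:

mmapData.flush()  # push dirty pages back to the file on disk
del mmapData      # deleting the object also flushes before releasing it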

favba
  • Agreed - the whole point of memmap is that it doesn't load much data into RAM and keeps it on the disk. Sure a cache is also built - but its size is kept in check. – J.J Nov 09 '15 at 22:34
  • Also for me, using code like this will fill up RAM, as stated in the documentation: "Deletion flushes memory changes to disk before removing the object" – perhaps that's why? – CodeNoob Jan 14 '21 at 10:41
0

HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
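
For plain arrays, a minimal h5py sketch might look something like the following (the file name 'big.hdf5' and the names smallarray_size, number_of_arrays, list_of_files and do_something_with_array are placeholders carried over from the question and the other answer, not anything h5py prescribes):

import h5py
import numpy as np

with h5py.File('big.hdf5', 'w') as f:
    dset = f.create_dataset('data', shape=(smallarray_size, number_of_arrays),
                            dtype='float64', chunks=(smallarray_size, 1))  # one chunk per column
    for k, fname in enumerate(list_of_files):
        small = do_something_with_array(np.fromfile(fname))  # ~10 MB at a time
        dset[:, k] = small  # HDF5 writes this chunk to disk; RAM usage stays small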

There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see the docs on stacking).
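
A rough dask.array sketch of that stacking idea (load_chunk, list_of_files and smallarray_size are assumed placeholders, and da.to_hdf5 needs h5py installed):

import numpy as np
import dask.array as da
from dask import delayed

def load_chunk(fname):
    return np.fromfile(fname)  # one ~10 MB array per file

lazy = [da.from_delayed(delayed(load_chunk)(f), shape=(smallarray_size,), dtype='float64')
        for f in list_of_files]
big = da.stack(lazy, axis=1)          # lazy array of shape (smallarray_size, number_of_files)
da.to_hdf5('big.hdf5', '/data', big)  # streams chunk by chunk to disk, never all in RAM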

MRocklin