There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each dimension in the array contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in the RAM is not suitable, since it floods up the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
the RAM floods and the machine slows to a halt.
After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the doc or elsewhere which illustrates this common use-case.
IF anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any refs. to examples appreciated!