I have a rather large HDF5 file generated by PyTables that I am attempting to read on a cluster. I am running into a problem with NumPy when I read in an individual chunk. Here is an example:
The total shape of the array within the HDF5 file is:
In [13]: data.shape
Out[13]: (21933063, 800, 3)
Each entry in this array is an np.float64.
I am having each node read a slice of size (21933063, 10, 3). Unfortunately, NumPy seems to be unable to read all 21 million subslices at once. I have tried to do this sequentially by dividing the slice along its first axis into 10 chunks of size (2193306, 10, 3) and then using the following reduce to get things working:
In [8]: a = reduce(lambda x,y : np.append(x,y,axis=0), [np.array(data[i* \
chunksize: (i+1)*chunksize,:10],dtype=np.float64) for i in xrange(k)])
In [9]:
where 1 <= k <= 10 and chunksize = 2193306. This code works for k <= 9; otherwise I get the following:
In [8]: a = reduce(lambda x,y : np.append(x,y,axis=0), [np.array(data[i* \
chunksize: (i+1)*chunksize,:10],dtype=np.float64) for i in xrange(k)])
Floating point exception
home@mybox 00:00:00 ~
$
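For context, here is a minimal, self-contained version of what I am running. The file name and node path below are placeholders (the real file lives on the cluster), and I am assuming the PyTables 3.x open_file API (older versions spell it openFile):

import numpy as np
import tables  # PyTables

h5file = tables.open_file('data.h5', mode='r')  # placeholder file name
data = h5file.root.data                         # placeholder node; shape (21933063, 800, 3), float64

chunksize = 2193306
k = 10

# Reading everything at once, e.g. data[:, :10], is what fails for me,
# so I build the array from (chunksize, 10, 3) pieces instead:
a = reduce(lambda x, y: np.append(x, y, axis=0),
           [np.array(data[i*chunksize:(i+1)*chunksize, :10], dtype=np.float64)
            for i in xrange(k)])

h5file.close()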
I tried using Valgrind's memcheck tool to figure out what is going on, and it seems as if PyTables is the culprit. The two main files that show up in the trace are libhdf5.so.6 and a file related to blosc.
Also, note that if I have k=8, I get:
In [12]: a.shape
Out[12]: (17546448, 10, 3)
But if I then append the next subslice, I get:
In [14]: a = np.append(a,np.array(data[8*chunksize:9*chunksize,:10], \
dtype=np.float64))
In [15]: a.shape
Out[15]: (592192620,)
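I suspect that last shape comes from calling np.append without the axis argument, since NumPy then flattens both inputs (and indeed (17546448 + 2193306) * 10 * 3 = 592192620). A tiny sketch with made-up shapes to illustrate:

import numpy as np

x = np.zeros((4, 10, 3))
y = np.zeros((2, 10, 3))

print np.append(x, y).shape          # (180,)  -- both inputs are flattened
print np.append(x, y, axis=0).shape  # (6, 10, 3)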
The main problem, though, is the floating point exception above. Does anyone have any ideas about what to do? Thanks!