I'm running into a very strange issue when trying to create a rather large HDF5 dataset with h5py.

e.g.

import h5py
import numpy as np

test_h5=h5py.File('test.hdf5','w')

n=3055693983 # fail
n=10000000000 # works
n=40000000000 # fail
n=100000000000 # works
n=20000000000 # fail
n=512 # works

test_h5.create_dataset('matrix', shape=(n,n), dtype=np.int8, compression='gzip', chunks=(256,256))
print(test_h5['matrix'].shape)
a=test_h5['matrix']
a[0:256,0:256]=np.ones((256,256))

The chunk size is (256,256).

If the dataset shape is set to (512,512), everything works fine.

If the shape is set to (100000000000,100000000000), it also works fine...

Ideally I want a dataset of shape (3055693983,3055693983), which fails with the following:

(3055693983, 3055693983)
Traceback (most recent call last):
  File "h5.py", line 16, in <module>
    a[0:256,0:256]=np.ones((256,256))
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2696)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2654)
  File "/home/user/anaconda2/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 618, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2696)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2654)
  File "h5py/h5d.pyx", line 221, in h5py.h5d.DatasetID.write (/home/ilan/minonda/conda-bld/work/h5py/h5d.c:3527)
  File "h5py/_proxy.pyx", line 132, in h5py._proxy.dset_rw (/home/ilan/minonda/conda-bld/work/h5py/_proxy.c:1889)
  File "h5py/_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite (/home/ilan/minonda/conda-bld/work/h5py/_proxy.c:1599)
IOError: Can't prepare for writing data (Can't retrieve number of elements in file dataset)

Setting the shape to a few other sizes produced mixed results: some work, some do not. I thought it might be something simple, like the dataset size not being evenly divisible by the chunk size, but that does not appear to be the issue (see the quick check below).
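
To rule that out, here is the quick arithmetic check (plain Python, nothing h5py-specific; the sizes and outcomes are copied from the snippet above). Sizes that are exactly divisible by 256 both work and fail, so divisibility by the chunk edge clearly is not the deciding factor.

# Check whether divisibility by the chunk edge (256) correlates with failure.
# Sizes and outcomes are copied from the snippet above.
outcomes = {
    3055693983: 'fail',
    10000000000: 'works',
    40000000000: 'fail',
    100000000000: 'works',
    20000000000: 'fail',
    512: 'works',
}

for n, outcome in sorted(outcomes.items()):
    print('n=%d: %s, n %% 256 = %d' % (n, outcome, n % 256))

For example, 20000000000 and 40000000000 divide evenly by 256 and still fail, while 3055693983 (remainder 159) also fails and 10000000000 (evenly divisible) works.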

What am I missing here?

Bryan
  • Did you actually compute the array size in bytes? Or Exabytes? – kakk11 Aug 15 '16 at 19:49
  • I am not sure that is the issue. I am never holding the entire np.array in memory; I only touch the array in slices and load the data block by block. Memory usage can be tuned by altering the number of blocks to work on at any one time. I am now using np.int64 with a size of 2^32 (4294967296) and it is working fine. It must be something internal... It does not appear to be a buffer overflow issue; I can write and then re-extract all of my data just fine. – Bryan Sep 01 '16 at 03:39
  • I asked if you did the actual math; you obviously didn't, so I'll do it for you. You create an NxN matrix with N=1e10, which means you have 1e20 numbers in one array, right? If each number is four bytes, that is 4e20 bytes = 400 exabytes. Current HDF5 does not support more than a 4-exabyte file, if I understand correctly: https://www.hdfgroup.org/HDF5/faq/limits.html So my bet is that it is a buffer overflow and some arrays just get reasonable dimensions by accident (see the back-of-the-envelope numbers below). – kakk11 Sep 01 '16 at 10:13
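
For reference, here is a back-of-the-envelope check of what each of the shapes tried above implies, using 1 byte per element for np.int8 as in the question (the 400-exabyte figure in the comment assumes 4 bytes per element). The comparison against 2**63 is included only as a reference point for where a signed 64-bit element count would overflow, not as a diagnosis.

# Logical size implied by an (n, n) np.int8 dataset, for each n tried above.
# 1 EB (exabyte) = 1e18 bytes; np.int8 is 1 byte per element.
sizes = [512, 3055693983, 10000000000, 20000000000, 40000000000, 100000000000]

for n in sizes:
    elements = n * n            # total number of elements in the (n, n) dataset
    logical_bytes = elements    # 1 byte per element for np.int8
    print('n=%d: %.3g elements, about %.3g EB logical, exceeds 2**63: %s'
          % (n, elements, logical_bytes / 1e18, elements > 2**63))

All of the oversized shapes are far beyond anything that could physically be stored, yet some are accepted and some are not, so the raw logical size alone does not explain which ones fail.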

0 Answers