
Let's say I have an HDF5 dataset with maxshape=(None,1000) and chunks=(1,1000).

Then whenever I need to delete a row, I just zero it out (this happens to many rows):

  ds[ix,:] = 0

What is the fastest way to vacuum the zeroed rows and resize the array?


Now let's add a twist. I have a dict that resolves symbols to dataset row indices:

  { name : ds_ix }

What is the fastest way to vacuum and keep the ds_ix values correct?


1 Answer


Did you mean to resize the dataset when you asked about resizing 'the array'? (Also, I assume you meant maxshape=(None,1000).) If so, use the .resize() method. However, if you aren't removing the last row(s), you will have to rearrange the non-zero data and then resize. (And you really don't need to zero out the row(s), since you are going to overwrite them anyway.)
I can think of two approaches to rearrange the data: 1) use slice notation to define FROM and TO indices, or 2) read the dataset into a NumPy array, delete the rows, and copy it back. Both involve disk I/O, so it's not clear which would be faster without testing. It probably doesn't matter for small datasets with only a few deleted rows. I suspect the second method will be better if you plan to delete a lot of rows from large datasets, but benchmark tests are required to confirm.

Note: be careful setting the chunk size. Remember, this controls the I/O block size, and you will be doing a lot of I/O when you move rows. Setting it too small (or too large) can degrade performance. chunks=(1,1000) is probably too small: the recommended chunk size is 10 KiB to 1 MiB, and a (1,1000) float32 chunk is only about 4 KiB.
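
A quick back-of-the-envelope check makes this concrete (the 256 KiB target below is just an illustrative value inside the recommended range):

import numpy as np

n_cols = 1000
itemsize = np.dtype('float32').itemsize   # 4 bytes per value
bytes_per_row = n_cols * itemsize         # 4,000 bytes, ~3.9 KiB
rows_per_chunk = (256 * 1024) // bytes_per_row
print(rows_per_chunk)                     # 65 -> e.g. chunks=(64, 1000), ~250 KiB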

Here are both approaches with a very small dataset.

Create an HDF5 file:

import h5py
import numpy as np

with h5py.File('SO_73353006.h5','w') as h5f:
    a0, a1 = 10, 5
    arr = np.arange(a0*a1).reshape(a0,a1)
    ds = h5f.create_dataset('test', data=arr, maxshape=(None,a1))

Method 1: move the data, then resize the dataset

with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5                      # row to delete
    ds = h5f['test']
    #ds[idx,:] = 0               # not required, since we overwrite the row
    a0 = ds.shape[0]
    ds[idx:a0-1] = ds[idx+1:a0]  # shift every row after idx up by one
    ds.resize(a0-1, axis=0)      # trim the duplicated last row
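
Method 1 also generalizes to many scattered rows: walk the dataset once, copying each surviving row down into the next gap, and resize once at the end. Here is a minimal sketch of that idea; the row indices are made up for illustration:

import h5py

rows_to_delete = {2, 5, 7}  # hypothetical indices of the zeroed rows

with h5py.File('SO_73353006.h5','r+') as h5f:
    ds = h5f['test']
    a0 = ds.shape[0]
    write = min(rows_to_delete)     # first gap; rows before it stay put
    for read in range(write, a0):
        if read in rows_to_delete:
            continue                # skip rows being deleted
        ds[write] = ds[read]        # shift the surviving row into the gap
        write += 1
    ds.resize(write, axis=0)        # trim the leftover tail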

Method 2: extract the array, delete the row, and copy the data to the resized dataset

with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    a0 = ds.shape[0]
    a1 = ds.shape[1]
    # read dataset into array and delete row
    ds_arr = ds[()]
    ds_arr = np.delete(ds_arr, obj=idx, axis=0)  
    # Resize dataset and load array
    ds.resize(a0-1,axis=0)  # same as above
    ds[:] = ds_arr[:]
    # Create a new dataset for comparison
    ds2 = h5f.create_dataset('test2',data=ds_arr,maxshape=(None,a1))
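
Either way, a quick sanity check afterwards confirms the shrunken dataset matches the rebuilt copy:

import h5py
import numpy as np

with h5py.File('SO_73353006.h5','r') as h5f:
    assert np.array_equal(h5f['test'][()], h5f['test2'][()])
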
  • I'm zeroing many rows at different positions. – sten Aug 15 '22 at 03:02
  • Then extracting to an array and using `np.delete()` is probably faster (method 2). The `obj=` argument can be a tuple of row indices, so you can delete all of the rows in one call (and do it in memory). It may also be faster to delete the old dataset and create it again: add `del h5f['test']` to my example. – kcw78 Aug 15 '22 at 12:51
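
Putting the comment's suggestion together: read the dataset once, delete all flagged rows with a single np.delete() call, then replace the old dataset. One way to keep the { name : ds_ix } dict from the question consistent is to shift each surviving index down by the number of deleted indices below it. A minimal sketch; the row indices and dict contents are made up for illustration:

import h5py
import numpy as np

rows_to_delete = [1, 4, 7]             # hypothetical zeroed-row indices
name_to_ix = {'a': 0, 'b': 2, 'c': 5}  # hypothetical { name : ds_ix } dict

with h5py.File('SO_73353006.h5','r+') as h5f:
    ds_arr = h5f['test'][()]                                # read once
    ds_arr = np.delete(ds_arr, obj=rows_to_delete, axis=0)  # drop all rows in memory
    del h5f['test']                                         # unlink the old dataset
    h5f.create_dataset('test', data=ds_arr, maxshape=(None, ds_arr.shape[1]))

# Each surviving index drops by the number of deleted indices below it.
deleted = np.sort(rows_to_delete)
name_to_ix = {name: ix - int(np.searchsorted(deleted, ix))
              for name, ix in name_to_ix.items()
              if ix not in set(rows_to_delete)}

Note that del h5f['test'] only unlinks the dataset; HDF5 does not shrink the file on disk until it is repacked (e.g. with the h5repack command-line utility).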