
Some time ago I wrote a parallel Fortran code that diagonalises very large dense matrices on a supercomputer. These matrices are read from chunked dense HDF5 datasets. Now I want to use this code for very sparse matrices that are constructed in Python.

However, when I try to write my data to a dense HDF5 file it takes a very long time. The sparse matrix is composed of 3x3 nonzero blocks and is stored using three arrays: rows, cols, data. I tried writing each block iteratively:

import h5py

fl = h5py.File(filepath, 'w')
dataset = fl.create_dataset("matrix", shape, dtype='d',
                            chunks=(60, 60), compression='szip',
                            fillvalue=0)

# write one 3x3 block per iteration
for row, col, val in zip(rows, cols, data):
    dataset[row*3: row*3 + 3, col*3: col*3 + 3] = val

fl.close()

For a small matrix composed of 14848 nonzero blocks (dense shape (1536, 1536)) it takes 2.6 seconds to write. And I need to write matrices that are more than 100 times larger (with much higher sparsity).

  • So `row` and `col` are 1d (14848,) shape, and `data` is 3d (14848,3,3) shape? – hpaulj Oct 26 '19 at 16:32
  • Yes, this is the case – Baranas Oct 27 '19 at 06:53
  • I'm dealing with approximately the same issue, but with no sparsity; I suspect a solution can be found by tuning the dataset blocks. I will post an answer if I am successful... – lensonp Dec 18 '20 at 22:04
  • I ended up looking here: https://stackoverflow.com/questions/43533913/optimising-hdf5-dataset-for-read-write-speed and then aligned my chunks with the dataset rows. Much faster that way, in my case, but it seems like chunk alignment can be quite problem-specific. – lensonp Dec 18 '20 at 22:49

1 Answer


I don't know if this will help, either with speed or convenience, but:

scipy.sparse has a block compressed format that reminds me of your data. It's not exactly the same.

From the docs of the sparse.bsr_matrix:

In [375]: indptr = np.array([0, 2, 3, 6])
     ...: indices = np.array([0, 2, 2, 0, 1, 2])
     ...: data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape(6, 2, 2)
     ...: M = sparse.bsr_matrix((data, indices, indptr), shape=(6, 6))
In [377]: M                                                                     
Out[377]: 
<6x6 sparse matrix of type '<class 'numpy.int64'>'
    with 24 stored elements (blocksize = 2x2) in Block Sparse Row format>
In [378]: M.data                                                                
Out[378]: 
array([[[1, 1],
        [1, 1]],

       [[2, 2],
        [2, 2]],

       [[3, 3],
        [3, 3]],

       [[4, 4],
        [4, 4]],

       [[5, 5],
        [5, 5]],

       [[6, 6],
        [6, 6]]])
In [379]: M.data.shape                                                          
Out[379]: (6, 2, 2)
In [380]: M.indptr                                                              
Out[380]: array([0, 2, 3, 6], dtype=int32)
In [381]: M.indices                                                             
Out[381]: array([0, 2, 2, 0, 1, 2], dtype=int32)

This is the compressed format, with indptr and indices rather than col and row arrays. sparse doesn't have a block version of the coo format.

Anyways, sparse has (relatively) fast methods for converting between formats.

In [382]: Mo = M.tocoo()                                                        

In [384]: (Mo.row, Mo.col, Mo.data)                                             
Out[384]: 
(array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4,
        5, 5], dtype=int32),
 array([0, 1, 0, 1, 4, 5, 4, 5, 4, 5, 4, 5, 0, 1, 0, 1, 2, 3, 2, 3, 4, 5,
        4, 5], dtype=int32),
 array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
        6, 6]))

This data could be used to fill in a zeros array with one expression:

In [385]: A = np.zeros((6,6),int)                                               
In [386]: A[Mo.row, Mo.col] = Mo.data                                           
In [387]: A                                                                     
Out[387]: 
array([[1, 1, 0, 0, 2, 2],
       [1, 1, 0, 0, 2, 2],
       [0, 0, 0, 0, 3, 3],
       [0, 0, 0, 0, 3, 3],
       [4, 4, 5, 5, 6, 6],
       [4, 4, 5, 5, 6, 6]])
In [388]: M.A                                                                   
Out[388]: 
array([[1, 1, 0, 0, 2, 2],
       [1, 1, 0, 0, 2, 2],
       [0, 0, 0, 0, 3, 3],
       [0, 0, 0, 0, 3, 3],
       [4, 4, 5, 5, 6, 6],
       [4, 4, 5, 5, 6, 6]])

https://docs.h5py.org/en/stable/high/dataset.html#fancy-indexing does warn that h5py fancy indexing can be slow, especially if it spans chunks. Still it might be faster than iteratively writing 3x3 slices.
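One alternative worth timing, sketched here with a toy matrix and an assumed file name: convert to csr and write chunk-aligned row bands, so each assignment covers whole chunks and only one band is densified in memory at a time.

```python
import h5py
import numpy as np
from scipy import sparse

# toy stand-in for the real matrix (sizes chosen to match the 60x60 chunks)
M = sparse.random(600, 600, density=0.01, format='csr', random_state=0)

with h5py.File('matrix.h5', 'w') as fl:    # assumed file name
    dset = fl.create_dataset('matrix', M.shape, dtype='d',
                             chunks=(60, 60), fillvalue=0)
    band = 60                              # match the chunk height
    for start in range(0, M.shape[0], band):
        stop = min(start + band, M.shape[0])
        # densify one row band at a time; each write hits whole chunks
        dset[start:stop, :] = M[start:stop, :].toarray()
```

Whether this beats per-block writes depends on the chunk layout, as the comments above note.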

So the unknowns are:

  • how to convert your block format to bsr
  • the speed of the bsr.tocoo() step
  • the relative speed of fancy h5py write
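For the first unknown, the block triplets can also be expanded to element-level coo arrays directly with NumPy, skipping bsr entirely. A sketch, assuming rows and cols are (nblocks,) block indices and data is (nblocks, 3, 3) as in the question (the toy arrays below are stand-ins):

```python
import numpy as np
from scipy import sparse

# toy stand-ins for the question's arrays
rows = np.array([0, 1])                        # (nblocks,) block row indices
cols = np.array([1, 0])                        # (nblocks,) block col indices
data = np.arange(18, dtype='d').reshape(2, 3, 3)  # (nblocks, 3, 3) blocks

# offsets of the 9 entries inside one 3x3 block, in row-major order
br = np.repeat(np.arange(3), 3)   # [0 0 0 1 1 1 2 2 2]
bc = np.tile(np.arange(3), 3)     # [0 1 2 0 1 2 0 1 2]

# each block index expands to 9 element-level coordinates
R = (3 * rows[:, None] + br).ravel()
C = (3 * cols[:, None] + bc).ravel()
V = data.reshape(-1)              # row-major flatten matches br/bc ordering

M = sparse.coo_matrix((V, (R, C)), shape=(6, 6))
```

From there `M.toarray()` (or a row-band conversion via `M.tocsr()`) gives the dense data to hand to h5py.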
  • Thanks for the reply. Actually I already tried converting to `bsr` (then to `coo`) for manipulations. However, `h5py` indexing with two arrays (`rows` and `cols`) is not supported. Then I tried iterating row by row and selecting the nonzero elements residing in the current row. It took even longer... – Baranas Oct 27 '19 at 07:05