
I have a sparse, triangular matrix (e.g. a distance matrix). In reality this would be a > 1M x 1M distance matrix with high sparsity.

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix([
    [1, 2, 3, 3, 1],
    [0, 1, 3, 3, 2],
    [0, 0, 1, 1, 3],
    [0, 0, 0, 1, 3],
    [0, 0, 0, 0, 1],
])

I want to subset this matrix to another triangular distance matrix. The indexes may be ordered differently and/or duplicated.

idx = np.matrix([1,2,4,2])
X2 = X[idx.T, idx]

The result may no longer be triangular: some values end up missing from the upper triangle, while others are duplicated into the lower triangle.

>>> X2.toarray()
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])

How can I get the correct upper triangle matrix as efficiently as possible? Currently, I mirror the matrix before subsetting and subset it back to the triangle afterwards, but this doesn't feel particularly efficient, as it requires at least duplicating every entry.

# use transpose method, see https://stackoverflow.com/a/58806735/2340703
import scipy.sparse
X = X + X.T - scipy.sparse.diags(X.diagonal())
X2 = X[idx.T, idx]
X2 = scipy.sparse.triu(X2, k=0, format="csr")
>>> X2.toarray()
array([[1., 3., 2., 3.],
       [0., 1., 3., 1.],
       [0., 0., 1., 3.],
       [0., 0., 0., 1.]])
Gregor Sturm
  • To clarify - you're sampling back up to the same size as your original distance matrix or you're subsetting to a smaller size? – CJR Jan 04 '21 at 15:53
  • not significantly smaller. Sometimes even larger due to repeated elements. – Gregor Sturm Jan 04 '21 at 16:03
  • I'm not sure it will help, but `scipy.csr` when doing indexing like this actually creates an `extractor` matrix, and gets the result with matrix multiplication. – hpaulj Jan 04 '21 at 16:34
  • Just to be clear, `X2` is a valid representation of your data, right? You are just using the `triu` to save memory? It might also help to look at what `sparse.triu` does. It applies a boolean mask to the `coo` array attributes, creating a new `coo` array without the lower tri values. – hpaulj Jan 04 '21 at 16:54
  • Yes, the idea was to save memory using `triu`. However, I start getting the impression that it's not worth it. Creating the `triu` appears to consume a lot of memory. Does it materialize the boolean mask? – Gregor Sturm Jan 04 '21 at 17:13
  • How sparse is your upper triangular matrix? Your example is dense, but do you expect to have ~1% of values be non-zero? More? Less? – CJR Jan 04 '21 at 17:31
  • way less than 1%. From a real-world example: `X.nnz / X.shape[0]**2` = 8.28e-05 – Gregor Sturm Jan 04 '21 at 17:40
  • 1
    I would consider implementing this as a compressed distance matrix in the same style that [pdist](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) does, but as a 1xN CSR matrix, and then using coordinate math to reindex it when you need to get specific values. That's a bit more of an XY solution though; I just don't think that there's a good way to do the specific thing you're asking to do. – CJR Jan 04 '21 at 17:52

4 Answers


Here's a method that does not involve mirroring the data, and instead does manipulation of sparse indices to arrive at the desired result:

import numpy as np
import scipy.sparse as sp

X2 = X[idx.T, idx]

# Extract indices and data (this is essentially COO format)
i, j, data = sp.find(X2)

# Generate indices with elements moved to upper triangle
ij = np.vstack([
  np.where(i > j, j, i),
  np.where(i > j, i, j)
])

# Remove duplicate elements
ij, ind = np.unique(ij, axis=1, return_index=True)

# Re-build the matrix
X2 = sp.coo_matrix((data[ind], ij)).tocsr()
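Applied to the example from the question (repeated here so the snippet is self-contained, with a plain `np.array` index in place of `np.matrix`), this produces the expected upper triangle:

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([
    [1, 2, 3, 3, 1],
    [0, 1, 3, 3, 2],
    [0, 0, 1, 1, 3],
    [0, 0, 0, 1, 3],
    [0, 0, 0, 0, 1],
])
idx = np.array([1, 2, 4, 2])

X2 = X[idx[:, None], idx]

# extract indices and data (essentially COO format)
i, j, data = sp.find(X2)

# move every entry to the upper triangle
ij = np.vstack([
    np.where(i > j, j, i),
    np.where(i > j, i, j),
])

# drop duplicate coordinates (keeping the first occurrence)
ij, ind = np.unique(ij, axis=1, return_index=True)

# re-build the matrix
X2 = sp.coo_matrix((data[ind], ij)).tocsr()

print(X2.toarray())
# [[1 3 2 3]
#  [0 1 3 1]
#  [0 0 1 3]
#  [0 0 0 1]]
```

Duplicate coordinates can only arise from mirrored pairs of the same symmetric entry, so keeping the first occurrence is safe.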
jakevdp

Well, I can't get it to triu natively, but this should be faster:

idx = np.array([1,2,4,2])
i = np.stack(np.meshgrid(idx, idx))
X2 = X[i.min(0), i.max(0)]

>>> X2.toarray()
array([[1, 3, 2, 3],
       [3, 1, 3, 1],
       [2, 3, 1, 3],
       [3, 1, 3, 1]])

So the whole process would be:

import numpy as np
import scipy.sparse

idx = np.array([1,2,4,2])
i = np.stack(np.meshgrid(idx, idx))
X2 = scipy.sparse.triu(X[i.min(0), i.max(0)], k=0, format="csr")

But I can't shake the feeling there's got to be a more optimized way.

Daniel F
  • interesting approach! But in its current form it wouldn't scale since the meshgrid is a dense matrix with the same dimensions as `X`. – Gregor Sturm Jan 04 '21 at 15:35

This is not an improved working answer, but rather an exploration of what sparse indexing and triu do. It might give you ideas for more direct calculations. The fact that you start with a tri, and expect a tri, means this is not a trivial task, not even with dense arrays (which index a whole lot faster).

sparse.csr indexing uses matrix multiplication. I'll illustrate this with dense arrays:

In [304]: X = np.array([
     ...:       [1, 2, 3, 3, 1],
     ...:       [0, 1, 3, 3, 2],
     ...:       [0, 0, 1, 1, 3],
     ...:       [0, 0, 0, 1, 3],
     ...:       [0, 0, 0, 0, 1],
     ...: ])
In [305]: idx = np.array([1,2,4,2])
In [306]: X[idx[:,None],idx]
Out[306]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])
In [307]: m = np.array([[0,1,0,0,0],[0,0,1,0,0],[0,0,0,0,1],[0,0,1,0,0]])
In [308]: m@X@m.T
Out[308]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])

and with the full distance array:

In [309]: X2 = X+X.T-np.diag(np.diag(X))
In [311]: X2[idx[:,None],idx]
Out[311]: 
array([[1, 3, 2, 3],
       [3, 1, 3, 1],
       [2, 3, 1, 3],
       [3, 1, 3, 1]])
In [312]: m@X2@m.T
Out[312]: 
array([[1, 3, 2, 3],
       [3, 1, 3, 1],
       [2, 3, 1, 3],
       [3, 1, 3, 1]])

I don't know whether it's possible to construct an m that produces the desired result, upper tri or not, directly from X (or X2):

In [316]: sparse.triu(Out[312])
Out[316]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in COOrdinate format>
In [317]: _.A
Out[317]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 3],
       [0, 0, 0, 1]])

sparse.triu does:

In [331]: A = sparse.coo_matrix(_312)
     ...: mask = A.row <= A.col 
In [332]: A
Out[332]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 16 stored elements in COOrdinate format>
In [333]: mask
Out[333]: 
array([ True,  True,  True,  True, False,  True,  True,  True, False,
       False,  True,  True, False, False, False,  True])

This mask array has 16 terms, equal to A.nnz.

It then makes a new coo matrix with data/row/col arrays selected from A attributes:

In [334]: d=A.data[mask]
In [335]: r=A.row[mask]
In [336]: c=A.col[mask]
In [337]: d
Out[337]: array([1, 3, 2, 3, 1, 3, 1, 1, 3, 1])
In [338]: sparse.coo_matrix((d, (r,c)))
Out[338]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in COOrdinate format>
In [339]: _.A
Out[339]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 3],
       [0, 0, 0, 1]])

np.triu uses a mask like:

In [349]: np.tri(4,4,-1)
Out[349]: 
array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.]])
hpaulj

To summarize all the excellent contributions, the short answer to this question is:

Don't use triangular matrices. There's nothing to gain in terms of speed or memory compared to using square matrices.

The reason for this is explained in @hpaulj's answer:

  • Slicing on sparse matrices uses matrix multiplication, which is highly efficient. Rearranging the result into a triangular shape afterwards will be slow by comparison.
  • Using triu is an expensive operation, since it materializes a boolean mask over all stored entries and builds an entirely new matrix from it.

This becomes evident when comparing @jakevdp's solution with just using a square matrix. Using the square form is faster and uses less memory.

The test uses a sparse triangular 800k x 800k distance matrix with high sparsity (fraction of non-zero entries << 1%). Data and code are available here.

# Running benchmark: Converting to square matrix
./benchmark.py squarify   6.29s  user 1.59s system 80% cpu 9.738 total
max memory:                4409 MB

# Running benchmark: @jakevdp's solution
./benchmark.py sparse_triangular   67.03s  user 3.01s system 99% cpu 1:10.15 total
max memory:                5209 MB
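The two routes do produce identical upper-triangular results, which is what makes the comparison fair. A quick self-contained check on synthetic data (sizes here are made up and far smaller than the 800k benchmark):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, nnz = 2_000, 4_000          # hypothetical sizes for illustration
r = rng.integers(0, n, nnz)
c = rng.integers(0, n, nnz)
vals = rng.random(nnz)
# build a sparse upper-triangular "distance" matrix
X = sp.coo_matrix((vals, (np.minimum(r, c), np.maximum(r, c))), shape=(n, n)).tocsr()
idx = rng.integers(0, n, 500)

# route 1: mirror to a square matrix, subset, take the upper triangle
Xsq = X + X.T - sp.diags(X.diagonal())
ref = sp.triu(Xsq[idx[:, None], idx], format="csr")

# route 2: index manipulation, no mirroring (@jakevdp's answer)
X2 = X[idx[:, None], idx]
i, j, data = sp.find(X2)
ij = np.vstack([np.minimum(i, j), np.maximum(i, j)])
ij, ind = np.unique(ij, axis=1, return_index=True)
tri = sp.coo_matrix((data[ind], (ij[0], ij[1])), shape=(len(idx),) * 2).tocsr()

assert (ref != tri).nnz == 0   # both routes agree exactly
```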

If one desperately wants to optimize this beyond using square matrices, @CJR's comment is a good starting point:

I would consider implementing this as a compressed distance matrix in the same style that pdist does, but as a 1xN CSR matrix, and then using coordinate math to reindex it when you need to get specific values.
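The coordinate math involved can be sketched as follows. This assumes pdist's condensed layout, which flattens the strictly upper triangle of an n x n symmetric matrix row by row into a vector of length n*(n-1)//2 (the diagonal is omitted); `condensed_index` is a made-up name for illustration:

```python
import numpy as np

def condensed_index(i, j, n):
    """Map coordinates (i, j), i != j, of an n x n symmetric matrix
    to the flat index of a pdist-style condensed distance vector."""
    i, j = np.minimum(i, j), np.maximum(i, j)  # fold into the upper triangle
    return n * i - i * (i + 1) // 2 + (j - i - 1)
```

A condensed distance vector could then be held as a 1 x m CSR matrix and subset with `D[0, condensed_index(ii, jj, n)]` for broadcast index arrays, never materializing the square form. Since the diagonal is omitted, self-distances (which are constant anyway) need separate handling.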

Gregor Sturm