
I'm trying to optimize a piece of code that solves a large sparse nonlinear system using an interior point method. Each update step involves computing the Hessian matrix H and the gradient g, then solving H * d = -g for d to get the new search direction.
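
A minimal sketch of that update step, with scipy's spsolve standing in for my_solver (which also reuses a factorization between calls, omitted here), just to fix notation:

from scipy import sparse
from scipy.sparse.linalg import spsolve

def update_direction(H, g):
    # H: sparse Hessian, g: dense gradient; the search direction d solves H * d = -g
    return spsolve(sparse.csc_matrix(H), -g)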

The Hessian matrix has a symmetric tridiagonal structure of the form:

A.T * diag(b) * A + C

I've run line_profiler on the particular function in question:

Line # Hits     Time  Per Hit % Time Line Contents
==================================================
   386                               def _direction(n, res, M, Hsig, scale_var, grad_lnprior, z, fac):
   387                               
   388                                   # gradient
   389   44  1241715  28220.8    3.7     g = 2 * scale_var * res - grad_lnprior + z * np.dot(M.T, 1. / n)
   390                               
   391                                   # hessian
   392   44  3103117  70525.4    9.3     N = sparse.diags(1. / n ** 2, 0, format=FMT, dtype=DTYPE)
   393   44 18814307 427597.9   56.2     H = - Hsig - z * np.dot(M.T, np.dot(N, M))    # slow!
   394                                   
   395                                   # update direction
   396   44 10329556 234762.6   30.8     d, fac = my_solver(H, -g, fac)
   397                                   
   398   44      111      2.5    0.0     return d, fac

Looking at the output, it's clear that constructing H is by far the most costly step: it takes considerably longer than actually solving for the new direction.

Hsig and M are both CSC sparse matrices, n is a dense vector and z is a scalar. The solver I'm using requires H to be either a CSC or CSR sparse matrix.

Here's a function that produces some toy data with the same formats, dimensions and sparseness as my real matrices:

import numpy as np
from scipy import sparse

def make_toy_data(nt=200000, nc=10):

    d0 = np.random.randn(nc * (nt - 1))
    d1 = np.random.randn(nc * (nt - 1))
    M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
                     format='csc', dtype=np.float64)

    d0 = np.random.randn(nc * nt)
    Hsig = sparse.diags(d0, 0, shape=(nc * nt, nc * nt), format='csc',
                        dtype=np.float64)

    n = np.random.randn(nc * (nt - 1))
    z = np.random.randn()

    return Hsig, M, n, z

And here's my original approach for constructing H:

def original(Hsig, M, n, z):
    N = sparse.diags(1. / n ** 2, 0, format='csc')
    H = - Hsig - z * np.dot(M.T, np.dot(N, M))    # slow!
    return H

Timing:

%timeit original(Hsig, M, n, z)
# 1 loops, best of 3: 483 ms per loop

Is there a faster way to construct this matrix?

Amir
ali_m
  • My NumPy won't let me do `np.dot(M.T, np.dot(N, M))`. Does it definitely run on your machine? Do you want to do `N.dot(M)`? – YXD Apr 17 '14 at 22:02
  • @MrE I suspect it's probably a version issue - `numpy.dot()` is overridden for sparse matrices as of [this commit](https://github.com/scipy/scipy/pull/2732) 8 months ago. – ali_m Apr 17 '14 at 22:28
  • You don't need `dot`, just `Hsig - z * M.T * (N * M)`, but I don't know whether it's faster. – HYRY Apr 18 '14 at 02:12
  • @HYRY Yeah, I know - I almost always work with ndarrays rather than matrices, so the `np.dot()` syntax feels more natural to me than `*`. There's no real performance difference, since both just call the `.__mul__()` method of the sparse matrix. – ali_m Apr 18 '14 at 02:54

2 Answers


I get close to a 4x speed-up by building the product M.T * D * M directly from the three diagonal arrays. If d0 and d1 are the main and upper diagonals of M, and d is the main diagonal of D, then the following code creates M.T * D * M without forming the intermediate products:

def make_tridi_bis(d0, d1, d, nc=10):
    d00 = d0*d0*d
    d11 = d1*d1*d
    d01 = d0*d1*d
    len_ = d0.size
    data = np.empty((3*len_ + nc,))
    indices = np.empty((3*len_ + nc,), dtype=np.intp)
    # Fill main diagonal
    data[:2*nc:2] = d00[:nc]
    indices[:2*nc:2] = np.arange(nc)
    data[2*nc+1:-2*nc:3] = d00[nc:] + d11[:-nc]
    indices[2*nc+1:-2*nc:3] = np.arange(nc, len_)
    data[-2*nc+1::2] = d11[-nc:]
    indices[-2*nc+1::2] = np.arange(len_, len_ + nc)
    # Fill top diagonal
    data[1:2*nc:2] = d01[:nc]
    indices[1:2*nc:2] = np.arange(nc, 2*nc)
    data[2*nc+2:-2*nc:3] = d01[nc:]
    indices[2*nc+2:-2*nc:3] = np.arange(2*nc, len_+nc)
    # Fill bottom diagonal
    data[2*nc:-2*nc:3] = d01[:-nc]
    indices[2*nc:-2*nc:3] = np.arange(len_ - nc)
    data[-2*nc::2] = d01[-nc:]
    indices[-2*nc::2] = np.arange(len_ - nc ,len_)

    indptr = np.empty((len_ + nc + 1,), dtype=np.intp)
    indptr[0] = 0
    indptr[1:nc+1] = 2
    indptr[nc+1:len_+1] = 3
    indptr[-nc:] = 2
    np.cumsum(indptr, out=indptr)

    return sparse.csr_matrix((data, indices, indptr), shape=(len_+nc, len_+nc))

If your matrix M were in CSR format, you could extract d0 and d1 as d0 = M.data[::2] and d1 = M.data[1::2]. I modified your toy-data routine to return those arrays as well (one possible version of that modification is sketched after the timings below), and here's what I get:

In [90]: np.allclose((M.T * sparse.diags(d, 0) * M).A, make_tridi_bis(d0, d1, d).A)
Out[90]: True

In [92]: %timeit make_tridi_bis(d0, d1, d)
10 loops, best of 3: 124 ms per loop

In [93]: %timeit M.T * sparse.diags(d, 0) * M
1 loops, best of 3: 501 ms per loop
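
Here is one way the question's make_toy_data() could be modified to also return the raw diagonals (a sketch for reproducibility, not necessarily the exact routine used):

import numpy as np
from scipy import sparse

def make_toy_data_with_diags(nt=200000, nc=10):
    # same construction as in the question, but the diagonals of M and D
    # are kept so they can be fed straight into make_tridi_bis()
    d0 = np.random.randn(nc * (nt - 1))     # main diagonal of M
    d1 = np.random.randn(nc * (nt - 1))     # offset-nc diagonal of M
    M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
                     format='csc', dtype=np.float64)

    n = np.random.randn(nc * (nt - 1))
    d = 1. / n ** 2                         # main diagonal of D (called N in the question)

    return M, d0, d1, d

# M, d0, d1, d = make_toy_data_with_diags()
# make_tridi_bis(d0, d1, d) should then agree with M.T * sparse.diags(d, 0) * M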

The whole purpose of make_tridi_bis() is to take advantage of the structure of the non-zero entries. If you draw a diagram of the matrices you are multiplying together, it is relatively easy to convince yourself that the main (d_0) and top and bottom (d_1) diagonals of the resulting tridiagonal matrix are simply:

d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11

d_1 = d01

The rest of the code in that function is simply building the tridiagonal matrix directly, as calling sparse.diags with the above data is several times slower.
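
As a quick sanity check (toy sizes chosen purely for illustration), the vectors d_0 and d_1 above really are the diagonals of M.T * D * M:

import numpy as np
from scipy import sparse

nc, len_ = 3, 12
d0, d1, d = np.random.randn(len_), np.random.randn(len_), np.random.randn(len_)

# M has main diagonal d0 and offset-nc diagonal d1; D = diag(d)
M = sparse.diags((d0, d1), (0, nc), shape=(len_, len_ + nc), format='csc')
P = (M.T * sparse.diags(d, 0) * M).toarray()

d00, d11, d01 = d0 * d0 * d, d1 * d1 * d, d0 * d1 * d
d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11
d_1 = d01

print(np.allclose(np.diag(P), d_0))        # main diagonal
print(np.allclose(np.diag(P, nc), d_1))    # upper (offset +nc) diagonal
print(np.allclose(np.diag(P, -nc), d_1))   # lower diagonal, by symmetry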

Jaime
  • That is impressively quick. I'm still trying to wrap my head around how it works. If I wanted to construct just the main and upper/lower diagonals of `M.T * D * M` as dense vectors, how would I obtain these? – ali_m Apr 18 '14 at 14:56

I tried running your test case and had problems with np.dot(N, M). I didn't dig into it, but I think my numpy/scipy combination (both fairly new versions) doesn't handle np.dot on sparse matrices.

But H = -Hsig - z*M.T.dot(N.dot(M)) runs just fine. This uses the sparse dot.
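
As a quick illustration (reusing the question's make_toy_data() with a smaller nt so it runs quickly; purely illustrative), the .dot and * spellings produce the same matrix:

import numpy as np
from scipy import sparse

Hsig, M, n, z = make_toy_data(nt=2000, nc=10)   # smaller problem, same structure
N = sparse.diags(1. / n ** 2, 0, format='csc')

H1 = -Hsig - z * M.T.dot(N.dot(M))   # sparse .dot() method
H2 = -Hsig - z * (M.T * (N * M))     # '*' is matrix multiplication for sparse matrices
print(abs(H1 - H2).max())            # 0.0 (same code path under the hood)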

I haven't run a profile, but here are IPython timings for several parts. It takes longer to generate the data than to do that double dot.

In [37]: timeit Hsig,M,n,z=make_toy_data()
1 loops, best of 3: 2 s per loop

In [38]: timeit N = sparse.diags(1. / n ** 2, 0, format='csc')
1 loops, best of 3: 377 ms per loop

In [39]: timeit H = -Hsig - z*M.T.dot(N.dot(M))
1 loops, best of 3: 1.55 s per loop

H is a

<2000000x2000000 sparse matrix of type '<type 'numpy.float64'>'
    with 5999980 stored elements in Compressed Sparse Column format>
hpaulj
  • As I mentioned above, the issues with `np.dot(N, M)` are version-related - in recent versions of scipy, the `.dot()` method of sparse arrays overrides `np.dot()`, so in my case the two are equivalent (and for some reason the `np.dot(A, B)` syntax feels more comfortable to me :-)). Regarding the timings, I only construct `Hsig` and `M` once but the `_direction()` function is called 1000s of times, so I really do care about the overhead involved in constructing `H`. – ali_m Apr 17 '14 at 22:39
  • The sparse .dot is just matrix multiplication, so you could write `M.T*(N*M)` or even `M.T*N*M`. Not that there's any speed difference. The sparsetools.csr.h file references this SMMP paper: http://www.mgnet.org/~douglas/Preprints/pub0034.pdf – hpaulj Apr 18 '14 at 02:35