
I'm trying to optimize a piece of code that solves a large sparse nonlinear system using an interior point method. Each update step involves computing the Hessian matrix H and the gradient g, then solving H * d = -g for d to get the new search direction.
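
A minimal sketch of that update step, with scipy's spsolve standing in for my_solver (which also reuses a factorization between calls, omitted here), just to fix notation:

from scipy import sparse
from scipy.sparse.linalg import spsolve

def update_direction(H, g):
    # H: sparse Hessian, g: dense gradient; the search direction d solves H * d = -g
    return spsolve(sparse.csc_matrix(H), -g)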

The Hessian matrix has a symmetric tridiagonal structure of the form:

A.T * diag(b) * A + C

I've run line_profiler on the particular function in question:

Line # Hits     Time  Per Hit % Time Line Contents
==================================================
   386                               def _direction(n, res, M, Hsig, scale_var, grad_lnprior, z, fac):
   387                               
   388                                   # gradient
   389   44  1241715  28220.8    3.7     g = 2 * scale_var * res - grad_lnprior + z * np.dot(M.T, 1. / n)
   390                               
   391                                   # hessian
   392   44  3103117  70525.4    9.3     N = sparse.diags(1. / n ** 2, 0, format=FMT, dtype=DTYPE)
   393   44 18814307 427597.9   56.2     H = - Hsig - z * np.dot(M.T, np.dot(N, M))    # slow!
   394                                   
   395                                   # update direction
   396   44 10329556 234762.6   30.8     d, fac = my_solver(H, -g, fac)
   397                                   
   398   44      111      2.5    0.0     return d, fac

Looking at the output, it's clear that constructing H is by far the most costly step: it takes considerably longer than actually solving for the new direction.

Hsig and M are both CSC sparse matrices, n is a dense vector and z is a scalar. The solver I'm using requires H to be either a CSC or CSR sparse matrix.

Here's a function that produces some toy data with the same formats, dimensions and sparseness as my real matrices:

import numpy as np
from scipy import sparse

def make_toy_data(nt=200000, nc=10):

    d0 = np.random.randn(nc * (nt - 1))
    d1 = np.random.randn(nc * (nt - 1))
    M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
                     format='csc', dtype=np.float64)

    d0 = np.random.randn(nc * nt)
    Hsig = sparse.diags(d0, 0, shape=(nc * nt, nc * nt), format='csc',
                        dtype=np.float64)

    n = np.random.randn(nc * (nt - 1))
    z = np.random.randn()

    return Hsig, M, n, z

And here's my original approach for constructing H:

def original(Hsig, M, n, z):
    N = sparse.diags(1. / n ** 2, 0, format='csc')
    H = - Hsig - z * np.dot(M.T, np.dot(N, M))    # slow!
    return H

Timing:

%timeit original(Hsig, M, n, z)
# 1 loops, best of 3: 483 ms per loop

Is there a faster way to construct this matrix?

Amir
ali_m
  • My NumPy won't let me do `np.dot(M.T, np.dot(N, M))`. Does it definitely run on your machine? Do you want to do `N.dot(M)`? – YXD Apr 17 '14 at 22:02
  • @MrE I suspect it's probably a version issue - `numpy.dot()` is overridden for sparse matrices as of [this commit](https://github.com/scipy/scipy/pull/2732) 8 months ago. – ali_m Apr 17 '14 at 22:28
  • You don't need `dot`, just `Hsig - z * M.T * (N * M)`, but I don't know whether it's faster. – HYRY Apr 18 '14 at 02:12
  • @HYRY Yeah, I know - I almost always work with ndarrays rather than matrices, so the `np.dot()` syntax feels more natural to me than `*`. There's no real performance difference, since both just call the `.__mul__()` method of the sparse matrix. – ali_m Apr 18 '14 at 02:54

2 Answers


I get close to a 4x speed-up by building the product M.T * D * M directly from the three diagonal arrays. If d0 and d1 are the main and upper diagonals of M, and d is the main diagonal of D, then the following code creates M.T * D * M without forming the intermediate products:

def make_tridi_bis(d0, d1, d, nc=10):
    d00 = d0*d0*d
    d11 = d1*d1*d
    d01 = d0*d1*d
    len_ = d0.size
    data = np.empty((3*len_ + nc,))
    indices = np.empty((3*len_ + nc,), dtype=np.intp)
    # Fill main diagonal
    data[:2*nc:2] = d00[:nc]
    indices[:2*nc:2] = np.arange(nc)
    data[2*nc+1:-2*nc:3] = d00[nc:] + d11[:-nc]
    indices[2*nc+1:-2*nc:3] = np.arange(nc, len_)
    data[-2*nc+1::2] = d11[-nc:]
    indices[-2*nc+1::2] = np.arange(len_, len_ + nc)
    # Fill top diagonal
    data[1:2*nc:2] = d01[:nc]
    indices[1:2*nc:2] = np.arange(nc, 2*nc)
    data[2*nc+2:-2*nc:3] = d01[nc:]
    indices[2*nc+2:-2*nc:3] = np.arange(2*nc, len_+nc)
    # Fill bottom diagonal
    data[2*nc:-2*nc:3] = d01[:-nc]
    indices[2*nc:-2*nc:3] = np.arange(len_ - nc)
    data[-2*nc::2] = d01[-nc:]
    indices[-2*nc::2] = np.arange(len_ - nc ,len_)

    indptr = np.empty((len_ + nc + 1,), dtype=np.intp)
    indptr[0] = 0
    indptr[1:nc+1] = 2
    indptr[nc+1:len_+1] = 3
    indptr[-nc:] = 2
    np.cumsum(indptr, out=indptr)

    return sparse.csr_matrix((data, indices, indptr), shape=(len_+nc, len_+nc))

If your matrix M were in CSR format, you could extract d0 and d1 as d0 = M.data[::2] and d1 = M.data[1::2]. I modified your toy-data routine to return those arrays as well (one possible version of that modification is sketched after the timings below), and here's what I get:

In [90]: np.allclose((M.T * sparse.diags(d, 0) * M).A, make_tridi_bis(d0, d1, d).A)
Out[90]: True

In [92]: %timeit make_tridi_bis(d0, d1, d)
10 loops, best of 3: 124 ms per loop

In [93]: %timeit M.T * sparse.diags(d, 0) * M
1 loops, best of 3: 501 ms per loop
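
Here is one way the question's make_toy_data() could be modified to also return the raw diagonals (a sketch for reproducibility, not necessarily the exact routine used):

import numpy as np
from scipy import sparse

def make_toy_data_with_diags(nt=200000, nc=10):
    # same construction as in the question, but the diagonals of M and D
    # are kept so they can be fed straight into make_tridi_bis()
    d0 = np.random.randn(nc * (nt - 1))     # main diagonal of M
    d1 = np.random.randn(nc * (nt - 1))     # offset-nc diagonal of M
    M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
                     format='csc', dtype=np.float64)

    n = np.random.randn(nc * (nt - 1))
    d = 1. / n ** 2                         # main diagonal of D (called N in the question)

    return M, d0, d1, d

# M, d0, d1, d = make_toy_data_with_diags()
# make_tridi_bis(d0, d1, d) should then agree with M.T * sparse.diags(d, 0) * M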

The whole purpose of make_tridi_bis() is to take advantage of the structure of the non-zero entries. If you draw a diagram of the matrices you are multiplying together, it is relatively easy to convince yourself that the main (d_0) and top and bottom (d_1) diagonals of the resulting tridiagonal matrix are simply:

d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11

d_1 = d01

The rest of the code in that function is simply building the tridiagonal matrix directly, as calling sparse.diags with the above data is several times slower.
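
As a quick sanity check (toy sizes chosen purely for illustration), the vectors d_0 and d_1 above really are the diagonals of M.T * D * M:

import numpy as np
from scipy import sparse

nc, len_ = 3, 12
d0, d1, d = np.random.randn(len_), np.random.randn(len_), np.random.randn(len_)

# M has main diagonal d0 and offset-nc diagonal d1; D = diag(d)
M = sparse.diags((d0, d1), (0, nc), shape=(len_, len_ + nc), format='csc')
P = (M.T * sparse.diags(d, 0) * M).toarray()

d00, d11, d01 = d0 * d0 * d, d1 * d1 * d, d0 * d1 * d
d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11
d_1 = d01

print(np.allclose(np.diag(P), d_0))        # main diagonal
print(np.allclose(np.diag(P, nc), d_1))    # upper (offset +nc) diagonal
print(np.allclose(np.diag(P, -nc), d_1))   # lower diagonal, by symmetry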

Jaime
  • That is impressively quick. I'm still trying to wrap my head around how it works. If I wanted to construct just the main and upper/lower diagonals of `M.T * D * M` as dense vectors, how would I obtain these? – ali_m Apr 18 '14 at 14:56

I tried running your test case and had problems with np.dot(N, M). I didn't dig into it, but I think my numpy/scipy combination (both fairly new versions) doesn't handle np.dot on sparse matrices.

But H = -Hsig - z*M.T.dot(N.dot(M)) runs just fine. This uses the sparse dot.
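
As a quick illustration (reusing the question's make_toy_data() with a smaller nt so it runs quickly; purely illustrative), the .dot and * spellings produce the same matrix:

import numpy as np
from scipy import sparse

Hsig, M, n, z = make_toy_data(nt=2000, nc=10)   # smaller problem, same structure
N = sparse.diags(1. / n ** 2, 0, format='csc')

H1 = -Hsig - z * M.T.dot(N.dot(M))   # sparse .dot() method
H2 = -Hsig - z * (M.T * (N * M))     # '*' is matrix multiplication for sparse matrices
print(abs(H1 - H2).max())            # 0.0 (same code path under the hood)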

I haven't run a profile, but here are IPython timings for several parts. It takes longer to generate the data than to do that double dot.

In [37]: timeit Hsig,M,n,z=make_toy_data()
1 loops, best of 3: 2 s per loop

In [38]: timeit N = sparse.diags(1. / n ** 2, 0, format='csc')
1 loops, best of 3: 377 ms per loop

In [39]: timeit H = -Hsig - z*M.T.dot(N.dot(M))
1 loops, best of 3: 1.55 s per loop

H is a

<2000000x2000000 sparse matrix of type '<type 'numpy.float64'>'
    with 5999980 stored elements in Compressed Sparse Column format>
hpaulj
  • As I mentioned above, the issues with `np.dot(N, M)` are version-related - in recent versions of scipy, the `.dot()` method of sparse arrays overrides `np.dot()`, so in my case the two are equivalent (and for some reason the `np.dot(A, B)` syntax feels more comfortable to me :-)). Regarding the timings, I only construct `Hsig` and `M` once but the `_direction()` function is called 1000s of times, so I really do care about the overhead involved in constructing `H`. – ali_m Apr 17 '14 at 22:39
  • The sparse .dot is just matrix multiplication, so you could write `M.T*(N*M)` or even `M.T*N*M`. Not that there's any speed difference. The sparsetools.csr.h file references this SMMP paper: http://www.mgnet.org/~douglas/Preprints/pub0034.pdf – hpaulj Apr 18 '14 at 02:35