
The Woodbury matrix identity states that the inverse of a rank-k correction of some matrix can be computed by doing a rank-k correction to the inverse of the original matrix.

If A is a p × p full-rank matrix that is rank-corrected by UCV, where U is p × k, C is k × k, and V is k × p, then the Woodbury identity is:

(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}

The key is that rather than inverting a p × p matrix, you only invert a k × k matrix. In many applications k ≪ p. Inverting A itself can also be fast in some cases, for example when A is a diagonal matrix.

I've implemented this here, assuming A is diagonal and that C is the identity:

import numpy as np

def woodbury(A, U, V, k):
    A_inv = np.diag(1./np.diag(A))  # Fast matrix inversion of a diagonal.
    B_inv = np.linalg.inv(np.eye(k) + V @ A_inv @ U)
    return A_inv - (A_inv @ U @ B_inv @ V @ A_inv)

and sanity checked my implementation by verifying that

p = 1000
k = 100
A = np.diag(np.random.randn(p))
U = np.random.randn(p, k)
V = U.T
M = U @ V + A
M_inv = woodbury(A, U, V, k)
assert np.allclose(M @ M_inv, np.eye(p))

But when I actually compare this to numpy.linalg.inv, my Woodbury function is not nearly as fast as I would have expected. I would expect the time to invert to grow cubically with the dimensionality p, but my results are:

[Plot: runtime of woodbury() vs. numpy.linalg.inv as a function of p]

My question is: why is the Woodbury method so slow? Is it simply because I'm comparing Python code to LAPACK or is there something else going on?

EDIT: My experiments with einsum() vs. broadcasting

I implemented three versions: (1) using einsum and einsum_path, (2) using broadcasting per the accepted answer, and (3) using both. Here is my implementation using einsum, optimized using einsum_path:

def woodbury_einsum(A, U, V, k):
    A_inv = np.diag(1./np.diag(A))
    # V @ A_inv @ U, contracted using a precomputed einsum_path.
    tmp   = np.einsum('ab,bc,cd->ad',
                      V, A_inv, U,
                      optimize=['einsum_path', (1, 2), (0, 1)])
    B_inv = np.linalg.inv(np.eye(k) + tmp)
    # A_inv @ U @ B_inv @ V @ A_inv, again with a precomputed path.
    tmp   = np.einsum('ab,bc,cd,de,ef->af',
                      A_inv, U, B_inv, V, A_inv,
                      optimize=['einsum_path', (0, 1), (0, 1), (0, 2), (0, 1)])
    return A_inv - tmp
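
It took me a while to see how the hard-coded paths above relate to einsum_path, so for reference, here is a minimal sketch of how to compute them (assuming V, A_inv, and U already exist with the shapes used above):

path, info = np.einsum_path('ab,bc,cd->ad', V, A_inv, U, optimize='optimal')
print(path)  # e.g. ['einsum_path', (1, 2), (0, 1)] -- pass this as the optimize argument
print(info)  # printable summary of the contraction order and estimated FLOPs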

And results are here:

[Plot: runtime comparison of the einsum, broadcasting, and combined implementations]

So simply avoiding the cost of dense matrix multiplication with a diagonal matrix beats optimizing the multiplication order and memory footprint with einsum().
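
For reference, a minimal sketch of the kind of timing loop behind these plots (the grid of p values and the repetition count here are illustrative, not the exact ones I used):

import timeit

k = 100
for p in [1000, 2000, 4000, 8000]:
    A = np.diag(np.random.randn(p))
    U = np.random.randn(p, k)
    V = U.T
    for fn in (woodbury, woodbury_einsum):
        t = timeit.timeit(lambda: fn(A, U, V, k), number=5) / 5
        print(f'{fn.__name__}, p={p}: {t:.4f} s per call')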

  • Just a speculation: there are many possible orders in which you could evaluate `A_inv @ U @ B_inv @ V @ A_inv`, possibly with different costs. You can try to use `numpy.einsum_path` to find an optimal sequence to carry out the matrix multiplications. – hilberts_drinking_problem Nov 30 '18 at 20:58
  • This worked. If you want to write this up as an answer, I'm happy to accept it. It took me a bit to figure out how to use the results of `einsum_path`. – jds Nov 30 '18 at 23:35

1 Answer


As you mention, inverting A + UCV can be made faster with the Woodbury technique when A is diagonal. That is, in the Woodbury formula your multiplications by A^{-1} should happen in O(p x k) time instead of O(p x k x p), since all you're doing is scaling the rows/columns of the right/left factor.

However, that's not what you're doing in the following code!

def woodbury(A, U, V, k):
    A_inv = np.diag(1./np.diag(A))
    B_inv = np.linalg.inv(np.eye(k) + V @ A_inv @ U)
    return A_inv - (A_inv @ U @ B_inv @ V @ A_inv)

Your A_inv is a full p x p matrix! Yes, the diagonal is the only part that contains non-zeros, but the arithmetic with all of the zero elements is still carried out on this dense matrix. Instead, you should use NumPy's broadcasting to avoid this unnecessary work. (Or store A_inv as a sparse diagonal matrix using SciPy's sparse module.)

For example,

def efficient_woodbury(A, U, V, k):
    A_inv_diag = 1./np.diag(A)  # note! A_inv_diag is a vector!
    # V * A_inv_diag scales the columns of V, i.e. V @ A_inv, in O(p*k).
    B_inv = np.linalg.inv(np.eye(k) + (V * A_inv_diag) @ U)
    # Row-scale U (A_inv @ U) and column-scale the result (@ A_inv),
    # never forming the dense p x p inverse.
    return np.diag(A_inv_diag) - (A_inv_diag.reshape(-1,1) * U @ B_inv @ V * A_inv_diag)

The product V * A_inv_diag is equivalent to your V @ A_inv but runs in only O(p x k) time as opposed to O(p x k x p). Similarly for the other replaced products.
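
As a quick check of those equivalences on small random matrices (a sketch, not part of the timed code):

p, k = 6, 3
A = np.diag(np.random.randn(p))
U = np.random.randn(p, k)
V = np.random.randn(k, p)
A_inv_diag = 1. / np.diag(A)
# Column scaling of V equals right-multiplying by the diagonal inverse.
assert np.allclose(V * A_inv_diag, V @ np.diag(A_inv_diag))
# Row scaling of U equals left-multiplying by the diagonal inverse.
assert np.allclose(A_inv_diag.reshape(-1, 1) * U, np.diag(A_inv_diag) @ U)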

I re-ran your timings on my (slightly faster) machine and produced the following plot:

[Plot: Woodbury timings, naive vs. broadcasting implementations]

Much clearer performance distinction!
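
For completeness, the sparse-diagonal route mentioned above might look something like the following sketch (using scipy.sparse.diags; I haven't timed this variant):

import numpy as np
from scipy.sparse import diags

def sparse_woodbury(A, U, V, k):
    A_inv = diags(1. / np.diag(A))   # sparse p x p diagonal matrix
    Ainv_U = A_inv @ U               # O(p*k): row-scales U
    V_Ainv = (A_inv @ V.T).T         # O(p*k): V @ A_inv, since A_inv is symmetric
    B_inv = np.linalg.inv(np.eye(k) + V_Ainv @ U)
    return A_inv.toarray() - Ainv_U @ B_inv @ V_Ainv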

  • As Yakym mentions in the comments above, you can discover better access patterns for the actual matrix multiplications. Haven’t used einsum_path before but looks interesting. – Chris Swierczewski Dec 01 '18 at 01:08
  • Very nice. I was just using the knowledge of the diagonal to compute the matrix inverse quickly, but using broadcasting makes a lot of sense. I edited my question to compare a method that uses `einsum` vs. this method (`woodbury_so`). You can see that just avoiding matrix multiplication when possible is faster than optimizing those operations. – jds Dec 01 '18 at 15:29