
If I have N independent optimization problems, would it in general be faster to solve each one independently or to combine them into a single problem? For a given precision p and reasonable starting points, Newton's method needs O(log p) iterations, so its cost is roughly log(p) times the per-iteration cost (computing the derivative ratio, i.e. the update). If the iteration count of multivariate Newton-Raphson does not scale with the number of dimensions, then the cost at a given p still comes down to the gradient/Hessian calculations and the linear solver. I'm asking mostly for the sake of code efficiency: I want to fit likelihood functions to subsets of data with either per-subset regularization or global regularization. If the two approaches are similar in cost, I could implement one Newton optimizer that takes the various regularization penalties as functional arguments and switches between sparse and dense solvers when appropriate.
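
For concreteness, here is a rough sketch of the single optimizer I have in mind; the callables (grad_fn, hess_fn, penalty_grad, penalty_hess) and the sparse flag are just placeholder names, not existing code:

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

def newton_step(theta, grad_fn, hess_fn, penalty_grad=None, penalty_hess=None,
                sparse=False):
    # gradient and Hessian of the negative log-likelihood at theta;
    # hess_fn returns a scipy.sparse matrix when sparse=True, a dense array otherwise
    g = grad_fn(theta)
    H = hess_fn(theta)
    # regularization enters only through these optional callables
    if penalty_grad is not None:
        g = g + penalty_grad(theta)
    if penalty_hess is not None:
        H = H + penalty_hess(theta)
    # per-subset penalties keep H block diagonal -> sparse solve;
    # a global penalty couples the subsets -> dense solve
    if sparse:
        step = scipy.sparse.linalg.spsolve(scipy.sparse.csr_matrix(H), g)
    else:
        step = np.linalg.solve(np.asarray(H), g)
    return theta - step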

Below, the sparse solver beats the individual solves in a loop, but I wonder whether that is just Python loop overhead (would I get better loop results using Cython?); a batched np.linalg.solve sketch that avoids the Python loop entirely follows the timings below. The conversion from the matrix stack to a block-diagonal matrix is costly, but irrelevant to my implementation: I fill pre-existing gradient/Hessian arrays on each iteration, so the calculation time should be the same either way. My IDE froze when I passed a dense version of bigA to np.linalg.solve, which raises a separate problem: I will probably need a more efficient way to solve the dense systems that result from global regularization.

import timeit

setup = '''
import numpy as np; import scipy as sp
import scipy.sparse; import scipy.sparse.linalg
A = np.random.random((1000,10,10))
b = np.random.random((1000,10))
bigA = sp.sparse.block_diag(A, format = "csr")
bigb = b.flatten()
'''

expr1 = 'for i in range(1000): np.linalg.solve(A[i,...], b[i,...])'
expr2 = 'sp.sparse.linalg.spsolve(bigA, bigb)'

timeit.timeit(expr1, setup, number = 100)
# 1.2879039069994178
timeit.timeit(expr2, setup, number = 100)
# 0.45410968599753687

# timing with the setup (matrix construction / conversion) included
imports = '''
import numpy as np
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg
'''
setup1 = '''
A = np.random.random((1000,10,10))
b = np.random.random((1000,10))
for i in range(1000): 
    np.linalg.solve(A[i,...], b[i,...])
'''
setup2 = '''
A = np.random.random((1000,10,10))
b = np.random.random((1000,10))
bigA = sp.sparse.block_diag(A, format = "csr")
bigb = b.flatten()
sp.sparse.linalg.spsolve(bigA, bigb)
'''
timeit.timeit(setup1, imports, number = 100)
# 1.355804075999913
timeit.timeit(setup2, imports, number = 100)
# 24.209087412000372
sol1 = np.empty(10000)  # np.empty needs an integer size, not 1e4
u = 0
for i in range(1000):   
    sol1[u:u+10] = np.linalg.solve(A[i,...], b[i,...])
    u += 10
sol2 = sp.sparse.linalg.spsolve(bigA, bigb)
np.sum((sol1 - sol2)**2)
# 2.49782849627e-14
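
For reference, np.linalg.solve broadcasts over stacked matrices, so all of the per-problem solves can also be done in one vectorized call with no Python loop at all. A sketch (the trailing axis makes the "stack of column vectors" interpretation explicit):

import numpy as np

A = np.random.random((1000, 10, 10))
b = np.random.random((1000, 10))

# one broadcast call solves all 1000 small systems at once
x = np.linalg.solve(A, b[..., None])[..., 0]    # shape (1000, 10)

# matches the per-system loop
x_loop = np.stack([np.linalg.solve(A[i], b[i]) for i in range(1000)])
print(np.allclose(x, x_loop))    # True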

1 Answer


You have to run K fully independent Newton methods of dimension N, and you are asking whether they would complete faster if you combined their linear equation systems into a single large system (N*K variables and equations) at each step.

Ideally, there should be no difference in performance. However, the following points should be considered:

  1. You have to use some sort of sparse solver for the combined problem, otherwise it will be much slower. The solver must be able to detect the block-diagonal structure of the matrix and process each block independently; otherwise the linear solve takes O(K^3 N^3) time instead of O(K N^3).
  2. Always avoid storing the full combined matrix densely in memory; otherwise the combined approach spends O(K^2 N^2) time just checking that the off-diagonal blocks are zero. So you should create the matrix directly in some block-diagonal storage structure (see the BSR sketch after this list).
  3. Even if you satisfy the first two points, the combined approach would still have O(K N^2) memory footprint, while the single-problem approach has only O(N^2) memory footprint. If O(N^2) memory fits into some cache/RAM while O(K N^2) does not, then the combined approach would be slower.
  4. The iterative methods may require different numbers of iterations to converge to the desired precision. With the combined approach, you have to shrink the problem on each iteration (dropping sub-problems that have already converged) to benefit from this; with the single-problem approach, the savings happen automatically.
  5. With the combined problem it is also harder to benefit from parallelization: you have to check how well the parallelization inside the sparse solver scales with the number of cores. In the single-problem case you can trivially get perfect parallelization by distributing the per-problem loop over all the cores, although you might have a hard time doing that parallelization in Python.
  6. There is some overhead for using small vectors and matrices even in pure C. In numpy it is so large that there are even alternative implementations of numpy optimized for small arrays. If your systems really have about 10 variables, this overhead may be big enough that combining the problems is actually worth it.
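
As an illustration of point 2, one way to build the block-diagonal matrix directly from the (K, N, N) stack, without the expensive conversion through block_diag, is the BSR format, whose internal layout is exactly a stack of dense blocks (a sketch, using the dimensions from the question):

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

K, N = 1000, 10
A = np.random.random((K, N, N))
b = np.random.random((K, N))

# BSR stores a stack of dense N x N blocks plus block-column indices and
# block-row pointers; for a block-diagonal matrix these are plain ranges,
# so the stack A is reused as-is with no per-block Python loop
bigA = scipy.sparse.bsr_matrix((A, np.arange(K), np.arange(K + 1)),
                               shape=(K * N, K * N))

x = scipy.sparse.linalg.spsolve(bigA.tocsr(), b.ravel())
x_loop = np.stack([np.linalg.solve(A[i], b[i]) for i in range(K)])
print(np.allclose(x.reshape(K, N), x_loop))    # True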

So, as you can see, combining the problems is not a good idea in general, and solving many small problems is a specifically bad idea in Python. I strongly suggest switching from Python to C/C++ to avoid such silly overheads. You can then write the LU factorization yourself; it is not hard. If you use a code generator to produce an unrolled LU factorization for a fixed dimension, you will get a great performance improvement (a minimal sketch of the unrolled idea, for a Cholesky factorization, follows).
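
To show what "unrolled for a fixed dimension" means, here is a minimal Python sketch of a fully unrolled 3x3 Cholesky factorization; the actual speedup only appears when the same straight-line code is generated in C/C++:

import numpy as np

def cholesky3(a):
    # straight-line Cholesky of a 3x3 SPD matrix: no loops, no index
    # arithmetic at run time, every access is a fixed offset
    l00 = np.sqrt(a[0, 0])
    l10 = a[1, 0] / l00
    l20 = a[2, 0] / l00
    l11 = np.sqrt(a[1, 1] - l10 * l10)
    l21 = (a[2, 1] - l20 * l10) / l11
    l22 = np.sqrt(a[2, 2] - l20 * l20 - l21 * l21)
    return np.array([[l00, 0.0, 0.0],
                     [l10, l11, 0.0],
                     [l20, l21, l22]])

m = np.random.random((3, 3))
spd = m @ m.T + 3 * np.eye(3)    # well-conditioned SPD test matrix
print(np.allclose(cholesky3(spd), np.linalg.cholesky(spd)))    # True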

stgatilov
  • Thanks for the reply. While I eventually realized your first two points, I never considered the latter four. The expected subproblems likely have >1000 variables, with both within- and between-problem regularization that generally results in a dense matrix. I originally asked this question when I was thinking about a single optimizer for various problem sizes and regularization structures. After thinking more about the expected data sets, I'm currently writing a stochastic quasi-Newton optimizer that uses a BFGS-like Hessian approximation in Python/Cython, so memory is no longer an issue. – deasmhumnha Apr 11 '17 at 03:58
  • "switch from Python to C/C++": there's also [Numba](http://numba.pydata.org/), see also [SO questions/tagged/numba](https://stackoverflow.com/questions/tagged/numba) – denis Jan 19 '19 at 12:04