
I need to minimize a sum of squares between two large (10000 by 40000) matrices: Σ(X−A)², where X = [W1·X1, W2·X2] is the column-wise concatenation of two 10000 by 20000 matrices, each weighted by its own scalar weight.

There is also a constraint that the weights must sum to 1 (W1 + W2 = 1). I'm currently using the SLSQP method in scipy.optimize to get the optimal weight values, but it takes about 10 minutes on my computer. Is there a better way to do this that wouldn't take so long? I've attached the code I'm using below.

from scipy.optimize import minimize
import numpy

def objective(W,X1,X2,A):
    W1=W[0]
    W2=W[1]
    X=numpy.hstack((W1*X1,W2*X2))
    return numpy.sum((X-A)**2)

def constraint1(W):
    W1=W[0]
    W2=W[1]
    return W1+W2-1

x0=[[1,0]]
cons = {'type': 'eq', 'fun':constraint1}

#Random data only used for purposes of example   
segment_1 = numpy.random.rand(10000, 20000)
segment_2 = numpy.random.rand(10000, 20000)
A = numpy.random.rand(10000, 40000)

sol=minimize(objective,x0[0],args=(segment_1,segment_2,A),method='SLSQP',constraints=cons)

2 Answers


Intro

Okay... there are tons of potential improvements and I'm only taking the path of least resistance (due to being lazy and because information is sparse).

If you care about performance so much, think about how to calculate gradients fast (symbolically / algebraically; no hidden numerical differentiation as is being done in your SLSQP-based code).

Hint: compare SLSQP's nfev with nit to see the overhead of numerical differentiation!
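
For illustration, here is a minimal sketch (reusing the names from the question's code; the formulas are just the derivatives of the sum of squares w.r.t. each weight) of how an analytic gradient could be passed via jac= to avoid those extra evaluations:

def gradient(W, X1, X2, A):
    # d/dW1 sum((W1*X1 - A1)^2) = 2 * sum(X1 * (W1*X1 - A1)); analogously for W2
    A1 = A[:, :X1.shape[1]]
    A2 = A[:, X1.shape[1]:]
    dW1 = 2.0 * numpy.sum(X1 * (W[0] * X1 - A1))
    dW2 = 2.0 * numpy.sum(X2 * (W[1] * X2 - A2))
    return numpy.array([dW1, dW2])

sol = minimize(objective, x0[0], args=(segment_1, segment_2, A),
               method='SLSQP', jac=gradient, constraints=cons)
print(sol.nit, sol.nfev)  # nfev should now stay close to nit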

Approach

  • You are only interested in optimizing a convex combination over two variables
    • This is easily mapped onto scalar optimization
    • Optimize W1 (bounded to [0, 1]); W2 = 1 − W1 is implied!

Code (unoptimized hack)

Original parts

from scipy.optimize import minimize
import numpy
numpy.random.seed(0)
from time import perf_counter as pc

"""
ORIGINAL CODE
"""

def objective(W,X1,X2,A):
    W1=W[0]
    W2=W[1]
    X=numpy.hstack((W1*X1,W2*X2))
    return numpy.sum((X-A)**2)

def constraint1(W):
    W1=W[0]
    W2=W[1]
    return W1+W2-1

x0=[[1,0]]
cons = {'type': 'eq', 'fun':constraint1}

#Random data only used for purposes of example   
segment_1 = numpy.random.rand(1000, 20000)
segment_2 = numpy.random.rand(1000, 20000)
A = numpy.random.rand(1000, 40000)

# MODIFICATION: make instance more interesting! != ~0.5/0.5 solution
segment_1 *= 1.3

start_time = pc()
sol = minimize(objective,x0[0],args=(segment_1,segment_2,A),method='SLSQP',constraints=cons)
end_time = pc()
print(sol)
print('secs: {}'.format(end_time - start_time))

Modified parts (same source/file as above)

"""
ALTERNATIVE CODE hacked around ORIGINAL CODE
"""

from scipy.optimize import minimize_scalar
def solve_alternative(objective_func, segment_1, segment_2, A):
  objective_func_wrapper = lambda x: objective_func(
    (x, 1.0-x), segment_1, segment_2, A)

  x = minimize_scalar(objective_func_wrapper, method='Bounded', bounds=(0, 1))
  return x

start_time = pc()
sol = solve_alternative(objective, segment_1, segment_2, A)
end_time = pc()
print(sol)
print('secs: {}'.format(end_time - start_time))

Example output (dimensions scaled down because this was run on a tablet-like device)

    fun: 6280656.612055224
    jac: array([-2736633.5  , -2736633.375])
message: 'Optimization terminated successfully.'
    nfev: 19
    nit: 3
    njev: 3
  status: 0
success: True
      x: array([0.45541614, 0.54458386])
secs: 32.6327816

    fun: 6280656.612055217
message: 'Solution found.'
    nfev: 6
  status: 0
success: True
      x: 0.45541616584551015
secs: 9.930487200000002
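
To map the scalar result back to the original weight pair (a short usage note; sol is the return value of solve_alternative above):

W1 = sol.x        # minimize_scalar returns a plain scalar here
W2 = 1.0 - W1     # implied by the constraint W1 + W2 = 1
print(W1, W2)     # ~0.4554, ~0.5446 for the example above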
  • Thanks for the reply, would this approach change much if the matrix X was composed of 4 variables of size (10000,10000) instead of two variables of size (10000,20000)? – user1361488 Jan 02 '21 at 13:51
  • Yes. It wouldn't work as the mapping we have done only reduced the dimensionality `N -> N-1 = 2 -> 1` to obtain scalar-optimization. With 4, you would need multivariate-optimization again. – sascha Jan 02 '21 at 13:53
  • Ok, is slsqp the best approach to use, in this case, for multivariate-optimization? – user1361488 Jan 02 '21 at 13:57
  • No. More or less the worst. – sascha Jan 02 '21 at 13:57

While I exploited the fact that there are only two variables in the other answer, here we focus on efficient function evaluation!

Exploiting the inherent simplicity of the objective allows reusing your original SLSQP-based optimization while staying ready for additional segments / variables in the future (as indicated in a comment), as long as the structure stays the same (a generalization to more segments is sketched at the end of this answer).

The total optimization cost should then be approximately equal to the cost of a single function evaluation!

Idea

  • The original function is about as inefficient as possible due to memory allocation and array copies (e.g. np.hstack())
    • Could be improved by storage-order parallel evaluation without memory ops (see code)
  • In general very costly to evaluate due to the huge data involved (in each iteration of some optimizer)
    • Reformulation possible!
      • Do as many calculations as possible a priori, so that the per-iteration cost depends only on the optimization variables!

The reformulation basically follows this expansion from WolframAlpha:

Remark:

  • The task is presented in matrix-form, but it's really just a flattened / element-wise calculation
    • This has been exploited directly before asking WolframAlpha for ideas: see input!
  • For comparison:
    • W1 = x
    • W2 = y
    • X1 = v_i (but 1d)
    • X2 = w_i (but 1d)
    • A = a_i and b_i (decomposed + 1d)

[WolframAlpha expansion]
sum_i (x*v_i - a_i)^2 + (y*w_i - b_i)^2
  = sum_i a_i^2 + sum_i b_i^2 - 2x * sum_i a_i*v_i - 2y * sum_i b_i*w_i + x^2 * sum_i v_i^2 + y^2 * sum_i w_i^2
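
A quick numeric sanity check of that expansion on small random vectors (purely illustrative; the names are arbitrary):

import numpy as np
np.random.seed(1)
x, y = 0.3, 0.7
v, w, a, b = (np.random.rand(5) for _ in range(4))
lhs = np.sum((x*v - a)**2) + np.sum((y*w - b)**2)
rhs = (a @ a + b @ b) - 2*x*(a @ v) - 2*y*(b @ w) + x**2*(v @ v) + y**2*(w @ w)
assert np.isclose(lhs, rhs)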

Code

import numpy as np
from scipy.optimize import minimize
from time import perf_counter as pc
np.random.seed(0)

# random fake-data
# ################
segment_1 = np.random.rand(5000, 10000) * 7.13
segment_2 = np.random.rand(5000, 10000) * np.random.normal(size=(5000, 10000))
A = np.random.rand(5000, 20000)

# constraint: probability-simplex
# ###############################
def constraint(x):
    return np.sum(x) - 1.

# objective
# #########

# original -> very inefficient due to lots of potential memcopy
# -------------------------------------------------------------
def objective(x):
    W1=x[0]
    W2=x[1]
    X=np.hstack((W1*segment_1, W2*segment_2))
    return np.sum((X-A)**2)

# modified -> (hopefully) no memory-allocation at all; (hopefully) storage-order parallel iteration 
# -------------------------------------------------------------------------------------------------
def objective_perf(x):
    return np.sum( ((x[0] * segment_1) - A[:, :segment_1.shape[1]])**2 ) \
        +  np.sum( ((x[1] * segment_2) - A[:, segment_1.shape[1]:])**2 )

# heavily reformulated
# ####################

start_time = pc()

# pre-calc: flatten out matrices as we are doing element-wise reasoning anyway
flat_A_segment_A = A[:, :segment_1.shape[1]].ravel()
flat_A_segment_B = A[:, segment_1.shape[1]:].ravel()
flat_segment_A = segment_1.ravel()
flat_segment_B = segment_2.ravel()

# pre-calc: sub-expressions (see WolframAlpha image!) / sum_squares(vec) = np.dot(vec, vec)
comp_0 = np.dot(flat_A_segment_A, flat_A_segment_A) + np.dot(flat_A_segment_B, flat_A_segment_B)
comp_1 = -2 * np.dot(flat_A_segment_A, flat_segment_A)
comp_2 = -2 * np.dot(flat_A_segment_B, flat_segment_B)
comp_3 = np.dot(flat_segment_A, flat_segment_A)
comp_4 = np.dot(flat_segment_B, flat_segment_B)

end_time = pc()
print('pre-calc secs: {}\n'.format(end_time - start_time))

# pre-calc based function-eval / basically *for free*
def objective_high_perf(x):
    return comp_0 + x[0] * comp_1 + x[1] * comp_2 + x[0]**2 * comp_3 + x[1]**2 * comp_4

# SLSQP-based solving
# -------------------

cons = {'type': 'eq', 'fun': constraint}
x0 = [1.0, 0.0]

print('-----\nVariant: "objective"\n-----')
start_time = pc()
sol = minimize(objective, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))

print('-----\nVariant: "objective_perf"\n-----')
start_time = pc()
sol = minimize(objective_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))

print('-----\nVariant: "objective_high_perf"\n-----')
start_time = pc()
sol = minimize(objective_high_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))

Output (not on full data due to memory limits on a tablet-like device)

pre-calc secs: 1.1280025999999999

-----
Variant: "objective"
-----
    fun: 37044840.62293503
    jac: array([29253964., 29253786.])
message: 'Optimization terminated successfully'
    nfev: 16
    nit: 2
    njev: 2
  status: 0
success: True
      x: array([0.12245548, 0.87754452])
secs: 49.2468216

-----
Variant: "objective_perf"
-----
    fun: 37044840.62293503
    jac: array([29253964., 29253786.])
message: 'Optimization terminated successfully'
    nfev: 16
    nit: 2
    njev: 2
  status: 0
success: True
      x: array([0.12245548, 0.87754452])
secs: 49.036501799999996

-----
Variant: "objective_high_perf"
-----
    fun: 37044840.622934975
    jac: array([29253784. , 29253777.5])
message: 'Optimization terminated successfully'
    nfev: 15
    nit: 2
    njev: 2
  status: 0
success: True
      x: array([0.12245547, 0.87754453])
secs: 0.010043600000003039

Expectation

I would guess your 10-minute run should take < 10 secs now.

In my example, ~50 secs were reduced to ~1.13 + ~0.01 ≈ 1.14 secs.
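
As asked in the comments under the other answer: with more than two segments the same precomputation idea carries over. A sketch (assuming A is split column-wise to match each segment's width; the function names here are just illustrative):

import numpy as np
from scipy.optimize import minimize

def precompute(segments, A):
    # constant term plus one (linear, quadratic) coefficient pair per segment
    const, comps, col = 0.0, [], 0
    for S in segments:
        a = A[:, col:col + S.shape[1]].ravel()
        s = S.ravel()
        const += np.dot(a, a)
        comps.append((-2.0 * np.dot(a, s), np.dot(s, s)))
        col += S.shape[1]
    return const, comps

def objective_n(w, const, comps):
    # const + sum_k (w_k * lin_k + w_k^2 * quad_k)
    return const + sum(wk * lin + wk**2 * quad
                       for wk, (lin, quad) in zip(w, comps))

segments = [segment_1, segment_2]   # extend with more segments as needed
const, comps = precompute(segments, A)
cons = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}
sol = minimize(objective_n, x0=np.full(len(segments), 1.0 / len(segments)),
               args=(const, comps), method='SLSQP', constraints=cons)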

  • Great answer, thanks, one more thing: in the wolfram alpha equation, why do you need three variables each to describe X1,X2,A,B. i.e. v0,v1,v2,w0,w1,w2,a0,a1,a2,b0,b1,b2? – user1361488 Jan 03 '21 at 20:05
  • I don't need those. I just used them as a hard-coded / fixed-dimension example to reason about the reformulations proposed automatically. So the general idea is just: more than one, but not too many (as there is a limit on the service). – sascha Jan 04 '21 at 00:53
  • Ok, I tested the high performance answer on real-world data and it works like a charm. Thanks again for your help. – user1361488 Jan 04 '21 at 08:53