
Problem

Say I have two arrays with the following shapes:

  • y.shape is (z, b). Picture this as a collection of z (b,) y vectors.
  • x.shape is (z, b, c). Picture this as a collection of z (b, c) multivariate x matrices.

My intent is to find the z independent vectors of least-squares coefficients. That is, the first solution comes from regressing y[0] on x[0], where those inputs have shape (b,) and (b, c) respectively (b observations, c features). The result would have shape (z, c).

Some example data

np.random.seed(123)
x = np.random.randn(190, 20, 3)
y = np.random.randn(190, 20)  # Assumes no intercept term

# First vector of coefficients
np.linalg.lstsq(x[0], y[0])[0]
# array([-0.12823781, -0.3055392 ,  0.11602805])

# Last vector of coefficients
np.linalg.lstsq(x[-1], y[-1])[0]
# array([-0.02777503, -0.20425779,  0.22874169])

NumPy's least-squares solver lstsq can't operate on these higher-dimensional inputs directly. (My intended result has shape (190, 3): 190 vectors of 3 coefficients each, where each (3,) vector is one coefficient set from a regression with n=20.)
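
For reference, the obvious workaround (which I'd like to avoid if possible) is a Python-level loop over the first axis, something like:

# Looped baseline: one lstsq call per (20, 3) slice, stacked into a (190, 3) result
coefs_looped = np.empty((x.shape[0], x.shape[2]))
for i in range(x.shape[0]):
    coefs_looped[i] = np.linalg.lstsq(x[i], y[i])[0]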

Is there a workaround to get the coefficient vectors wrapped into one result array? I'm thinking possibly of the matrix formulation:

β = (XᵀX)⁻¹ Xᵀy

For a 1d y and 2d x this would just be:

def coefs(y, x):
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))

but I'm having trouble getting this to accept a 2d y and 3d x as above.
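
On a single slice this agrees with lstsq, since x[0] has full column rank and the normal-equations solution then coincides with the least-squares solution:

coefs(y[0], x[0])
# should match the lstsq output above: array([-0.12823781, -0.3055392 ,  0.11602805])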

Lastly, I'm curious as to why lstsq has trouble here. Is there a simple answer as to why the inputs must be at most 2d?

Brad Solomon
  • I think you should focus more on explaining your task, not throwing shapes and target-shapes around. This image also feels out of context. Maybe add some small example too. If you can build your objective as a function, you can always use scipy's leastsq, or maybe least_squares. But knowing the task would help to analyze the lstsq approach. – sascha Oct 09 '17 at 19:17
  • @sascha edited, let me know if I'm still being unclear – Brad Solomon Oct 09 '17 at 19:27
  • It seems not that hard to transform this into something which can be used by lstsq, but I don't recommend it. If these are *independent regressions*, why not just calculate them independently? Combining these does not help computationally (more the opposite, especially when using a dense solver) and only makes things more complex. So where does this need / idea come from? – sascha Oct 09 '17 at 20:15
  • @sascha comes mainly from the fact that looping through `z` calls to `lstsq` seems inefficient compared to something like getting `coefs` above to operate elementwise on 2d/3d arrays. But not knowing internals well at all, I could be wrong. – Brad Solomon Oct 09 '17 at 20:27
  • Imho: your problem here is the set of assumptions taken by lstsq. It's built for a specific case and yours is not compatible per se. You would need to build this information into one big A matrix (introducing a lot of zeros), which will be much slower, even using a sparse solver (lstsq is a dense solver). While Python looping is not fast, the runtime here will be dominated by the cost of the inner lstsq calls. – sascha Oct 09 '17 at 20:32

2 Answers


Here is a small demo to demonstrate:

  • the problems mentioned in my comments
  • a mostly empirical analysis of looped lstsq vs. a single embedded lstsq call
  • a somewhat surprising result at the end, which is to be taken with a grain of salt

Code

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg  # explicit import so sp.linalg.lsmr (used further below) is available
from sklearn.datasets import make_regression
from time import perf_counter as pc
np.set_printoptions(edgeitems=3,infstr='inf',
                    linewidth=160, nanstr='nan', precision=1,
                    suppress=False, threshold=1000, formatter=None)

""" Create task """
Z, B, C = 4, 3, 2

Zs = []
Bs = []
for i in range(Z):
    X, y = make_regression(n_samples=B, n_features=C, random_state=i)
    Zs.append(X)
    Bs.append(y)
Zs = np.array(Zs)
Bs = np.array(Bs)

""" Independent looping """
print('LOOPED CALLS')
start = pc()
result = np.empty((Z, C))
for z in range(Z):
    result[z] = np.linalg.lstsq(Zs[z], Bs[z])[0]
end = pc()
print('lhs-shape: ', Zs.shape)
print('lhs-dense-fill-ratio: ', np.count_nonzero(Zs) / np.product(Zs.shape))
print('used time: ', end-start)
print(result)

""" Embedding in one """
print('EMBEDDING INTO ONE CALL')
Zs_ = sp.block_diag([Zs[i] for i in range(Z)]).todense()  # convenient to use scipy.sparse
                                                          # oops: there is a dense-one too: 
                                                          # -> scipy.linalg.block_diag
Bs_ = Bs.flatten()

start = pc()  # one could argue whether the transform above should be timed too!
result_ = np.linalg.lstsq(Zs_, Bs_)[0]
end = pc()
print('lhs-shape: ', Zs_.shape)
print('lhs-dense-fill-ratio: ', np.count_nonzero(Zs_) / np.product(Zs_.shape))
print('used time: ', end-start)
print(result_)

Output

LOOPED CALLS
lhs-shape:  (4, 3, 2)
lhs-dense-fill-ratio:  1.0
used time:  0.0005415275241778155
[[ 89.2  43.8]
 [ 68.5  41.9]
 [ 61.9  20.5]
 [  5.1  44.1]]
EMBEDDING INTO ONE CALL
lhs-shape:  (12, 8)
lhs-dense-fill-ratio:  0.25
used time:  0.00015907748341232328
[ 89.2  43.8  68.5  41.9  61.9  20.5   5.1  44.1]

lstsq problem-dimensions for each case

While the original data looks like:

[[[ 2.2  1. ]
  [-1.   1.9]
  [ 0.4  1.8]]

 [[-1.1 -0.5]
  [-2.3  0.9]
  [-0.6  1.6]]

 [[ 1.6 -2.1]
  [-0.1 -0.4]
  [-0.8 -1.8]]

 [[-0.3 -0.4]
  [ 0.1 -1.9]
  [ 1.8  0.4]]]
[[ 242.7   -5.4  112.9]
 [ -95.7 -121.4   26.2]
 [  57.9  -12.   -88.8]
 [ -17.1  -81.6   28.4]]

and each solve looks like:

LHS
[[ 2.2  1. ]
 [-1.   1.9]
 [ 0.4  1.8]]
RHS
[ 242.7   -5.4  112.9]

the embedded problem (one solving-step) looks like:

LHS
[[ 2.2  1.   0.   0.   0.   0.   0.   0. ]
 [-1.   1.9  0.   0.   0.   0.   0.   0. ]
 [ 0.4  1.8  0.   0.   0.   0.   0.   0. ]
 [ 0.   0.  -1.1 -0.5  0.   0.   0.   0. ]
 [ 0.   0.  -2.3  0.9  0.   0.   0.   0. ]
 [ 0.   0.  -0.6  1.6  0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   1.6 -2.1  0.   0. ]
 [ 0.   0.   0.   0.  -0.1 -0.4  0.   0. ]
 [ 0.   0.   0.   0.  -0.8 -1.8  0.   0. ]
 [ 0.   0.   0.   0.   0.   0.  -0.3 -0.4]
 [ 0.   0.   0.   0.   0.   0.   0.1 -1.9]
 [ 0.   0.   0.   0.   0.   0.   1.8  0.4]]
RHS
[ 242.7   -5.4  112.9  -95.7 -121.4   26.2   57.9  -12.   -88.8  -17.1  -81.6   28.4]

Given the assumptions / standard form of lstsq, there is no way to embed this independence assumption without introducing a lot of zeros!

lstsq is:

  • not able to exploit sparsity, as the core algorithm is a dense one
    • take a look at the transformed shape: this will be heavy in terms of memory and computation!
  • not able to use information from fit 0 to speed up something in fit 1
    • they are independent after all; no information gain in theory
  • able to vectorize a lot (but that's not helping in general)

Your example-shapes

Trimmed output for your specific shapes; this time a sparse solver is tested too:

Added code (at the end)

print('EMBEDDING -> sparse-solver')
Zs_ = sp.csc_matrix(Zs_)  # sparse!
start = pc()
result__ = sp.linalg.lsmr(Zs_, Bs_)[0]
end = pc()
print('lhs-shape: ', Zs_.shape)
print('lhs-dense-fill-ratio: ', Zs_.nnz / np.product(Zs_.shape))
print('used time: ', end-start)
print(result__)

Output

LOOPED CALLS
lhs-shape:  (190, 20, 3)
lhs-dense-fill-ratio:  1.0
used time:  0.01716980329027777

[ 11.9  31.8  29.6]
...
[ 44.8  28.2  62.3]]


EMBEDDING INTO ONE CALL
lhs-shape:  (3800, 570)
lhs-dense-fill-ratio:  0.00526315789474
used time:  0.6774500271820254
[ 11.9  31.8  29.6 ... 44.8  28.2  62.3]

EMBEDDING -> sparse-solver
lhs-shape:  (3800, 570)
lhs-dense-fill-ratio:  0.00526315789474
used time:  0.0038423098412817547            # a bit of a surprise
[ 11.9  31.8  29.6 ...  44.8  28.2  62.3]

Conclusion

In general: solve independently!

In some cases, the task above will be solved faster using the sparse-solver approach, but analysis here is hard as we are comparing two completely different algorithms (direct vs. iterative), and the results might change dramatically for other data.

sascha
  • Bounty pts. coming your way when I am able. I do have one lingering question. Is the matrix method mentioned in my question feasible for expanded-dimension inputs like these? I'm thinking, for instance, of how [`np.matmul`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.matmul.html) is able to treat 3d inputs as stacks of 2d matrices and that the `coefs` method defined in my question might work if I manipulate the shape of the inputs a bit. Just a shot in the dark. – Brad Solomon Oct 09 '17 at 23:37
  • That's hard to answer and the internals are complex. The hidden magic of lstsq is singular value decomposition, which again is built on the standard form of ```||Ax - b||```. This part, the SVD, is the most important step and we can't do much here using this approach. Embedding just hurts here, as it is still dense and we increase the dimensions. To summarize: lstsq is not just some matrix-vector or vector-vector products; it explicitly calls [dgelsd](http://www.netlib.org/lapack/explore-3.1.1-html/dgelsd.f.html). Now I don't know what kind of other closed-form approach you are looking for. – sascha Oct 09 '17 at 23:45
  • Maybe there is some alternative form with its own advantages and disadvantages. But in general: we already know a decomposition / independent calculations. Every tuned approach will use that fact! (In the experiment above, I would expect the iterative approach used in the sparse case to be even better when using the loop approach, for some not-tiny instances.) Everything somewhat depends on your data and exact task. If an outer loop is hurting, you should look into cython and co. to remove that bottleneck. But I highly recommend benchmarking first to make sure nothing irrelevant is optimized. – sascha Oct 09 '17 at 23:47
  • For the single lstsq (dense; not embedded) instance using direct-algorithms, the SVD-approach is probably as fast as it gets (LAPACK is highly optimized: algorithms, SIMD, caching). There might be other algorithms more suited for other data (as slightly indicated by the sparse-experiment above). But that needs more analysis. – sascha Oct 09 '17 at 23:50
  • I also don't think your matrix approach helps much (it still needs the inefficient embedding). Did you check out [wiki's overview](https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)), which has the same formula (```Inverting the matrix of the normal equations```, including some comments about non-optimal empirical performance in some cases), followed by the SVD-based approach (```Orthogonal decomposition methods```) used by numpy? But these approaches are all closed-form approaches. Alternative methods like first-order methods behave differently, and a parallel gradient-calc is a good idea – sascha Oct 09 '17 at 23:56

Here is the linear algebra solution, with the speed right on par with @sascha's looped version for smaller arrays.

print('Matrix formulation')
start = pc()
result = np.squeeze(np.matmul(np.linalg.inv(np.matmul(Zs.swapaxes(1,2), Zs)),
                    np.matmul(Zs.swapaxes(1,2), np.atleast_3d(Bs))))
end = pc()
print('used time: ', end-start)
print(result)

Output:

Matrix formulation
used time:  0.00015713176480858237
[[ 89.2  43.8]
 [ 68.5  41.9]
 [ 61.9  20.5]
 [  5.1  44.1]]
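
As an aside, the explicit inverse can be swapped for a batched np.linalg.solve on the normal equations (np.linalg.solve broadcasts over the leading axis), which is the numerically safer way to write the same idea; a minimal sketch:

# Same normal-equations coefficients, solving instead of explicitly inverting
xtx = np.matmul(Zs.swapaxes(1, 2), Zs)                  # (Z, C, C)
xty = np.matmul(Zs.swapaxes(1, 2), np.atleast_3d(Bs))   # (Z, C, 1)
result_solve = np.linalg.solve(xtx, xty)[..., 0]        # (Z, C)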

However, @sascha's answer wins out easily for much larger inputs, especially as the size of the third dimension grows (number of exogenous variables/features).

Z, B, C = 400, 300, 20

Zs = []
Bs = []
for i in range(Z):
    X, y = make_regression(n_samples=B, n_features=C, random_state=i)
    Zs.append(X)
    Bs.append(y)
Zs = np.array(Zs)
Bs = np.array(Bs)

# --------

print('Matrix formulation')
start = pc()

result = np.squeeze(np.matmul(np.linalg.inv(np.matmul(Zs.swapaxes(1,2), Zs)),
                    np.matmul(Zs.swapaxes(1,2), np.atleast_3d(Bs))))

end = pc()
print('used time: ', end-start)
print(result)

# --------

print('Looped calls')
start = pc()

result = np.empty((Z, C))
for z in range(Z):
    result[z] = np.linalg.lstsq(Zs[z], Bs[z])[0]

end = pc()
print('used time: ', end-start)
print(result)

Output:

Matrix formulation
used time:  0.24000779996413257
[[  1.2e+01   1.3e-15   6.3e+01 ...,  -8.9e-15   5.3e-15  -1.1e-14]
 [  5.8e+01   2.7e-14  -4.8e-15 ...,   8.5e+01  -1.5e-14   1.8e-14]
 [  1.2e+01  -1.2e-14   4.4e-16 ...,   6.0e-15   8.6e+01   6.0e+01]
 ..., 
 [  2.9e-15   6.6e+01   1.1e-15 ...,   9.8e+01  -2.9e-14   8.4e+01]
 [  2.8e+01   6.1e+01  -1.2e-14 ...,  -2.5e-14   6.3e+01   5.9e+01]
 [  7.0e+01   3.3e-16   8.4e+00 ...,   4.1e+01  -6.2e-15   5.8e+01]]
Looped calls
used time:  0.17400113389658145
[[  1.2e+01   7.1e-15   6.3e+01 ...,  -2.8e-14   1.1e-14  -4.8e-14]
 [  5.8e+01  -5.7e-14  -4.9e-14 ...,   8.5e+01  -5.3e-15   6.8e-14]
 [  1.2e+01   3.6e-14   4.5e-14 ...,  -3.6e-15   8.6e+01   6.0e+01]
 ..., 
 [  6.3e-14   6.6e+01  -1.4e-13 ...,   9.8e+01   2.8e-14   8.4e+01]
 [  2.8e+01   6.1e+01  -2.1e-14 ...,  -1.4e-14   6.3e+01   5.9e+01]
 [  7.0e+01  -1.1e-13   8.4e+00 ...,   4.1e+01  -9.4e-14   5.8e+01]]
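
For completeness, an SVD-based batched alternative (in the spirit of the lstsq internals discussed in the comments above) is a stacked pseudo-inverse. This is only a sketch and assumes a NumPy version in which np.linalg.pinv accepts stacks of matrices:

# Batched SVD-based solve via the pseudo-inverse; np.linalg.pinv(Zs) has shape (Z, C, B)
result_pinv = np.matmul(np.linalg.pinv(Zs), Bs[..., None])[..., 0]   # (Z, C)
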
Brad Solomon