
I have an iterative problem where at each iteration I need to perform a series of matrix operations, which I want to speed up. The majority of the calculation involves constant variables, except for one that varies. My question is: is it possible to build a "skeleton" of this operation, so as to speed up the calculation in the iterations after the first? (Somewhat similar to the parametric warm start of an optimization problem in CVXPY.) The code I need to work on currently lives in a dedicated method of a class, and it looks something like this:

import numpy as np

# some constant definitions (in reality attributes of a class)
m = 5501
Y = np.random.rand(10, m)
U = np.random.rand(75, m)
sig_on = 0.15
sig_off = 0.25
a = 15
L = 25

def obtain_A_B(x):
    # for readability I omit internal variable declarations such as a = self.a
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    F = varying_part * np.identity(m) + Y.T @ Y
    Fi = np.linalg.inv(F)
    UFiUTi = np.linalg.inv(U @ Fi @ U.T)
    A = (Fi - Fi @ U.T @ UFiUTi @ U @ Fi) @ Y.T
    B = Fi @ U.T @ UFiUTi
    return A, B

x = np.random.rand(10)
result = obtain_A_B(x)

I am interested in speeding up the computation of A and B; varying_part is a scalar that changes at each iteration depending on x. Since the really time-consuming part here is computing the inverse Fi, speeding up just that computation would already be a great help.

I have already moved the inverses to dedicated variables (as written above) to reduce the number of times they need to be computed - that brought down the execution time from ~41s to ~10s - but it would be best if I could make the execution even faster (that's where the "operation-skeleton" idea comes from).

  • Well, for one you're computing `np.linalg.inv(U @ Fi @ U.T)` twice – actually, `U.T @ np.linalg.inv(U @ Fi @ U.T)` could probably be cached. – AKX Jan 09 '23 at 10:18
  • But as usual: figure out what the slow part is (with a profiler) before trying to optimize things. – AKX Jan 09 '23 at 10:19
  • The time-consuming part is still the inverse - 6s, the rest is ~4s, and computing `np.linalg.inv(U @ Fi @ U.T)` takes ~0.093s (as expected, not the issue). I already worked on this problem in MATLAB and I thoroughly investigated the timing then: with increasing problem size (increasing m), the main issue is the inverse. – Andrea Jan 09 '23 at 10:42
  • Well, you could try [`numba`](https://numba.pydata.org/) since all the function does is Numpy operations... – AKX Jan 09 '23 at 11:05
  • I changed the code so as to make it reproducible while keeping the same logic (feel free to change it if there is anything wrong). As for Numba, it is not very useful here unless one writes faster code than that of an optimized BLAS implementation, which is possible but far from easy. – Jérôme Richard Jan 09 '23 at 22:00

1 Answer


Basic optimizations

First of all, the expression of B is used in A, so B can be computed first and reused in the expression of A.

Moreover, varying_part * np.identity(m) + Y.T @ Y can be computed more efficiently. Indeed, varying_part * np.identity(m) creates two big temporary arrays that are slow to fill, because RAM is slow nowadays compared to the performance of the CPU (see "Memory Wall"). You can just add varying_part to the diagonal of Y.T @ Y instead.

Additionally, it is significantly faster to compute B @ (U @ Fi) than (B @ U) @ Fi (which is what B @ U @ Fi evaluates to). The two are analytically equivalent since matrix multiplication is associative, though the results differ slightly numerically. The reason is the shapes: B @ U is a large m-by-m matrix, so multiplying it by Fi costs O(m**3) operations, whereas U @ Fi is only 75-by-m, so the two products in B @ (U @ Fi) each cost on the order of 75 * m**2 operations.

Here is the optimized implementation:

def obtain_A_B_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    F = Y.T @ Y
    np.fill_diagonal(F, F.diagonal() + varying_part)
    Fi = np.linalg.inv(F)
    tmp = Fi @ U.T
    UFiUTi = np.linalg.inv(U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (U @ Fi)) @ Y.T
    return A, B
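
As a quick sanity check, the optimized version can be compared against the reference implementation on the same input. This is a minimal sketch; the two versions are analytically equivalent but differ slightly numerically because of the different multiplication order:

x = np.random.rand(10)
A_ref, B_ref = obtain_A_B(x)
A_opt, B_opt = obtain_A_B_opt(x)

# Should agree up to floating-point rounding differences
print(np.allclose(A_ref, A_opt), np.allclose(B_ref, B_opt))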

This optimization results in 55% faster code on my machine. Nearly 90% of the time is spent in the np.linalg.inv(F) function call.
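
If you want to verify this on your own machine, here is a minimal timing sketch that measures the inversion separately from the rest of the function (the figures above are from my machine; yours will differ):

import time

def timed_obtain_A_B_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    F = Y.T @ Y
    np.fill_diagonal(F, F.diagonal() + varying_part)

    # Time the inversion on its own: it is expected to dominate
    t0 = time.perf_counter()
    Fi = np.linalg.inv(F)
    t1 = time.perf_counter()

    tmp = Fi @ U.T
    UFiUTi = np.linalg.inv(U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (U @ Fi)) @ Y.T
    t2 = time.perf_counter()

    print(f"inv(F): {t1 - t0:.3f}s, rest: {t2 - t1:.3f}s")
    return A, B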

The bad news is that the matrix inversion can hardly be optimized further on the CPU. The default OpenBLAS library used by Numpy to perform this computation is sub-optimal, but it is relatively good since it runs in parallel using relatively optimized native code. Moreover, changing the diagonal strongly impacts the whole computation, so I do not think it is possible to write an incremental computation of the inverse. Indeed, increasing F[0,0] by 1 forces the weights of all rows to be recomputed in the Gauss-Jordan algorithm (and hence all subsequent weight computations).

It is hard to write much faster code without reducing the floating-point precision or the accuracy of the algorithm (e.g. using an approximation). GPUs can significantly help if you have a server GPU or if it is fine for you to reduce the precision.


GPU-based implementation

One can use GPUs to make this code faster, depending on the target platform and the input floating-point type. Discrete GPUs tend to have significantly more computational power than mainstream CPUs. However, this is not true for double precision on mainstream PC GPUs. Indeed, PC GPUs are optimized for single-precision floating-point data (commonly used in games and 3D applications); for fast double-precision computations, a server-side GPU is needed. Besides, one should keep in mind that data transfers between the CPU and the GPU are expensive, but the matrix inversion is the bottleneck here: it runs in O(m**3), so the transfers are negligible. CuPy can be used to easily write a GPU implementation (for Nvidia GPUs). Here is the resulting GPU code:

import cupy as cp

def obtain_A_B_gpu_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off

    # Send data to the GPU
    device_Y = cp.array(Y)
    device_U = cp.array(U)

    # GPU-based computation
    F = device_Y.T @ device_Y
    cp.fill_diagonal(F, F.diagonal() + varying_part)
    Fi = cp.linalg.inv(F)
    tmp = Fi @ device_U.T
    UFiUTi = cp.linalg.inv(device_U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (device_U @ Fi)) @ device_Y.T

    # Get data back from the GPU
    return A.get(), B.get()

Note that CuPy may be quite slow during the first execution, but it should be significantly faster afterwards (which is fine for an iterative loop here).
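
For example, one can do a throw-away warm-up call before timing the iterative part. This is just a minimal sketch, with random inputs standing in for the real per-iteration x:

import time

# Warm-up call: pays the one-time GPU initialization cost
obtain_A_B_gpu_opt(np.random.rand(10))

start = time.perf_counter()
n_iter = 10
for _ in range(n_iter):
    x = np.random.rand(10)
    A, B = obtain_A_B_gpu_opt(x)
print(f"average time per iteration: {(time.perf_counter() - start) / n_iter:.3f}s")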


Experimental results

Here are results on my machine with an i5-9600KF CPU and an Nvidia 1660 Super GPU (a mainstream discrete PC GPU):

double-precision:
 - obtain_A_B:          4.10 s
 - obtain_A_B_opt:      2.64 s
 - obtain_A_B_gpu_opt:  3.03 s

single-precision:
 - obtain_A_B:          3.17 s
 - obtain_A_B_opt:      2.52 s
 - obtain_A_B_gpu_opt:  0.28 s  <----

One can see that the GPU-based implementation is significantly faster with single-precision floating-point inputs. This is because my GPU is optimized for this. On a computing-server GPU, obtain_A_B_gpu_opt should be significantly faster than the other implementations. Note that cp.linalg.inv still takes >90% of the time on my GPU.
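
For reference, "single precision" above simply means float32 inputs. Here is a minimal sketch of one way to set that up (casting the constant matrices once, before the iterations; adapt it to how your class actually stores them):

# Cast the constant data to single precision once, up front
Y = Y.astype(np.float32)
U = U.astype(np.float32)

x = np.random.rand(10).astype(np.float32)
A, B = obtain_A_B_gpu_opt(x)  # now runs in single precision on the GPU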
