Basic optimizations
First of all, the expression of B is used in the expression of A, so B can be computed first and reused when computing A.
Moreover, varying_part * np.identity(m) + Y.T @ Y can be computed more efficiently. Indeed, varying_part * np.identity(m) creates two big temporary arrays that are slow to fill, because RAM is slow nowadays compared to the CPU (see "Memory Wall"). You can just update the diagonal of Y.T @ Y in place instead.
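Here is a minimal sketch of this diagonal trick (with made-up sizes, since the real m and Y come from the question); both forms produce the same matrix:

import numpy as np

# Made-up sizes for illustration only; the real m and Y come from the question
m = 2000
Y = np.random.rand(3000, m)
varying_part = 2.5

# Original form: allocates and fills two full m x m temporaries
F_slow = varying_part * np.identity(m) + Y.T @ Y

# In-place form: only the m diagonal entries are updated, no big temporaries
F_fast = Y.T @ Y
np.fill_diagonal(F_fast, F_fast.diagonal() + varying_part)

assert np.allclose(F_slow, F_fast)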
Additionally, it is significantly faster to compute B @ (U @ Fi) than (B @ U) @ Fi (which is what B @ U @ Fi evaluates to). Both are analytically equivalent since matrix multiplication is associative, but the size of the intermediate array (and thus the amount of work) differs, and the results can be slightly different numerically.
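To see why the grouping matters, here is a rough sketch with assumed shapes (the real shapes come from B, U and Fi in the question, where U has far fewer rows than Fi has columns):

from timeit import timeit
import numpy as np

# Assumed shapes for illustration only: k is much smaller than m
m, k = 3000, 50
B = np.random.rand(m, k)
U = np.random.rand(k, m)
Fi = np.random.rand(m, m)

# (B @ U) builds a huge (m, m) intermediate before the final product
t_left = timeit(lambda: (B @ U) @ Fi, number=3)
# (U @ Fi) keeps the intermediate small: only (k, m)
t_right = timeit(lambda: B @ (U @ Fi), number=3)
print(t_left, t_right)  # t_right should be much smaller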
Here is the optimized implementation:
def obtain_A_B_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    # Build F = varying_part * I + Y.T @ Y without a big temporary identity matrix
    F = Y.T @ Y
    np.fill_diagonal(F, F.diagonal() + varying_part)
    Fi = np.linalg.inv(F)
    # Compute B first and reuse it in the expression of A
    tmp = Fi @ U.T
    UFiUTi = np.linalg.inv(U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (U @ Fi)) @ Y.T
    return A, B
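A quick way to check the rewrite (assuming the original obtain_A_B from the question is still defined) is to compare the outputs of both versions:

A_ref, B_ref = obtain_A_B(x)      # original implementation from the question
A_opt, B_opt = obtain_A_B_opt(x)
assert np.allclose(A_ref, A_opt)
assert np.allclose(B_ref, B_opt)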
This optimization makes the code about 55% faster on my machine. Nearly 90% of the time is spent in the np.linalg.inv(F) call.
The bad news is that the matrix inversion is very hard to optimize further on the CPU. The default OpenBLAS library used by Numpy for this computation is sub-optimal, but it is still relatively good since it runs in parallel using fairly optimized native code. Moreover, changing the diagonal strongly impacts the computation, so I do not think an incremental computation of the inverse is possible. Indeed, increasing F[0,0] by 1 forces the weights of all rows to be recomputed in the Gauss-Jordan algorithm (and so all subsequent weight computations as well).
It is hard to write much faster code without reducing the floating-point precision or the accuracy of the algorithm (e.g. with approximations). GPUs can help significantly if you have a server GPU or if reducing the precision is acceptable for you.
GPU-based implementation
GPUs can make this code faster, depending on the target platform and the input floating-point type. Discrete GPUs tend to have significantly more computational power than mainstream CPUs. However, this is not true for double precision on mainstream PC GPUs: they are optimized for single-precision floating-point data (commonly used in games and 3D applications), so fast double-precision computations require a server-grade GPU. Besides, one should keep in mind that data transfers between the CPU and the GPU are expensive, but the matrix inversion is the bottleneck and it runs in O(N**3) while the transfers are only O(N**2), so the transfers are negligible here. CuPy can be used to easily write a GPU implementation (for Nvidia GPUs). Here is the resulting GPU code:
import cupy as cp

def obtain_A_B_gpu_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    # Send data to the GPU
    device_Y = cp.array(Y)
    device_U = cp.array(U)
    # GPU-based computation
    F = device_Y.T @ device_Y
    cp.fill_diagonal(F, F.diagonal() + varying_part)
    Fi = cp.linalg.inv(F)
    tmp = Fi @ device_U.T
    UFiUTi = cp.linalg.inv(device_U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (device_U @ Fi)) @ device_Y.T
    # Get data back from the GPU
    return A.get(), B.get()
Note that CuPy may be quite slow during the first execution (due to initialization and kernel compilation overheads), but it should be significantly faster afterwards, which is fine for an iterative loop like here.
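If reducing the precision is acceptable for you, casting the inputs once before the loop is enough to take the single-precision path. A minimal usage sketch (assuming Y, U and x are defined as in the question):

# Optional: cast the inputs to float32 once if reduced precision is acceptable
# (this is what gives the large speedup on mainstream GPUs)
Y = Y.astype(np.float32)
U = U.astype(np.float32)

# Warm-up call: the first execution pays the CuPy initialization/compilation overheads
obtain_A_B_gpu_opt(x)

# Subsequent calls reflect the actual GPU performance
A, B = obtain_A_B_gpu_opt(x)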
Experimental results
Here are the results on my machine with an i5-9600KF CPU and an Nvidia 1660 Super GPU (a mainstream discrete PC GPU):
double-precision:
- obtain_A_B: 4.10 s
- obtain_A_B_opt: 2.64 s
- obtain_A_B_gpu_opt: 3.03 s
single-precision:
- obtain_A_B: 3.17 s
- obtain_A_B_opt: 2.52 s
- obtain_A_B_gpu_opt: 0.28 s <----
One can see that the GPU-based implementation is significantly faster with single-precision floating-point inputs, because my GPU is optimized for that. On a computing server, obtain_A_B_gpu_opt should be significantly faster than the other implementations. Note that cp.linalg.inv still takes more than 90% of the time on my GPU.