I'm doing some vectorized algebra using numpy
and the wall-clock performance of my algorithm seems weird. The program does roughly as follows:
- Create three matrices: Y (KxD), X (NxD), and T (KxN).
- For each row Y[i] of Y:
  - subtract Y[i] from each row of X (by broadcasting),
  - square the differences, sum them along one axis, take the square root, and store the result in T[i] (see the formula below).
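In other words, T ends up holding all pairwise Euclidean distances between the rows of Y and the rows of X (here i indexes rows of Y, j rows of X, and d the feature dimension):

T_{ij} = \sqrt{\sum_{d=1}^{D} \left( X_{jd} - Y_{id} \right)^2}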
However, depending on how I perform the broadcasting, computation speed is vastly different. Consider the code:
import numpy as np
from time import perf_counter

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

if True:  # negate to enable the second loop
    time = 0.0
    for rep in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                    X - Y[i]  # this has dimensions NxD
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(time / 100))
    exit()

if True:
    time = 0.0
    for rep in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]
            T[i] = np.sqrt(np.sum(
                np.square(
                    diff
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast out: {:.3f} s".format(time / 100))
    exit()
Times for each loop are measured individually and averaged over 100 executions. The results:
Broadcast in line: 1.504 s
Broadcast out: 0.438 s
The only difference is that the broadcasting subtraction in the first loop is done in-line inside the expression, while in the second loop I compute the difference first and bind it to diff before the remaining vectorized operations. Why does this make such a difference?
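Just to rule out anything else: both variants fill T with exactly the same values, so the difference is purely in running time. A quick check along these lines passes (T1 and T2 are just throwaway copies of the two variants):

import numpy as np

D, N, K = 128, 3000, 500
X = np.random.rand(N, D)
Y = np.random.rand(K, D)

T1 = np.empty((K, N))
for i in range(K):
    T1[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))   # broadcast in line

T2 = np.empty((K, N))
for i in range(K):
    diff = X - Y[i]                                         # broadcast out
    T2[i] = np.sqrt(np.sum(np.square(diff), axis=1))

assert np.array_equal(T1, T2)  # identical values, only the timing differs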
My system configuration:
- Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
- Xubuntu 18.04.2 LTS (bionic)
- Python 3.7.3 (GCC 7.3.0)
- Numpy 1.16.3 linked against OpenBLAS (that's as much as np.__config__.show() tells me)
PS: Yes I am aware this could be further optimized, but right now I would like to understand what happens under the hood here.
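(For completeness, the kind of further optimization I have in mind is something like dropping the Python loop over K entirely, e.g. via scipy.spatial.distance.cdist if SciPy is available, or a single 3-D broadcast; a rough sketch, not what this question is about:)

import numpy as np
from scipy.spatial.distance import cdist

D, N, K = 128, 3000, 500
X = np.random.rand(N, D)
Y = np.random.rand(K, D)

# One call instead of the loop over K: a (K, N) matrix of Euclidean distances.
T_cdist = cdist(Y, X)

# Equivalent single broadcast; note it materializes a KxNxD temporary (~1.4 GB at these sizes).
T_bcast = np.sqrt(np.sum(np.square(Y[:, None, :] - X[None, :, :]), axis=2))

assert np.allclose(T_cdist, T_bcast)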