
I'm doing some vectorized algebra using numpy and the wall-clock performance of my algorithm seems weird. The program does roughly as follows:

  1. Create three matrices: Y (KxD), X (NxD), T (KxN).
  2. For each row Y[i]:
     • subtract Y[i] from each row of X (by broadcasting),
     • square the differences, sum them along one axis, take a square root, then store the result in T[i].
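
In effect, T ends up holding the pairwise Euclidean distances between rows of Y and rows of X. Just to illustrate what the loop computes (assuming SciPy is available), the same matrix comes out of scipy.spatial.distance.cdist:

from scipy.spatial.distance import cdist

# T[i, j] is the Euclidean distance between Y[i] and X[j]
T_ref = cdist(Y, X)  # result has shape (K, N)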

However, depending on how I perform the broadcasting, computation speed is vastly different. Consider the code:

import numpy as np
from time import perf_counter

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

if True:  # negate to enable the second loop instead
    total = 0.0
    for rep in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                    X - Y[i]  # broadcast done in-line; this has dimensions NxD
                ),
                axis=1
            ))
        total += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(total / 100))
    exit()

if True:
    total = 0.0
    for rep in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]  # broadcast into a named temporary first
            T[i] = np.sqrt(np.sum(
                np.square(
                    diff
                ),
                axis=1
            ))
        total += perf_counter() - start
    print("Broadcast out:     {:.3f} s".format(total / 100))
    exit()

Times for each loop are measured individually and averaged over 100 executions. The results:

Broadcast in line: 1.504 s
Broadcast out:     0.438 s

The only difference is that in the first loop the broadcasting and subtraction are done in-line, while in the second I store the difference in a temporary before applying the remaining vectorized operations. Why does this make such a difference?

My system configuration:

  • Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
  • Xubuntu 18.04.2 LTS (bionic)
  • Python 3.7.3 (GCC 7.3.0)
  • Numpy 1.16.3 linked against OpenBLAS (that's as much as np.__config__.show() tells me)

PS: Yes I am aware this could be further optimized, but right now I would like to understand what happens under the hood here.

  • Are these measurements consistent? What's the average of running this script, say, a hundred times? Does the "inline" method get faster if you swap the two loops? – ForceBru Aug 01 '19 at 15:15
  • @ForceBru Updated the results and code. Results stay consistent even if I only use one method for each execution of the script. – Przemek D Aug 01 '19 at 15:25
  • just hacked that into a `timeit` call, with `for i in range(K):` and not `for i in range(100):`. The in line method is only ~8% faster on my machine... – FObersteiner Aug 01 '19 at 15:44
  • You aren't moving `X-Y[i]` outside the `i` loop. I get about a 20% speedup. I don't know for sure, but I suspect the speed differences are linked to underlying memory management, issues that we don't have direct control over. I can do the same thing without a loop: `np.sqrt(np.sum( np.square(X-Y[:,None,:]), axis=2)) `, but timings are a bit slower. – hpaulj Aug 01 '19 at 16:42
  • I tested this on a different machine (Windows, just added my specs to the post) and both loops run in roughly the same time. This is worrying, because the dimensions could be much larger and we'd be talking about minutes of difference, not fractions of a second. There must be some explanation to the mysterious *underlying memory management* - obviously [some abstraction is leaking](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/) on me and I'd love to know what it is... – Przemek D Aug 02 '19 at 09:21
  • I'm not getting performance differences, Broadcast out: 1.909 s, vs Broadcast in line: 1.874s, but I'm on windows (same python and numpy version) – Krupip Aug 05 '19 at 17:16
  • @PrzemekD Was the Windows machine also a Dual Socket System? – max9111 Aug 09 '19 at 12:20
  • @max9111 Yes, 2x Xeon E5620 (or similar), 64GB RAM as well. Would have to check Numpy version though. – Przemek D Aug 09 '19 at 15:54
  • I couldn't reproduce this on a Desktop Windows system, but on a Dual Socket Linux. – max9111 Aug 09 '19 at 18:11

1 Answer


It's not a broadcasting problem

I also added an optimized solution (func_3 below) to see how long the actual calculation takes without the large overhead of memory allocation and deallocation.

Functions

import numpy as np
import numba as nb

def func_1(X, Y, T):
    # broadcasting and subtraction done in-line
    for i in range(Y.shape[0]):
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

def func_2(X, Y, T):
    # difference stored in a temporary first
    for i in range(Y.shape[0]):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

@nb.njit(fastmath=True, parallel=True)
def func_3(X, Y, T):
    # explicit loops compiled by Numba; no temporary arrays are allocated
    for i in nb.prange(Y.shape[0]):
        for j in range(X.shape[0]):
            diff_sq_sum = 0.
            for k in range(X.shape[1]):
                diff_sq_sum += (X[j, k] - Y[i, k])**2
            T[i, j] = np.sqrt(diff_sq_sum)
    return T
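
If memory allocation really is the bottleneck, a useful cross-check is a variant that allocates its temporaries once and reuses them via NumPy's out= arguments. This func_2_prealloc is only a sketch I'm adding for illustration; it is not part of the timings below:

def func_2_prealloc(X, Y, T):
    # one (N, D) scratch buffer and one (N,) row buffer, reused every iteration
    diff = np.empty_like(X)
    row = np.empty(X.shape[0])
    for i in range(Y.shape[0]):
        np.subtract(X, Y[i], out=diff)  # broadcasted subtraction, no new array
        np.square(diff, out=diff)       # square in place
        np.sum(diff, axis=1, out=row)   # row-wise sum into the preallocated buffer
        np.sqrt(row, out=T[i])          # write the final distances directly into T
    return T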

Timings

I did all the timings in a Jupyter notebook and observed really weird behavior. The following code is all in one cell. I also tried calling %timeit multiple times, but on the first execution of the cell this doesn't change anything.

First execution of the cell

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

# Running this more often would not change anything
%timeit func_1(X,Y,T)
%timeit func_1(X,Y,T)

# Running this more often would not change anything
%timeit func_2(X,Y,T)
%timeit func_2(X,Y,T)

### first %timeit call absorbs Numba's compilation overhead ###
%timeit func_3(X,Y,T)
##########################################
%timeit func_3(X,Y,T)

774 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
768 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.74 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Second execution of the cell

345 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
337 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
322 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.93 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
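
As a quick sanity check, all three functions can be compared on fresh output arrays; np.allclose is used rather than exact equality because fastmath=True may reorder the floating-point summation in func_3:

T1 = func_1(X, Y, np.zeros((K, N)))
T2 = func_2(X, Y, np.zeros((K, N)))
T3 = func_3(X, Y, np.zeros((K, N)))
print(np.allclose(T1, T2), np.allclose(T1, T3))  # should print: True True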
  • So, you're saying that it has to do with memory allocation more than broadcasting? And only happens on dual socket Linux systems? This is so very strange. – Przemek D Sep 16 '19 at 07:27
  • I don't have a single-socket Linux system available. This behavior is really weird; it looks like something is going wrong with memory management. I guess it would be good to open an issue (https://github.com/numpy/numpy/issues) on this topic. But anyway, if speed matters for this function, implement it in Cython or Numba. – max9111 Sep 16 '19 at 07:38