How did numpy implement multi-dimensional broadcasting?

Question

Memory (row major order):

[[A(0,0), A(0,1)]
 [A(1,0), A(1,1)]]

has this memory layout: 
[A(0,0), A(0,1), A(1,0), A(1,1)]

I guess the algorithm work like this in the following cases.

Broadcasting Dimension is last dimension:

[[0, 1, 2, 3]         [[1]
                  x
 [4, 5, 6, 7]]         [10]]

   A (2 by 4)            B (2 by 1)

Iterate 0th dimensions of A and B simultaneously {
    Iterate last dimension of A{
        multiply;
    } 
}

Broadcasting dimension is 0th dimension:

[[0, 1, 2, 3]   
                  x    [[1,10,100,1000]]
 [4, 5, 6, 7]]

   A (2 by 4)              B (1 by 4)

Iterate 0th dimension of A{
    Iterate 1st dimensions of A and B simultaneously{
        multiply;
    }
}

Question:

How does numpy know which order of multiplication is the best. (reading memory in order is better than reading memory all over the place. but how did numpy figure that out?)
What would numpy do if the arrays have more than two dimension
What would numpy do if the broadcasting dimension is not the last dimension?

2nd guess of what is going on:

#include <iostream>
int main(void){
    const int nA = 12;
    const int nB = 3;
    int A[nA];
    int B[nB];
    for(int i = 0; i != nA; ++i) A[i] = i+1;
    for(int i = 0; i != nB; ++i) B[i] = i+1;
    //dimension
    int dA[] = {2,3,2};
    int dB[] = {1,3,1};

    int* pA = A;
    int* pB = B;
    int* pA_end = A + nA;
    //is it possible to make the compiler
    //generate the iA and sA?
    int iB = 0;
    int iB_max = 2;
    int sB[] = {1,0};

    while(pA != pA_end){
        std::cout << "*pA, *pB: " << *pA << ", " << *pB <<std::endl;
        std::cout << "iB: " << iB <<std::endl;
        *(pA) *= *(pB);
        ++pA;
        pB += sB[iB];
        ++iB;
        if (iB == iB_max) {iB = 0; pB = B;}
    }

    for(pA = A; pA != pA_end; ++pA){
        std::cout << *(pA) << ", ";
    }
    std::cout << std::endl;       
}

the 2nd guess seems wrong. numpy seems to use multi-index. – rxu Sep 22 '16 at 17:21 — rxu, Sep 22 '16 at 17:21

score 6 · Accepted Answer · edited May 23 '17 at 11:51

To really get into broadcasting details you need to understand array shape and strides. But a lot of the work is now implemented in c code using nditer. You can read about it at http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html. np.nditer gives you access to the tool at the Python level, but its real value comes when used with cython or your own c code.

np.lib.stride_tricks has functions that let you play with strides. One of its functions helps visualize how arrays are broadcasted together. In practice the work is done with nditer, but this function may help understand the action:

In [629]: np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3), 
                                  np.array([[1],[2]]))
Out[629]: 
[array([[0, 1, 2],
        [3, 4, 5]]), 
 array([[1, 1, 1],
        [2, 2, 2]])]

Note that, effectively the 2nd array has been replicated to match the 1st's shape. But the replication is done with stride tricks, not with actual copies.

In [631]: A,B=np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3), 
                                      np.array([[1],[2]]))
In [632]: A.shape
Out[632]: (2, 3)
In [633]: A.strides
Out[633]: (12, 4)
In [634]: B.shape
Out[634]: (2, 3)
In [635]: B.strides
Out[635]: (4, 0)

It's this (4,0) strides that does the replication without copy.

=================

Using the python level nditer, here's what it does during broadcasting.

In [1]: A=np.arange(6).reshape(2,3)
In [2]: B=np.array([[1],[2]])

The plain nditer feeds elements one set at a time http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html#using-an-external-loop

In [5]: it =np.nditer((A,B))
In [6]: for a,b in it:
   ...:     print(a,b)

0 1
1 1
2 1
3 2
4 2
5 2

But when I turn extenal_loop on, it iterates in chunks, here respective rows of the broadcasted arrays:

In [7]: it =np.nditer((A,B), flags=['external_loop'])
In [8]: for a,b in it:
   ...:     print(a,b)

[0 1 2] [1 1 1]
[3 4 5] [2 2 2]

With a more complex broadcasting the external_loop still produces 1d arrays that allow simple c style iteration:

In [13]: A1=np.arange(24).reshape(3,2,4)
In [18]: it =np.nditer((A1,np.arange(3)[:,None,None]), flags=['external_loop'])
In [19]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([0, 0, 0, 0, 0, 0, 0, 0]))
(array([ 8,  9, 10, 11, 12, 13, 14, 15]), array([1, 1, 1, 1, 1, 1, 1, 1]))
(array([16, 17, 18, 19, 20, 21, 22, 23]), array([2, 2, 2, 2, 2, 2, 2, 2]))

Note that while A1 is (3,2,4), the nditer loop yields 3 steps (1st axis) with 2*4 length elements.

I found in another cython/nditer SO question that the first approach did not produce much of a speed improvement, but the 2nd helped a lot. In c or cython the external_loop case would do simple low level iteration.

===============

If I broadcast on the 1 and 3rd axis, the iterator takes 2*3 steps (effectively flattening the 1st 2 axes, and feeding the 3rd):

In [20]: it =np.nditer((A1,np.arange(2)[None,:,None]), flags=['external_loop'])
In [21]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([0, 1, 2, 3]), array([0, 0, 0, 0]))
(array([4, 5, 6, 7]), array([1, 1, 1, 1]))
(array([ 8,  9, 10, 11]), array([0, 0, 0, 0]))
(array([12, 13, 14, 15]), array([1, 1, 1, 1]))
(array([16, 17, 18, 19]), array([0, 0, 0, 0]))
(array([20, 21, 22, 23]), array([1, 1, 1, 1]))

But with buffered, it iterates once, feeding me 2 1d arrays:

In [22]: it =np.nditer((A1,np.arange(2)[None,:,None]), flags=['external_loop','buffered'])
In [23]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]), 
 array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]))

Does Cython offer any reasonably easy and efficient way to iterate Numpy arrays as if they were flat? has some speed tests, showing that buffered external loop is fastest

cython translates this into fast simple c iteration:

for xarr in it:
   x = xarr
   size = x.shape[0]
   for i in range(size):
       x[i] = x[i]+1.0

`np.lib.stride_tricks.broadcast_arrays` is better spelt [`np.broadcast_arrays`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.broadcast_arrays.html) — Eric, Sep 21 '16 at 22:14
Though the whole `stride_tricks.py` file is good reading if you need to dig deeper. — hpaulj, Sep 21 '16 at 22:20
I try to read the nditer c code, but don't quite understand it yet. Does nditer sacrifice performance? Does niter treat the nested loop as a single loop with variable stride? — rxu, Sep 22 '16 at 15:05
I haven't dug into the `nditer` code; with all the options it is complicated. But I've added some examples that show how it iterates. `external_loop` is needed to get the best speed improvement. — hpaulj, Sep 22 '16 at 16:13

How did numpy implement multi-dimensional broadcasting?

1 Answers1