
A short note: this question relates to another one I asked previously, but since asking multiple questions within a single Q&A is considered bad SO style, I split it up.


Setup

I have the following two implementations of a matrix-calculation:

  1. The first implementation uses a matrix of shape (n, m), and the calculation is repeated repetition times in an outer for-loop:
import numpy as np

n, m = 1000, 1000          # dimensions used in the timing test below
matrix = np.zeros((n, m))  # placeholder; the real matrix is built beforehand

def foo():
    for i in range(1, n):
        for j in range(1, m):

            _deleteA = (
                        matrix[i, j] +
                        #some constants added here
            )
            _deleteB = (
                        matrix[i, j-1] +
                        #some constants added here
            )
            matrix[i, j] = min(_deleteA, _deleteB)

    return matrix

repetition = 3
for x in range(repetition):
    foo()


  2. The second implementation avoids the extra for-loop and instead folds repetition = 3 into the matrix, which then has shape (repetition, n, m):

matrix = np.zeros((repetition, n, m))  # placeholder; the real matrix is built beforehand

def foo():
    for i in range(1, n):
        for j in range(1, m):

            _deleteA = (
                        matrix[:, i, j] +
                        #some constants added here
            )
            _deleteB = (
                        matrix[:, i, j-1] +
                        #some constants added here
            )
            matrix[:, i, j] = np.amin(np.stack((_deleteA, _deleteB), axis=1), axis=1)

    return matrix
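
For what it's worth, the np.stack/np.amin combination in the inner loop should be equivalent to np.minimum, which takes the element-wise minimum of two arrays without building a temporary stacked array. A minimal check:

```python
import numpy as np

a = np.array([1.0, 5.0, 2.0])
b = np.array([3.0, 4.0, 0.5])

# element-wise minimum, no temporary (len(a), 2) array needed
direct = np.minimum(a, b)
stacked = np.amin(np.stack((a, b), axis=1), axis=1)

print(direct)                           # [1.  4.  0.5]
print(np.array_equal(direct, stacked))  # True
```

This simplifies the code, though it does not by itself explain the 2D-vs-3D timing gap.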


Question

Comparing the performance of both implementations with %timeit in IPython, I found:

  1. The first implementation is faster (in my test case with n=1000, m=1000: 17 s vs. 26 s). Why is numpy so much slower when working on three dimensions instead of two?
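
A self-contained version of the comparison can be sketched as follows. The constants here are hypothetical stand-ins for the elided ones, and the dimensions are smaller than the question's 1000x1000 so the demo runs quickly; absolute times depend on hardware.

```python
import timeit

import numpy as np

n, m = 200, 200   # smaller than the 1000x1000 of the question, for a quick demo
repetition = 3

def foo_2d(matrix):
    # version 1: operates on one (n, m) matrix per call
    for i in range(1, n):
        for j in range(1, m):
            _deleteA = matrix[i, j] + 1.0      # hypothetical constant
            _deleteB = matrix[i, j - 1] + 2.0  # hypothetical constant
            matrix[i, j] = min(_deleteA, _deleteB)
    return matrix

def foo_3d(matrix):
    # version 2: updates all `repetition` layers of a (repetition, n, m) matrix at once
    for i in range(1, n):
        for j in range(1, m):
            _deleteA = matrix[:, i, j] + 1.0      # hypothetical constant
            _deleteB = matrix[:, i, j - 1] + 2.0  # hypothetical constant
            matrix[:, i, j] = np.amin(np.stack((_deleteA, _deleteB), axis=1), axis=1)
    return matrix

t_2d = timeit.timeit(lambda: [foo_2d(np.zeros((n, m))) for _ in range(repetition)], number=1)
t_3d = timeit.timeit(lambda: foo_3d(np.zeros((repetition, n, m))), number=1)
print(f"2D x{repetition}: {t_2d:.3f} s, 3D: {t_3d:.3f} s")
```

Starting both versions from a zero matrix with the same constants, every layer of the 3D result matches the 2D result, so the two variants compute the same thing and only their timing differs.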
Markus
  • Memory layout, prefetch, caching? – Thomas Weller Jul 16 '19 at 11:10
  • Everything is inside memory for fast access (`matrix` already built and loaded). Both test cases had the same test environment and the same initial preconditions. This did not happen in just a single test; I repeated it again after some days with the same result. – Markus Jul 16 '19 at 11:12
  • Memory as in RAM or memory as CPU cache? Makes a huge difference – Thomas Weller Jul 16 '19 at 11:13
  • Since I repeated the `%timeit` several times for each case, I'd assume it would already be in the CPU cache as well, but I'm not enough of an expert to answer this correctly, sorry. I can just state that both cases had the same preconditions. – Markus Jul 16 '19 at 11:17
  • The memory use of the second version is 3x more than the first version since it looks like the first version operates on the same matrix 3 times, whereas the second version operates once on a matrix 3x the size? How long does it take to build the matrix in the first place? – Tom Dalton Jul 16 '19 at 11:18
  • @TomDalton: The matrix creation is a separate process and, therefore, the timing of its creation should be out of scope for the question (since it already exists in RAM and is not counted in the `timeit`). This is rather about why the access time of the array differs this much with increasing nesting. – Markus Jul 16 '19 at 11:28
  • The CPU cache has a limited size, e.g. 2 MB or 6 MB (you need to look that up for your CPU model). If the size of the matrix is beyond that size, things become slow. The 3D matrix is certainly larger, so that might explain the difference. It depends on the dimensions of the matrix. A 1:1:1 matrix is probably faster than a 1000:1000 matrix. – Thomas Weller Jul 16 '19 at 13:00
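
The cache-size remark above is easy to quantify: assuming NumPy's default float64 dtype (8 bytes per element), the arrays in the question occupy roughly 8 MB and 24 MB, which `ndarray.nbytes` confirms. Both sizes exceed a typical L2/L3 cache of a few MB.

```python
import numpy as np

n, m, repetition = 1000, 1000, 3

two_d = np.zeros((n, m))                # version 1
three_d = np.zeros((repetition, n, m))  # version 2

# float64 = 8 bytes per element
print(two_d.nbytes)    # 8000000  (~8 MB)
print(three_d.nbytes)  # 24000000 (~24 MB)
```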

0 Answers