
I want to apply Arnold's cat map to my matrix. Here is my implementation:

import numpy as np

def cat_mapping(matrix, MAX):
    width, height = matrix.shape
    transformed_matrix = np.empty([width, height]).astype(np.uint8)
    counter = 1
    x = np.repeat(np.arange(width).reshape(-1, 1), height, axis=-1).T
    y = np.repeat(np.arange(height).reshape(-1, 1), width, axis=-1)
    nx = (2 * x + y) % width
    ny = (x + y) % height
    while counter <= MAX:
        transformed_matrix[ny, nx] = matrix[y, x]
        matrix = transformed_matrix
        if counter != MAX:
            transformed_matrix = np.empty([width, height])
        counter = counter + 1
    return transformed_matrix

This works perfectly. But when the size of the array increases beyond 10000 with a bigger iteration value MAX, this implementation becomes really slow. Even using Numba, the result is not satisfactory.

I was wondering: could the transformation be broken into smaller parts and the results combined, the way divide and conquer does?

Update

@JeromeRichard helped make it faster using Numba, which is nice. But would it become even faster if we managed to implement the divide-and-conquer (DC) paradigm? I tried it with some demo data like this:

def split(matrix):
    row, col = matrix.shape
    row2, col2 = row//2, col//2
    # .copy() makes each quadrant C-contiguous, as the `uint8[:,::1]` signature requires
    return (matrix[:row2, :col2].copy(), matrix[:row2, col2:].copy(),
            matrix[row2:, :col2].copy(), matrix[row2:, col2:].copy())

main = np.arange(1000*1000).reshape(1000, 1000).astype(np.uint8)  # uint8 to match the Numba signature
a,b,c,d = split(main)
a = cat_mapping_fast(a,100)
b = cat_mapping_fast(b,100)
c = cat_mapping_fast(c,100)
d = cat_mapping_fast(d,100)
merged = np.vstack((np.hstack((a, b)), np.hstack((c, d))))

But I couldn't push the recursion any deeper because of the question: how do I merge the results back?

Any solution or hint will be appreciated.

falamiw
  • I think you should use `height, width = matrix.shape` instead and `np.empty([height, width])`, since you then access `matrix[y, x]` and `x` is an array of size `width`. This might explain why the code works only for square matrices (a sketch of the fix follows below). – Jérôme Richard Apr 24 '22 at 09:44
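
That suggested fix would look something like this small sketch:

height, width = matrix.shape
transformed_matrix = np.empty([height, width]).astype(np.uint8)
# `x` (size `width`) now indexes the second axis and `y` (size `height`) the
# first, so matrix[y, x] stays consistent for non-square matrices too.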

1 Answer


The current code is quite slow because `matrix[y, x]` creates a new temporary array, and `transformed_matrix[ny, nx] = matrix[y, x]` is pretty slow since it needs to read `nx` and `ny` from the memory hierarchy and the memory access pattern is not efficient. When the matrix is big, the code should be memory-bound, and the unneeded memory operations become pretty expensive. Note that the `np.empty([width, height])` array contains double-precision floating-point numbers that take 8 times more space than `np.uint8`, so it is 8 times slower to fill in memory.
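
You can check the dtype difference quickly:

import numpy as np

a = np.empty([4, 4])                  # default dtype is float64: 8 bytes per cell
b = np.empty([4, 4], dtype=np.uint8)  # 1 byte per cell
print(a.itemsize, b.itemsize)         # prints: 8 1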

You can speed up the code a lot using Numba, basic loops, and double buffering, so as to avoid creating many temporary arrays and reading big ones. The idea is to compute the indices `(ny, nx)` on the fly within the loops. Since modulus operations are quite expensive and the memory access pattern causes the code to be more latency-bound, multiple threads are used so as to better saturate the memory. Here is the resulting code:

import numpy as np
import numba as nb

@nb.njit('uint8[:,::1](uint8[:,::1], int_)', parallel=True)
def cat_mapping_fast(matrix, MAX):
    height, width = matrix.shape
    # Double buffering: two preallocated buffers are swapped each iteration,
    # so no temporary array is created inside the loop.
    buff1 = np.empty((height, width), dtype=np.uint8)
    buff2 = np.empty((height, width), dtype=np.uint8)
    buff1[:,:] = matrix
    for _ in range(MAX):
        for y in nb.prange(height):
            for x in range(width):
                # Compute the target indices on the fly.
                nx = (2 * x + y) % width
                ny = (x + y) % height
                buff2[ny, nx] = buff1[y, x]
        buff1, buff2 = buff2, buff1
    return buff1

This code is significantly faster than the initial one, especially when the two buffers fit in the CPU cache. When the input matrix is so huge that it does not fit in the cache, the inefficient memory access pattern makes the code a bit slower, but there is not much to be done since the computation behaves like a big shuffle, which is not cache-friendly. Still, on a 4096x4096 matrix with MAX=20, the Numba code is 25 times faster on my 6-core machine (about 0.38 seconds compared to 10 seconds for the initial code).
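
A quick usage sketch (the `astype` conversion is needed so the input matches the `uint8` signature; the demo data here is arbitrary):

import numpy as np

main = np.arange(4096 * 4096).reshape(4096, 4096).astype(np.uint8)
result = cat_mapping_fast(main, 20)  # apply 20 iterations of the map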

Jérôme Richard
  • Thanks for your response @JeromeRichard. I tried to run your implementation like this: `cat_mapping_fast(np.arange(100*100).reshape(100,100), 4)`, but it gives a `TypeError: No matching definition for argument type(s) array(int32, 2d, C), int64` error. Any idea? – falamiw Apr 24 '22 at 12:24
  • This is because the Numba function is designed only for the `np.uint8` type. This is critical for performance. Thus, if your input array values fit in the `np.uint8` type, you should convert the array to `np.uint8` using `astype`. Otherwise, you can modify the Numba signature (ie. the string `'uint8[:,::1](uint8[:,::1], int_)'`). Alternatively, you could just remove the signature string and use `@nb.njit(parallel=True)` to let Numba compile the function at runtime during the first call, at the expense of a slower execution due to possibly suboptimal types (and the lazy compilation). – Jérôme Richard Apr 24 '22 at 12:49
  • Note that you can also provide multiple signatures with something like `['uint8[:,::1](uint8[:,::1], int_)', 'uint32[:,::1](uint32[:,::1], int_)']`, but be aware that Numba needs to compile each signature separately (which could be expensive if you want to support all 8 possible integer types). – Jérôme Richard Apr 24 '22 at 12:51
  • Aha, I understand now. Thanks(+1) for your help @JérômeRichard – falamiw Apr 24 '22 at 13:20
  • Could you advise me on my `update`? @JérômeRichard – falamiw Apr 25 '22 at 17:01
  • I think the divide-and-conquer approach will not be significantly faster since the memory accesses are fundamentally very sparse. Even for a given small output tile to write, it appears you need many cells that are not near each other. Divide and conquer works well when the smaller problems are easier/faster to compute, but this does not seem to be the case here. Moreover, it adds a new overhead due to the function calls. I am not 100% sure there is no way to make a DC approach efficient, but I am clearly not convinced by the naive DC approach. – Jérôme Richard Apr 25 '22 at 18:42
  • If you want to speed up the code, you can try to reverse the `transformed_matrix[ny, nx] = matrix[y, x]` operation so that the written values are contiguous but not the read ones. Since the operation is a permutation (ie. a bijection), the reverse permutation can be computed; see the sketch after these comments. Doing that results in a more cache-friendly operation since written cache lines will not be evicted. It should also be faster because of the write-allocate cache policy of mainstream processors. – Jérôme Richard Apr 25 '22 at 18:45
  • By the way, it turns out you need this method (ie. reversing the permutation) to be able to merge the tiles in the DC approach anyway. – Jérôme Richard Apr 25 '22 at 18:59
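
A minimal sketch of that reversed (gather) formulation, assuming a square N×N matrix: the cat map matrix [[2, 1], [1, 1]] has determinant 1, so its inverse modulo N is [[1, -1], [-1, 2]], which gives the source cell for each destination cell. The name `cat_mapping_gather` is illustrative, not from the thread:

import numpy as np
import numba as nb

# Gather formulation: writes to buff2 are contiguous; only reads are scattered.
# Assumes a square n x n matrix so the map is an invertible permutation.
@nb.njit('uint8[:,::1](uint8[:,::1], int_)', parallel=True)
def cat_mapping_gather(matrix, MAX):
    n = matrix.shape[0]  # assumes matrix.shape == (n, n)
    buff1 = matrix.copy()
    buff2 = np.empty((n, n), dtype=np.uint8)
    for _ in range(MAX):
        for ny in nb.prange(n):
            for nx in range(n):
                # Invert nx = (2x + y) % n, ny = (x + y) % n:
                x = (nx - ny) % n
                y = (2 * ny - nx) % n
                buff2[ny, nx] = buff1[y, x]
        buff1, buff2 = buff2, buff1
    return buff1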