NUMPY Implementation of AES significantly slower than pure python

Question

I am looking at re-implementing the SlowAES code (http://anh.cs.luc.edu/331/code/aes.py) to try and take advantage of the native array support of numpy. I'm getting what, to me, is the counter-intuitive result that the pure Python of SlowAES is much, much faster than the same functions implemented using numpy. Here is the clearest example I have.

One of the main operations in AES is Shift Rows, where each row in the 4x4 element byte array is shifted by some number of positions (0 for row 0, 1 for row 1, etc.). The original Python code treats this 4x4 byte state array as a one dimensional 16-element list, then uses slicing to create virtual rows to rotate:

def rotate(word, n):
    return word[n:] + word[:n]

def shiftRows(state):
    for i in range(4):
        state[i*4:i*4+4] = rotate(state[i*4:i*4+4], -i)

Running timeit on shiftRows using a list of 16 integers results in a time of 3.47 microseconds.

Re-implementing this same function in numpy, assuming a 4x4 integer input array, would be simply:

import numpy as np

def shiftRows(state):
    for i in range(4):
        state[i] = np.roll(state[i], -i)

However, timeit shows this to have an execution time of 16.3 microseconds.

I was hoping numpy's optimized array operations might result in somewhat faster code. Where am I going wrong? And is there some approach that would result in a faster AES implementation than pure Python? There are some intermediate results that I want to get at, so pycrypto may not be applicable (though if this is going to be too slow, I may have to take a second look).

07 Sep 2016 - Thanks for the answers. To answer the question of "why," I'm looking at running hundreds of thousands, if not millions, of sample plaintext/ciphertext pairs. So, while the time difference for any single encryption makes little difference, any time savings I can get could make a huge difference in the long run.

"And is there some approach that would result in a faster AES implementation than pure Python?" Almost certainly. Do it in C and export a binding to Python. However, I suspect this isn't the answer you're looking for. — Sandy Chapman, Sep 06 '16 at 20:28
`Numpy` usually has more "control" that slows it down compared to plain "stripped-down" source. Check the source code for `roll` here: https://github.com/numpy/numpy/blob/v1.11.0/numpy/core/numeric.py#L1335-L1401 — RafazZ, Sep 06 '16 at 20:29
3.5 vs 16 microseconds? Does this problem scale up to where you might be talking 10s of seconds or minutes? I am not seeing a big problem here, unless this is an optimization-for-optimization's-sake question. — , Sep 06 '16 at 20:30
Related? http://stackoverflow.com/questions/6559463/why-is-numpy-array-so-slow — James K, Sep 06 '16 at 20:31
I assume `block` in the numpy version should actually be `state`. — Warren Weckesser, Sep 06 '16 at 20:31

score 1 · Answer 1 · answered Sep 06 '16 at 20:43

The simple answer is that there's a lot of overhead in creating arrays, so operations on small lists are usually faster than the equivalent operations on arrays. That's especially true if the array version iterates in Python the way the list version does. For large arrays, operations that use compiled methods will be faster.

These 4 'roll' timings illustrate this.

For a small list:

In [93]: timeit x=list(range(16)); x=x[8:]+x[:8]
100000 loops, best of 3: 2.75 µs per loop
In [94]: timeit y=np.arange(16); y=np.roll(y,8)
The slowest run took 40.90 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.5 µs per loop

For a large one:

In [95]: timeit x=list(range(1000)); x=x[500:]+x[:500]
10000 loops, best of 3: 52.9 µs per loop
In [96]: timeit y=np.arange(1000); y=np.roll(y,500)
The slowest run took 28.91 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.2 µs per loop

We could further refine the question by extracting the range and arange steps out of the timing loop.
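A minimal sketch of that refinement, using the standard-library timeit module with the list and the array built in the setup string so that only the rotation itself is timed. The loop count is an arbitrary choice, and the absolute numbers will differ by machine and numpy version.

```python
import timeit

N_LOOPS = 20000  # arbitrary loop count chosen for this sketch

# Build the containers once in setup; time only the rotation.
t_list = timeit.timeit(
    "x[8:] + x[:8]",
    setup="x = list(range(16))",
    number=N_LOOPS,
)
t_arr = timeit.timeit(
    "np.roll(y, 8)",
    setup="import numpy as np; y = np.arange(16)",
    number=N_LOOPS,
)

# Report the average time per call in microseconds.
print("list slice+concat: %.2f us per loop" % (t_list / N_LOOPS * 1e6))
print("np.roll:           %.2f us per loop" % (t_arr / N_LOOPS * 1e6))
```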

The np.roll operation is essentially:

y[np.concatenate((np.arange(8,16), np.arange(0,8)))]

That constructs 4 arrays: the 2 arange results, the concatenated index array, and the final indexed array.
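The equivalence can be checked directly, and the same indexing idea suggests one way to reduce the per-call overhead: build the index arrays once and apply them to many states in a single call. This is only a sketch under assumptions. The names ROWS, COLS, shift_rows_loop and shift_rows_batch are hypothetical, and the (N, 4, 4) batch layout is not taken from the SlowAES code. It follows the rotation direction of np.roll(state[i], -i) from the question's numpy version.

```python
import numpy as np

# np.roll on a 1-D array behaves like indexing with a rotated index array.
y = np.arange(16)
idx = np.concatenate((np.arange(8, 16), np.arange(0, 8)))
assert np.array_equal(y[idx], np.roll(y, 8))

# Precompute the indices once (hypothetical names for this sketch).
# Row i takes its j-th element from column (j + i) % 4, which matches
# np.roll(state[i], -i).
ROWS = np.arange(4)[:, None]                               # shape (4, 1)
COLS = (np.arange(4)[None, :] + np.arange(4)[:, None]) % 4  # shape (4, 4)

def shift_rows_loop(state):
    """Reference version: one np.roll call per row of a 4x4 state."""
    out = state.copy()
    for i in range(4):
        out[i] = np.roll(state[i], -i)
    return out

def shift_rows_batch(states):
    """Shift the rows of every state in an (N, 4, 4) array at once."""
    return states[:, ROWS, COLS]

# Usage: a batch of two states (assumed layout: N x 4 x 4).
states = np.arange(32, dtype=np.uint8).reshape(2, 4, 4)
shifted = shift_rows_batch(states)
```

With this layout the fixed cost of creating arrays is paid once per batch rather than once per row, which is where array operations tend to pay off when many plaintext/ciphertext pairs are processed together.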
