Fast tensor rotation with NumPy

Question

At the heart of an application (written in Python and using NumPy) I need to rotate a 4th order tensor. Actually, I need to rotate a lot of tensors many times and this is my bottleneck. My naive implementation (below) involving eight nested loops seems to be quite slow, but I cannot see a way to leverage NumPy's matrix operations and, hopefully, speed things up. I've a feeling I should be using np.tensordot, but I don't see how.

Mathematically, elements of the rotated tensor, T', are given by: T'_ijkl = Σ g_ia g_jb g_kc g_ld T_abcd with the sum being over the repeated indices on the right hand side. T and Tprime are 3*3*3*3 NumPy arrays and the rotation matrix g is a 3*3 NumPy array. My slow implementation (taking ~0.04 seconds per call) is below.

#!/usr/bin/env python

import numpy as np

def rotT(T, g):
    Tprime = np.zeros((3,3,3,3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii,i]*g[jj,j]*g[kk,k]*g[ll,l]
                                    Tprime[i,j,k,l] = Tprime[i,j,k,l] + \
                                         gg*T[ii,jj,kk,ll]
    return Tprime

if __name__ == "__main__":

    T = np.array([[[[  4.66533067e+01,  5.84985000e-02, -5.37671310e-01],
                    [  5.84985000e-02,  1.56722231e+01,  2.32831900e-02],
                    [ -5.37671310e-01,  2.32831900e-02,  1.33399259e+01]],
                   [[  4.60051700e-02,  1.54658176e+01,  2.19568200e-02],
                    [  1.54658176e+01, -5.18223500e-02, -1.52814920e-01],
                    [  2.19568200e-02, -1.52814920e-01, -2.43874100e-02]],
                   [[ -5.35577630e-01,  1.95558600e-02,  1.31108757e+01],
                    [  1.95558600e-02, -1.51342210e-01, -6.67615000e-03],
                    [  1.31108757e+01, -6.67615000e-03,  6.90486240e-01]]],
                  [[[  4.60051700e-02,  1.54658176e+01,  2.19568200e-02],
                    [  1.54658176e+01, -5.18223500e-02, -1.52814920e-01],
                    [  2.19568200e-02, -1.52814920e-01, -2.43874100e-02]],
                   [[  1.57414726e+01, -3.86167500e-02, -1.55971950e-01],
                    [ -3.86167500e-02,  4.65601977e+01, -3.57741000e-02],
                    [ -1.55971950e-01, -3.57741000e-02,  1.34215636e+01]],
                   [[  2.58256300e-02, -1.49072770e-01, -7.38843000e-03],
                    [ -1.49072770e-01, -3.63410500e-02,  1.32039847e+01],
                    [ -7.38843000e-03,  1.32039847e+01,  1.38172700e-02]]],
                  [[[ -5.35577630e-01,  1.95558600e-02,  1.31108757e+01],
                    [  1.95558600e-02, -1.51342210e-01, -6.67615000e-03],
                    [  1.31108757e+01, -6.67615000e-03,  6.90486240e-01]],
                   [[  2.58256300e-02, -1.49072770e-01, -7.38843000e-03],
                    [ -1.49072770e-01, -3.63410500e-02,  1.32039847e+01],
                    [ -7.38843000e-03,  1.32039847e+01,  1.38172700e-02]],
                   [[  1.33639532e+01, -1.26331100e-02,  6.84650400e-01],
                    [ -1.26331100e-02,  1.34222177e+01,  1.67851800e-02],
                    [  6.84650400e-01,  1.67851800e-02,  4.89151396e+01]]]])

    g = np.array([[ 0.79389393,  0.54184237,  0.27593346],
                  [-0.59925749,  0.62028664,  0.50609776],
                  [ 0.10306737, -0.56714313,  0.8171449 ]])

    for i in range(100):
        Tprime = rotT(T,g)

Is there a way to make this go faster? Making the code generalise to other ranks of tensor would be useful, but is less important.

And, if it becomes clear that doing this more quickly in numpy or scipy isn't easy, I'll bash out a Fortran extension module and see how that performs. — Andrew Walker, Feb 10 '11 at 21:15
If all else fails, you could use Cython. It supposedly [plays well with numpy](http://docs.cython.org/src/tutorial/numpy.html). — , Feb 10 '11 at 21:17
While I'm moderately sure there's a way do to this will fewer nested loops in numpy (I don't immediately see it, though), as @delnan said, your current code is a prime candidate for Cython.... — Joe Kington, Feb 10 '11 at 21:22

Philipp · Accepted Answer · 2011-02-10T22:24:10.740

To use tensordot, compute the outer product of the g tensors:

def rotT(T, g):
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)

On my system, this is around seven times faster than Sven's solution. If the g tensor doesn't change often, you can also cache the gggg tensor. If you do this and turn on some micro-optimizations (inlining the tensordot code, no checks, no generic shapes), you can still make it two times faster:

def rotT(T, gggg):
    return np.dot(gggg.transpose((1, 3, 5, 7, 0, 2, 4, 6)).reshape((81, 81)),
                  T.reshape(81, 1)).reshape((3, 3, 3, 3))

Results of timeit on my home laptop (500 iterations):

Your original code: 19.471129179
Sven's code: 0.718412876129
My first code: 0.118047952652
My second code: 0.0690279006958

The numbers on my work machine are:

Your original code: 9.77922987938
Sven's code: 0.137110948563
My first code: 0.0569641590118
My second code: 0.0308079719543

btw, cython version is 4 times faster than the first code http://stackoverflow.com/questions/4962606/fast-tensor-rotation-with-numpy/4973390#4973390 — jfs, Feb 11 '11 at 19:56
Posted a [`close one with four levels of tensordot`](http://stackoverflow.com/a/42347571/3293881). Thought might be of interest to you :) — Divakar, Feb 20 '17 at 14:53

score 34 · Answer 2 · answered Feb 10 '11 at 21:28

34

Here is how to do it with a single Python loop:

def rotT(T, g):
    Tprime = T
    for i in range(4):
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        Tprime = g[slices].T * Tprime
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)

Admittedly, this is a bit hard to grasp at first glance, but it's quite a bit faster :)

answered Feb 10 '11 at 21:28

Sven Marnach

574,206
118
941
841

6

+1 for math awesomeness and for yet another proof that algorithmic optimizations beat microoptimizations by orders of magnitude. – Feb 10 '11 at 21:38
2

Quite nice! +1 Along those lines, there's a function that may be included in future versions of numpy that would also fit the OP's purpose... `numpy.einsum` (It's not there yet). See this discussion on numpy-discussion: http://www.mail-archive.com/numpy-discussion@scipy.org/msg29680.html – Joe Kington Feb 10 '11 at 21:40

score 19 · Answer 3 · answered Feb 20 '11 at 21:20

Thanks to hard work by M. Wiebe, the next version of Numpy (which will probably be 1.6) is going to make this even easier:

>>> Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

Philipp's approach is at the moment 3x faster, though, but perhaps there is some room for improvement. The speed difference is probably mostly due to tensordot being able to unroll the whole operation as a single matrix product that can be passed on to BLAS, and so avoiding much of the overhead associated with small arrays --- this is not possible for general Einstein summation, as not all operations that can be expressed in this form resolve to a single matrix product.

Used your approach to illustrate [my point and approach using four `tensordots`](http://stackoverflow.com/a/42347571/3293881). Good concise one with `einsum`! — Divakar, Feb 20 '17 at 14:55

score 10 · Answer 4 · edited May 23 '17 at 10:31

Out of curiosity I've compared Cython implementation of a naive code from the question with the numpy code from @Philipp's answer. Cython code is four times faster on my machine:

#cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as np

def rotT(np.ndarray[np.float64_t, ndim=4] T,
         np.ndarray[np.float64_t, ndim=2] g):
    cdef np.ndarray[np.float64_t, ndim=4] Tprime
    cdef Py_ssize_t i, j, k, l, ii, jj, kk, ll
    cdef np.float64_t gg

    Tprime = np.zeros((3,3,3,3), dtype=T.dtype)
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii,i]*g[jj,j]*g[kk,k]*g[ll,l]
                                    Tprime[i,j,k,l] = Tprime[i,j,k,l] + \
                                         gg*T[ii,jj,kk,ll]
    return Tprime

score 3 · Answer 5 · answered Aug 18 '13 at 17:39

I thought I'd contribute a relatively new data point to these benchmarks, using parakeet, one of the numpy-aware JIT compilers that's sprung up in the past few months. (The other one I'm aware of is numba, but I didn't test it here.)

After you make it through the somewhat labyrinthine installation process for LLVM, you can decorate many pure-numpy functions to (often) speed up their performance :

import numpy as np
import parakeet

@parakeet.jit
def rotT(T, g):
    # ...

I only tried applying the JIT to Andrew's code in the original question, but it does pretty well (> 10x speedup) for not having to write any new code whatsoever :

andrew      10 loops, best of 3: 206 msec per loop
andrew_jit  10 loops, best of 3: 13.3 msec per loop
sven        100 loops, best of 3: 2.39 msec per loop
philipp     1000 loops, best of 3: 0.879 msec per loop

For these timings (on my laptop) I ran each function ten times, to give the JIT a chance to identify and optimize the hot code paths.

Interestingly, Sven and Philipp's suggestions are still orders of magnitude faster !

I'm late in joining this conversation, but I just wanted to point out that these methods scale differently with the size of the data. When I change the tensors from being 3x3x3x3 to 7x7x7x7 then Parakeet and Numba both take ~10ms on my machine, but Phillip's first solution takes ~80ms. — Alex Rubinsteyn, Nov 25 '13 at 01:37
have you tried to compare it with [the cython implementation of the naive code from the original question](http://stackoverflow.com/a/4973390/4279)? — jfs, Oct 08 '14 at 23:34

score 2 · Answer 6 · edited May 23 '17 at 11:54

Prospective Approach and solution code

For memory efficiency and thereafter performance efficiency, we could use tensor matrix-multiplication in steps.

To illustrate the steps involved, let's use the simplest of the solutions with np.einsum by @pv. -

np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

As seen, we are losing the first dimension from g with tensor-multiplication between its four variants and T.

Let's do those sum-reductions for tensor matrix multiplications in steps. Let's start off with the first variant of g and T :

p1 = np.einsum('abcd, ai->bcdi', T, g)

Thus, we end up with a tensor of dimensions as string notation : bcdi. The next steps would involve sum-reducing this tensor against the rest of the three g variants as used in the original einsum implmentation. Hence, the next reduction would be -

p2 = np.einsum('bcdi, bj->cdij', p1, g)

As seen, we have lost the first two dimensions with the string notations : a, b. We continue it for two more steps to get rid of c and d too and would be left with ijkl as the final output, like so -

p3 = np.einsum('cdij, ck->dijk', p2, g)

p4 = np.einsum('dijk, dl->ijkl', p3, g)

Now, we could use np.tensordot for these sum-reductions, which would be much more efficient.

Final implementation

Thus, porting over to np.tensordot, we would have the final implementation like so -

p1 = np.tensordot(T,g,axes=((0),(0)))
p2 = np.tensordot(p1,g,axes=((0),(0)))
p3 = np.tensordot(p2,g,axes=((0),(0)))
out = np.tensordot(p3,g,axes=((0),(0)))

Runtime test

Let's test out all the NumPy based approaches posted across other posts to solve the problem on performance.

Approaches as functions -

def rotT_Philipp(T, g):  # @Philipp's soln
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)

def rotT_Sven(T, g):    # @Sven Marnach's soln
    Tprime = T
    for i in range(4):
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        Tprime = g[slices].T * Tprime
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)    

def rotT_pv(T, g):     # @pv.'s soln
    return np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

def rotT_Divakar(T,g): # Posted in this post
    p1 = np.tensordot(T,g,axes=((0),(0)))
    p2 = np.tensordot(p1,g,axes=((0),(0)))
    p3 = np.tensordot(p2,g,axes=((0),(0)))
    p4 = np.tensordot(p3,g,axes=((0),(0)))
    return p4

Timings with the original dataset sizes -

In [304]: # Setup inputs 
     ...: T = np.random.rand(3,3,3,3)
     ...: g = np.random.rand(3,3)
     ...: 

In [305]: %timeit rotT(T, g)
     ...: %timeit rotT_pv(T, g)
     ...: %timeit rotT_Sven(T, g)
     ...: %timeit rotT_Philipp(T, g)
     ...: %timeit rotT_Divakar(T, g)
     ...: 
100 loops, best of 3: 6.51 ms per loop
1000 loops, best of 3: 247 µs per loop
10000 loops, best of 3: 137 µs per loop
10000 loops, best of 3: 41.6 µs per loop
10000 loops, best of 3: 28.3 µs per loop

In [306]: 6510.0/28.3 # Speedup with the proposed soln over original code
Out[306]: 230.03533568904592

As discussed at the start of this post, we are trying to achieve memory efficiency and hence performance boost with it. Let's test that out as we increase the dataset sizes -

In [307]: # Setup inputs 
     ...: T = np.random.rand(5,5,5,5)
     ...: g = np.random.rand(5,5)
     ...: 

In [308]: %timeit rotT(T, g)
     ...: %timeit rotT_pv(T, g)
     ...: %timeit rotT_Sven(T, g)
     ...: %timeit rotT_Philipp(T, g)
     ...: %timeit rotT_Divakar(T, g)
     ...: 
100 loops, best of 3: 6.54 ms per loop
100 loops, best of 3: 7.17 ms per loop
100 loops, best of 3: 2.7 ms per loop
1000 loops, best of 3: 1.47 ms per loop
10000 loops, best of 3: 39.9 µs per loop

Patol75 · Answer 7 · 2019-10-15T07:04:19.400

Not a new answer, as all the previous ones deal well with the question. More like a comment, but I post it as an answer to have some space for the code.

While all answers do reproduce the result of the original post, I am pretty sure that the code provided in the original post is wrong. Looking at the formula T'_ijkl = Σ g_ia g_jb g_kc g_ld T_abcd, which I believe is correct, the indices for g that are varied in the calculation of each entry of T' are a, b, c & d. However, in the original provided code, the indices used to access the values of g in the calculation of gg are swapped with regard to the formula. Hence, I believe the following code actually provides the correct implementation of the formula:

def rotT(T, g):
    Tprime = np.zeros((3, 3, 3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for a in range(3):
                        for b in range(3):
                            for c in range(3):
                                for d in range(3):
                                    Tprime[i, j, k, l] += \
                                        g[i, a] * g[j, b] * \
                                        g[k, c] * g[l, d] * T[a, b, c, d]

The equivalent, but faster, calls to einsum and tensordot update to:

Tprime = np.tensordot(g, np.tensordot(g, np.tensordot(
    g, np.tensordot(g, T, (1, 3)), (1, 3)), (1, 3)), (1, 3))
Tprime = np.einsum('ia, jb, kc, ld, abcd->ijkl', g, g, g, g, T)

Additionally, using @jit(nopython=True) from numba on the naive loops function is five times faster than using numpy.tensordot on my machine.

It is good to see *Numba* finally make an appearance in this thread. I don't think anyone has mentioned the underlying fact that Python for loops are *sloowww*, but that if you compile them using Numba, all those loops are converted into LVLL code that runs very fast. Numba allows you to take advantage of parallel operations and even gpus if you *really* need speed. — Diomedea, Mar 12 '21 at 10:02

Fast tensor rotation with NumPy

7 Answers7

Prospective Approach and solution code

Runtime test

Linked