import numpy as np
from numba import njit

def test(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    
@njit
def test2(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]

x = np.arange(5)    

test2(x)

%timeit test(x)
%timeit test2(x)

776 ns ± 1.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
280 ns ± 2.53 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The gap between them widens as the number of slices increases:

def test(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:3]
    o = x[2:3]
    
@njit
def test2(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:3]
    o = x[2:3]
    
test2(x)

%timeit test(x)
%timeit test2(x)
1.18 µs ± 1.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
279 ns ± 0.562 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The NumPy function seems to get linearly slower, while the Numba function doesn't care how many times you slice (which is what I was expecting to happen in both cases).

EDIT:

After chrslg's answer I decided to add a return statement to both functions. I just added, to both:

return k,l,m,n,o

and the timings were:

1.23 µs ± 2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.61 µs ± 9.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So the Numba function now becomes slower, which seems to confirm that it was indeed just dead code. However, after seeing user jared's comment, I decided to test the same product operation with the slices that he tried:

def test5(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:4]
    o = x[2:3]
    return (k*l*m*n*o)
    
@njit
def test6(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:4]
    o = x[2:3]
    return (k*l*m*n*o)
    
test6(x)

%timeit test5(x)
%timeit test6(x)
5.79 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
787 ns ± 1.52 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now the Numba function becomes faster than the plain Python function again(!) and the speed gap widens. I am honestly more confused now.

Klaus3
  • Numba mostly detects unused code and optimizes it away (dead code elimination). Always do something with the slices, otherwise you are very likely testing a function which doesn't do anything. – max9111 Jul 17 '23 at 07:43
  • The complexity of the compiled code differs. It is very likely more complicated to return several slices than to return one slice with a few very simple multiplications (which take only a few ns). For the details it is always useful to have a look at the LLVM IR, e.g. `print(test6.inspect_llvm(test6.signatures[0]))` – max9111 Jul 17 '23 at 15:29
  • I did this but honestly I can't understand anything in the output. Why is the Numba version so much faster than the NumPy version in the last test, though? Elimination of temporaries? – Klaus3 Jul 17 '23 at 15:35
  • There is just one array (slice) returned instead of 5. If you blow up this example to let's say slices of size 1_000_000 the relative timings will also differ because of that. – max9111 Jul 17 '23 at 15:38
  • Maybe I should open another question, but my intention is exactly to deal with large slices. And doing operations with a massive number of slices becomes more favourable for Numba. I wrote one function where I slice one vector into multiple variables, operate on them and return them, and the Numba function is 10x faster. Array size was about 5000. – Klaus3 Jul 17 '23 at 15:44

1 Answer


Because it is probably doing nothing.

I get the same timings, for numba, for this function

def test3(x):
    pass

Note that test does almost nothing either. Those are just slices, with no action associated with them. So, no data transfer or anything. Just the creation of 3 variables, and some bounds bookkeeping.
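That "no data transfer" point can be checked directly: a NumPy basic slice is a view on the original buffer, not a copy, so creating it costs the same whatever the slice length (a minimal sketch, assuming numpy is available):

```python
import numpy as np

x = np.arange(5)
k = x[1:2]          # basic slicing returns a view; no data is copied
assert k.base is x  # the slice shares x's buffer
x[1] = 99
assert k[0] == 99   # mutating x is visible through the view
```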

If the code were about an array of 5,000,000 elements, with slices of 1,000,000 elements each, it wouldn't be any slower. That, I suppose, is why, when you wanted to scale things up to a "bigger" case, you decided not to increase the data size (you probably knew data size was not relevant here), but to increase the number of slices.

But, well, test, even doing almost nothing, still performs those 3 slices, unused as they are.

Whereas numba compiles generated native code. And the compiler's optimizer has no reason to keep slice variables that are never used afterward.

I am totally speculating here (I've never looked at numba's generated code). But I imagine the code could look like:

void test2(double *x, int xstride, int xn){
    double *k = x + 1*xstride;
    int kstride = xstride;
    int kn = 1;
    double *l = x;
    int lstride = xstride;
    int ln = 3;
    double *m = x;
    int mstride = xstride;
    int mn = 1;
    // From here it would be possible to iterate over those slices.
    // For example, k[:] = m[:] could generate this code:
    // for(int i = 0; i < kn; i++) k[i*kstride] = m[i*mstride];
}

(Here I use stride as a size in double* arithmetic, whereas in reality strides are in bytes; but it doesn't matter, this is just pseudocode.)

My point is: if there were something afterward (like what I put in the comment), then this code, even though it is just a few arithmetic operations, would still be "almost nothing, but not nothing".

But there is nothing afterward. So it is just some local variable initialization, in code with clearly no side effect. It is very easy for the compiler's optimizer to just drop all that code, and compile an empty function, which has exactly the same effect and result.

So, again, just speculation on my behalf. But any decent code generator + compiler should compile test2 down to an empty function. So test2 and test3 are the same thing.

An interpreter, on the other hand, usually does not perform this kind of optimization (first, it is harder to know in advance what is coming; and second, time spent optimizing is spent at runtime, so there is a tradeoff, whereas for a compiler, even 1 hour of compile time spent to save 1 ns of runtime is still worth it).

Edit: some more experiments

The idea that jared and I both had, of doing something, whatever it is, to force the slices to exist, and then seeing what happens when numba really has to perform the slices, is natural. The problem is that as soon as you start doing something, anything, the timing of the slices themselves becomes negligible. Because slicing is nothing.

But, well, statistically you can factor that out and still measure the "slicing" part somehow.

Here is some timing data

Empty function

On my computer, an empty function costs 130 ns in pure Python, and 540 ns with numba.

Which is not surprising. Doing nothing, but doing so while crossing the Python/C frontier, probably costs a bit, just for the crossing. Not much, though.
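The pure-Python half of that "empty" baseline is easy to reproduce with the standard library (the absolute number is machine-dependent; this just shows how the baseline was obtained):

```python
import timeit

def empty():
    pass

# Per-call cost of doing nothing in pure Python, in nanoseconds
# (about 130 ns on my machine; yours will differ)
n = 1_000_000
per_call_ns = timeit.timeit(empty, number=n) / n * 1e9
print(f"empty python call: {per_call_ns:.0f} ns")
```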

Time vs number of slices

The next experiment is the exact one you made (by the way, your post contains its own proof of my answer: you already saw that in pure Python the time is O(n), n being the number of slices, while in numba it is O(1). That alone proves that no slicing is occurring at all. If the slicing were done, in numba as in any other non-quantum computer :D, the cost would have to be O(n). Of course, if it were t = 100 + 0.000001×n, it might be hard to distinguish O(n) from O(1); hence the reason I started by evaluating the "empty" case.)

In pure Python, slicing only, with an increasing number of slices, is obviously O(n), indeed:

(plot: pure Python, time vs number of slices, linear growth)
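One way to run such a sweep (my own sketch, not the exact harness used for these plots): generate an n-slice function with exec and time it for increasing n:

```python
import timeit

def make_slicer(n):
    # Hypothetical measurement helper: build a function that performs
    # n slices of x and nothing else, mirroring test/test2 above.
    body = "\n".join(f"    v{i} = x[{i}:{i + 2}]" for i in range(n)) or "    pass"
    namespace = {}
    exec(f"def f(x):\n{body}\n", namespace)
    return namespace["f"]

x = list(range(64))
for n in (1, 5, 10, 20):
    f = make_slicer(n)
    t = timeit.timeit(lambda: f(x), number=100_000)
    print(n, t)
```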

A linear regression says this is roughly 138 + 274×n, in ns.

Which is consistent with "empty" time

For numba, on the other hand, we get

(plot: numba, time vs number of slices, flat)

So no need for a linear regression to prove that

  1. It is indeed O(1)
  2. Timing is consistent with the 540 ns of "empty" case

Note that this means that, for n = 2 or more slices, on my computer, numba becomes competitive. Below that, it is not. But, well, it is a competition in doing "nothing"...

With usage of slices

If we add code afterward to force the usage of the slices, of course, things change. The compiler can't just remove the slices anymore.

But we have to be careful:

  1. To avoid introducing an O(n) term in the operation itself
  2. To distinguish the timing of the operation from the probably negligible timing of the slicing

What I did, then, is compute an addition of the form slice1[1]+slice2[2]+slice3[3]+...

But whatever the number of slices, I keep 1000 terms in this addition. So for n=2 (2 slices), the addition is slice1[1]+slice2[2]+slice1[3]+slice2[4]+..., still with 1000 terms.
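A sketch of that setup (my reconstruction, not the exact code I ran): whatever n is, the probe sums the same fixed number of terms, cycling through the n slices, so the addition contributes O(1) and only the slicing part grows with n:

```python
TERMS = 1000

def make_probe(n, terms=TERMS):
    # n slices, but always `terms` addition terms cycling through them,
    # so the addition itself is O(1) with respect to n
    def probe(x):
        slices = [x[i:i + 50] for i in range(n)]   # the n slices
        total = 0
        for j in range(terms):
            total += slices[j % n][j % 50]          # fixed number of terms
        return total
    return probe

probe2 = make_probe(2)   # e.g. time probe2(x) for x large enough
```

Timing `make_probe(n)` for increasing n then isolates the per-slice cost as the slope of the fit.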

That should remove the O(n) part due to the addition itself. And then, with big enough data, we can extract some signal from the variations around it, even though those variations are quite negligible next to the addition time itself (and therefore next to the noise of that addition time; but with enough measurements, that noise becomes low enough to start seeing things).

In pure python

(plot: pure Python, slicing plus fixed-size addition, time vs number of slices)

A linear regression gives 199000 + 279×n ns.

What we learn from this is that my experimental setup is OK. 279 is close enough to the previous 274 to say that the addition part, as huge as it is (~200000 ns), is indeed O(1), since the O(n) part remained unchanged compared to slicing only. So we just have the same timing as before, plus a huge constant for the addition part.

With numba

All that was just the preamble to justify the experimental setup. Now comes the interesting part, the experiment itself.

(plot: numba, slicing plus fixed-size addition, time vs number of slices)

The linear regression gives 1147 + 1.3×n ns.

So, here, it is indeed O(n).

Conclusion

Slicing in numba does cost something, and it is O(n). But without any usage of the slices, the compiler just removes them, and we get an O(1) operation.

  1. Proof that the reason was, indeed, that in your version the numba code was simply doing nothing.

  2. The cost of the operation, whatever it is, that you do with the slice to force it to be used (and to prevent the compiler from just removing it) is way bigger, which, without statistical precautions, masks the O(n) part. Hence the feeling that "it is the same when we use the variable".

  3. Anyway, numba is faster than numpy most of the time.

I mean, numpy is a good way to have "compiled language speed" without using a compiled language. But it does not beat real compilation. So it is quite classical to have a naive algorithm in numba beat a very smart vectorization in numpy. (Classical, and very disappointing for someone like me, who made a living being the guy who knows how to vectorize things in numpy. Sometimes I feel that with numba, the most naive nested for loops are better.)

It stops being so, though, when:

  1. Numpy makes use of several cores (you can do that with numba too, but not with just naive algorithms)
  2. You are doing operations for which a very smart algorithm exists. Numpy's algorithms have decades of optimization behind them; you can't beat that with 3 nested loops. Except that some tasks are so simple that they can't really be optimized.

So, I still prefer numpy over numba. I prefer to rely on the decades of optimization behind numpy rather than reinvent the wheel in numba. Plus, sometimes it is preferable not to rely on a compiler.

But, well, it is classical to have numba beating numpy.

Just not with the ratios of your case. Because in your case, you were comparing (as I think I've proven now, and as you'd proven yourself, by seeing that the numpy case was O(n) when the numba case was O(1)) "doing slices with numpy" vs "doing nothing with numba".

chrslg
  • That was my original thought, but I tested it out by adding a return to all the functions, which I set to the product of all the previously unused variables (though I changed `n = x[1:4]` so the shapes work), and still did not see any timing difference for the numba versions. – jared Jul 17 '23 at 13:57
  • I was about to accept this answer but user @jared comment is indeed true, see edits. Quite odd. – Klaus3 Jul 17 '23 at 14:57
  • @Klaus3 The thing is, I did exactly that too, and for the exact same reason (to force the slicing code to be useful), even though I did not post it, because 1. At first glance it is not significant (the computation time of whatever you do with the slices to force them to exist is far more than the cost of the slicing itself; so what you measure is no longer slicing time). 2. Even when removing that effect, it just confirms what I saw without it. I'll edit my answer to add arguments. But I stand more than ever by my answer. – chrslg Jul 18 '23 at 08:19