Problem with memory allocation in Julia code

Question

I used a function in Python/Numpy to solve a problem in combinatorial game theory.

import numpy as np
from time import time

def problem(c):
    start = time()
    N = np.array([0, 0])
    U = np.arange(c)
    
    for _ in U:
        bits = np.bitwise_xor(N[:-1], N[-2::-1])
        N = np.append(N, np.setdiff1d(U, bits).min())

    return len(*np.where(N==0)), time()-start 

problem(10000)

Then I wrote it in Julia because I thought it'd be faster due to Julia using just-in-time compilation.

function problem(c)
    N = [0]
    U = Vector(0:c)
    
    for _ in U
        elems = N[1:length(N)-1]
        bits = elems .⊻ reverse(elems)
        push!(N, minimum(setdiff(U, bits))) 
    end
    
    return sum(N .== 0)
end

@time problem(10000)

But the second version was much slower. For c = 10000, the Python version takes 2.5 sec. on an Core i5 processor and the Julia version takes 4.5 sec. Since Numpy operations are implemented in C, I'm wondering if Python is indeed faster or if I'm writing a function with wasted time complexity.

The implementation in Julia allocates a lot of memory. How to reduce the number of allocations to improve its performance?

Asking yourself "is language X faster than language Y?" is usually a red herring. Aside from technicalities such as different implementations, most languages aren't used "pure" – like your Python program calling into compiled languages' code via ``numpy`` – and there's no free lunch: a JIT gains long-term performance at the cost of short-term warmup. *Your specific program* in Julia is slower than *your specific program* in Python. — MisterMiyagi, Jan 19 '22 at 06:45
Extrapolating "Is X better then Y" from 1 datapoint is kinda dangerous, don't you think? — Patrick Artner, Jan 19 '22 at 06:47
I don't think this should be closed. it needs some editing, but the specific comparison is on topic. — Oscar Smith, Jan 19 '22 at 06:54
Please do not rely on links to provide necessary information – for example, the Python code should be in the question itself. You can [edit] your question at any time to adjust it according to the [ask] and [mre] help pages; the question will go through a re-open process once it is sufficiently edited. — MisterMiyagi, Jan 19 '22 at 07:03
The code in Julia is not written efficiently. The major inefficiency is that it makes a lot of data copying and allocations (note that in Python indexing creates a view as opposed to Julia where you have to opt-in for this). I have re-written it to get around 40x speed improvement in Julia execution time. I have edited the question to reflect this and voted to re-open — Bogumił Kamiński, Jan 19 '22 at 07:22
@Bogumił Kamiński How do you know that the copying and allocations are the major inefficiency, not the `setdiff`? — Kelly Bundy, Jan 19 '22 at 08:15
Performance of `setdiff` could be different between Python and Julia, but I do not see why Julia should be slower here - this is a basic library function that is likely optimized. If it is then it should be fixed in Julia. I do not have Python installed on my current machine to make a benchmark. On the other hand it was clear from the original code that it was making unnecessary allocations in several places thus my comment. — Bogumił Kamiński, Jan 19 '22 at 08:31
@Bogumił Kamiński Duplicating the `setdiff` call in the Julia solution increases the reported allocations about as much as doing the `bits = ...` six times instead of once. But increases the time much more (from ~9 seconds to ~ 17.5 seconds, instead of just to ~10.5 seconds). So it does look to me like the "computational" cost of `setdiff` is the bigger culprit. I know almost nothing about Julia, though, so might misjudge things :-) — Kelly Bundy, Jan 19 '22 at 09:00
Indeed it seems that 70% of time is spent in `setdiff` in Julia (a rough benchmark). — Bogumił Kamiński, Jan 19 '22 at 09:37
my guess is that we're missing an important optimization for setdiff. — Oscar Smith, Jan 19 '22 at 13:11

Bogumił Kamiński · Accepted Answer · 2022-01-19T09:34:21.590

17

The original code can be re-written in the following way:

function problem2(c)
    N = zeros(Int, c+2)
    notseen = falses(c+1)

    for lN in 1:c+1
        notseen .= true
        @inbounds for i in 1:lN-1
            b = N[i] ⊻ N[lN-i]
            b <= c && (notseen[b+1] = false)
        end
        idx = findfirst(notseen)
        isnothing(idx) || (N[lN+1] = idx-1)
    end
    return count(==(0), N)
end

First check if the functions produce the same results:

julia> problem(10000), problem2(10000)
(1475, 1475)

(I have also checked that the generated N vector is identical)

Now let us benchmark both functions:

julia> using BenchmarkTools

julia> @btime problem(10000)
  4.938 s (163884 allocations: 3.25 GiB)
1475

julia> @btime problem2(10000)
  76.275 ms (4 allocations: 79.59 KiB)
1475

So it turns out to be over 60x faster.

What I do to improve the performance is avoiding allocations. In Julia it is easy and efficient. If any part of the code is not clear please comment. Note that I concentrated on showing how to improve the performance of Julia code (and not trying to just replicate the Python code, since - as it was commented under the original post - doing language performance comparisons is very tricky). I think it is better to concentrate in this discussion on how to make Julia code fast.

EDIT

Indeed changing to Vector{Bool} and removing the condition on b and c relation (which mathematically holds for these values of c) gives a better speed:

julia> function problem3(c)
           N = zeros(Int, c+2)
           notseen = Vector{Bool}(undef, c+1)

           for lN in 1:c+1
               notseen .= true
               @inbounds for i in 1:lN-1
                   b = N[i] ⊻ N[lN-i]
                   notseen[b+1] = false
               end
               idx = findfirst(notseen)
               isnothing(idx) || (N[lN+1] = idx-1)
           end
           return count(==(0), N)
       end
problem3 (generic function with 1 method)

julia> @btime problem3(10000)
  20.714 ms (3 allocations: 88.17 KiB)
1475

edited Jan 19 '22 at 09:34

answered Jan 19 '22 at 07:43

Bogumił Kamiński

66,844
3
80
107

Hmm, now I'm curious about the speed of an equally optimized Python solution, though. – Kelly Bundy Jan 19 '22 at 07:49
2

The `for i in 1:lN-1` loop would have to be written in e.g. Numba or Cython. I do not think it is easily doable in Python (but I might be wrong). – Bogumił Kamiński Jan 19 '22 at 07:52
1

I don't mean necessarily rewritten the exact same way, and am more interested in a version with just NumPy. I just think the setdiff takes most of the time and it could be replaced similarly (but still with NumPy operations). – Kelly Bundy Jan 19 '22 at 08:11
1

`notseen[b+1] = (b > c) && notseen[b+1]` moves the branch, maybe making this slightly faster. – phipsgabler Jan 19 '22 at 08:34
2

Ha! Using `notseen = fill(false, c+1)` allows you to use `notseen[b+1] = (b > c) & notseen[b+1]` without the overhead from bit twiddling, cutting the time in half at only slight memory overhead. – phipsgabler Jan 19 '22 at 08:42
the `b > c` check is needed in general as `notseen[b+1]` might be out-of-bounds. If we were sure that this never happens then the condition can be removed (I have not checked mathematically if this is the case). Actually I now see that I have a bug and it should be `b+1 <= c`. – Bogumił Kamiński Jan 19 '22 at 08:54
Well, within `@inbounds` you better know you're in bounds for sure :D But anyway, of course it's better to actually think about what the algorithm does instead of blindly looking at the code like I did. – phipsgabler Jan 19 '22 at 08:58
Anyway, just using `Vector{Bool}` instead of `BitVector` gives a speedup of 2.5x for me, at a cost of 10% more allocation. – DNF Jan 19 '22 at 09:07
`notseen` always starts with c+1 trues and you set at most c to false, right? Why the isnothing check? – Kelly Bundy Jan 19 '22 at 09:58
Yes - as per comment of @DNF - it could be omitted after the mathematical analysis of the problem. But I was just translating the code (without analyzing the meaning) so I added the check. However, removing it should not have a significant performance impact as it is outside of the inner loop. – Bogumił Kamiński Jan 19 '22 at 10:07
You mean `b <= c`? I don't know about that one. But`c < c+1` seems trivial. Didn't mention it for speed but for brevity/clarity, I tend to find stuff like that misleading (checking something suggests that it can happen). – Kelly Bundy Jan 19 '22 at 10:18
Yes - it is trivial if you think about the meaning of the code. I was doing just code translation without thinking about it (I should probably have done this :)). – Bogumił Kamiński Jan 19 '22 at 10:26
2

On my laptop: python: 2.6 sec, julia: `problem`: 4.4 sec, `problem2`: 0.09, `problem3`: 0.04 – Antonello Jan 19 '22 at 13:25

Problem with memory allocation in Julia code

1 Answers1

EDIT