
I am implementing the Runge-Kutta-Fehlberg method with adaptive step size (RK45). I define and load my Butcher tableau in a notebook with

module FehlbergTableau
    using StaticArrays
    export A, B, CH, CT

    A = @SVector [ 0 , 2/9 , 1/3 , 3/4 , 1 , 5/6 ]

    B = @SMatrix [ 0          0         0        0        0
                   2/9        0         0        0        0
                   1/12       1/4       0        0        0
                   69/128    -243/128   135/64   0        0
                  -17/12      27/4     -27/5     16/15    0
                   65/432    -5/16      13/16    4/27     5/144 ]

    CH = @SVector [ 47/450 , 0 , 12/25 , 32/225 , 1/30 , 6/25 ]

    CT = @SVector [ -1/150 , 0 , 3/100 , -16/75 , -1/20 , 6/25 ]
end

using .FehlbergTableau
using StaticArrays, LinearAlgebra, BenchmarkTools  # SVector/SMatrix/MMatrix, norm and I, @btime

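As a quick sanity check that the coefficients are typed in correctly, each row of B sums to the corresponding node in A, the solution weights CH sum to 1, and the error weights CT sum to 0:

@assert all(isapprox(sum(B[i, 1:i-1]), A[i]; atol = 1e-14) for i in 2:6)  # A[i] == Σⱼ B[i,j]
@assert isapprox(sum(CH), 1.0; atol = 1e-14)   # solution weights
@assert isapprox(sum(CT), 0.0; atol = 1e-14)   # error weights cancel
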
If I code the algorithm for RK45 straightforwardly as

function infinitesimal_flow(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64, Δt::Float64, J∇H::Function, x0::SVector{N,Float64}) where N
    
    k1 = Δt * J∇H( t0 + Δt*A[1], x0 ) 
    k2 = Δt * J∇H( t0 + Δt*A[2], x0 + B[2,1]*k1 ) 
    k3 = Δt * J∇H( t0 + Δt*A[3], x0 + B[3,1]*k1 + B[3,2]*k2 )
    k4 = Δt * J∇H( t0 + Δt*A[4], x0 + B[4,1]*k1 + B[4,2]*k2 + B[4,3]*k3 )
    k5 = Δt * J∇H( t0 + Δt*A[5], x0 + B[5,1]*k1 + B[5,2]*k2 + B[5,3]*k3 + B[5,4]*k4 )
    k6 = Δt * J∇H( t0 + Δt*A[6], x0 + B[6,1]*k1 + B[6,2]*k2 + B[6,3]*k3 + B[6,4]*k4 + B[6,5]*k5 )
    
    TE = CT[1]*k1 + CT[2]*k2 + CT[3]*k3 + CT[4]*k4 + CT[5]*k5 + CT[6]*k6  
    xt = x0 + CH[1]*k1 + CH[2]*k2 + CH[3]*k3 + CH[4]*k4 + CH[5]*k5 + CH[6]*k6 
    
    norm(TE), xt
end                  

and compare it with the more compact implementation

function infinitesimal_flow_2(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64, Δt::Float64, J∇H::Function, x0::SVector{N,Float64}) where N
    
    k = MMatrix{N,6}(0.0I)
    TE = zero(x0); xt = x0
    
    for i=1:6
        # EDIT: this is wrong! there should be a new variable here, as pointed 
        # out by Lutz Lehmann: xs = x0
        for j=1:i-1
            # xs += B[i,j] * k[:,j]
            x0 += B[i,j] * k[:,j] #wrong
        end
        k[:,i] = Δt *  J∇H(t0 + Δt*A[i], x0) 

        TE += CT[i]*k[:,i]
        xt += CH[i]*k[:,i]
    end
    norm(TE), xt
end

Then the first function, which writes each stage out explicitly, is much faster:

J∇H(t::Float64, X::SVector{N,Float64}) where N = @SVector [ -X[2]^2, X[1] ]
x0 = SVector{2}([0.0, 1.0])
infinitesimal_flow(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)
infinitesimal_flow_2(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)

@btime infinitesimal_flow($A, $B, $CH, $CT, 0.0, 1e-2, $J∇H, $x0)
>> 19.387 ns (0 allocations: 0 bytes)
@btime infinitesimal_flow_2($A, $B, $CH, $CT, 0.0, 1e-2, $J∇H, $x0)
>> 50.985 ns (0 allocations: 0 bytes)

I cannot find a type instability or anything else that would explain the slowdown, and for more complex tableaus I really need the loop form of the algorithm. What am I doing wrong?
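
For reference, this is how I looked for type instabilities (`@code_warntype` lives in InteractiveUtils):

@code_warntype infinitesimal_flow(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)    # look for Any-typed (red) entries
@code_warntype infinitesimal_flow_2(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)  # neither shows an abstractly typed variable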

P.S.: The bottleneck in `infinitesimal_flow_2` is the line `k[:,i] = Δt * J∇H(t0 + Δt*A[i], x0)`.

QuantumBrick
  • A slowdown of 2.63x is relatively tame. Your `@btime` results suggest the compiler eliminated allocation of the `MMatrix` and its slices, and `J∇H` runs 6 times on the same argument types in both functions, so I wouldn't say those are the reason. The only remaining thing I can think of is the for-loops adding a bit of overhead and making it harder for the compiler to optimize (like it could recognize `xt = x0 + TE` in the unrolled version), but that's so small I can't say for certain if it would 2.63x the runtime. If only reading and comparing `@code_native` were easy! – BatWannaBe Sep 07 '21 at 19:19
  • @BatWannaBe I also thought it was tame, but this function is applied inside a while loop trillions of times (literally). For 12 degrees of freedom the slowdown is insane: employing `infinitesimal_flow_2` is one thousand times slower than using `infinitesimal_flow`. Something is very wrong here! – QuantumBrick Sep 07 '21 at 19:41
  • What do you mean by degrees of freedom, exactly? N? The 6 being 12 instead? If it increases your `MMatrix` size, that could be why it became 1000x slower; allocation elimination is always kind of a mystery and it's possible the compiler stops doing it for larger sizes. Maybe you should post the 12 degrees of freedom version and its `@btime` results instead? "1000x" slowdown would be more eye-catching too. – BatWannaBe Sep 07 '21 at 19:48
  • @BatWannaBe No, the objects in the tableau are immutable. By degrees of freedom I mean changing the size of `x0` (that is, `N` is the thing that changes). In the current example there is a single degree of freedom (two-dimensional phase space), but for 6 degrees of freedom phase space is 12-dimensional. I will modify my example to 3 or 4 degrees of freedom tomorrow, so the 2.63x slowdown will become something more striking. – QuantumBrick Sep 07 '21 at 19:51
  • I'm not sure why you say it's immutable. `MMatrix` is a mutable type and is heap-allocated, the docs say so too. You also mutate it in the line `k[:,i] = Δt * J∇H(t0 + Δt*A[i], x0)`. My point was just that heap allocations are a big source of slowdown, and making a mutable Array `k` and all the copied slices `k[:,i]` would normally heap-allocate a bunch. The compiler did a fair bit of work to reduce that to "0 allocations" (implies on the heap) in your current post's example. – BatWannaBe Sep 07 '21 at 20:03
  • @BatWannaBe you're right, the matrix does enlarge. It's just that it grows not because of the 6 (that index is fixed) but because of `N`. I agree with you now. – QuantumBrick Sep 07 '21 at 20:09
  • Check if both methods return the same results, as far as I can see the continuous accumulation of x0 in the second one is wrong, you need to reset x0 in each stage. Or better use an additional `xs = x0`, loop `xs += B[i,j] * k[:,j]`, then `k[:,i] = Δt * J∇H(t0 + Δt*A[i], xs)`. – Lutz Lehmann Sep 08 '21 at 06:00
  • @LutzLehmann You are totally right. An older version of this algorithm defined a new variable within the loop, but I took it out carelessly. This comment was actually the answer to my problem, because infinitesimal_flow_2 was returning a wrong propagated point and the while loop calling it was dramatically reducing the step-size to converge. It achieved convergence, but with thousands more iterations. My problems are over. Would you please post an answer, even if a one liner, so that the question can be solved? (Or maybe I should delete it, since the title is wrong.. that wasn't the problem) – QuantumBrick Sep 08 '21 at 06:27
  • Something else: from the unrolled version it seems that you don't need the zeroed values of the matrix `k`. It's possible to define it this way: `k = MMatrix{N,6, Float64, 6*N}(undef)`. It doesn't gain much in your example, but the gain is more significant for larger matrices. – Thomas Jalabert Sep 09 '21 at 00:06

1 Answer


Each stage of the RK method computes its evaluation point directly from the base point of the RK step. This is explicit in the first method. In the second method you would have to reset the point computation in each stage, such as in

    for i=1:6
        xs = x0
        for j=1:i-1
            xs += B[i,j] * k[:,j]
        end
        k[:,i] = Δt *  J∇H(t0 + Δt*A[i], xs) 
        ...

The slightest error in the stage computation can catastrophically throw off the step-size controller: the error estimate stays large, the step size is driven towards zero, and the effort increases drastically. A known example is the 4101 (instead of 4104) typo that has circulated in RKF45 coefficient tables.
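
Putting the fix into the full routine from the question (a sketch using the same setup as above; the `undef` initialization of `k` follows Thomas Jalabert's comment and is optional):

function infinitesimal_flow_2(A::SVector{6,Float64}, B::SMatrix{6,5,Float64},
                              CH::SVector{6,Float64}, CT::SVector{6,Float64},
                              t0::Float64, Δt::Float64, J∇H::Function,
                              x0::SVector{N,Float64}) where N

    k  = MMatrix{N,6,Float64,6*N}(undef)  # stage slopes, one column per stage
    TE = zero(x0)                         # accumulated error estimate
    xt = x0                               # accumulated solution

    for i = 1:6
        xs = x0                           # every stage restarts from the base point
        for j = 1:i-1
            xs += B[i,j] * k[:,j]
        end
        k[:,i] = Δt * J∇H(t0 + Δt*A[i], xs)

        TE += CT[i] * k[:,i]
        xt += CH[i] * k[:,i]
    end
    norm(TE), xt
end

With this change the loop version and the unrolled version return the same `(norm(TE), xt)` up to floating-point round-off, and the step-size controller no longer collapses the step size.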

Lutz Lehmann
  • Interestingly, the corrected function was sometimes slower in a single evaluation than the one written explicitly. However, when included inside the loop, somehow it was almost always faster. Is there something "more stable" about `infinitesimal_flow_2` when compared to `infinitesimal_flow`? – QuantumBrick Sep 08 '21 at 06:30
  • The first has "loop unrolling" in its explicit formulation. The second has a certain overhead in allocating the extra state variable. If used in a larger loop, the compiler might inline this and sort the allocations to occur only once. Then the loop formulation might give more flexibility to compiler optimizations? – Lutz Lehmann Sep 08 '21 at 06:35
  • That might indeed be happening. I often forget LLVM is incredible, and how much smarter than me the compiler is. – QuantumBrick Sep 08 '21 at 07:31