21

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?

I'm not looking to move this code to something like a tree code (Barnes-Hut); I'm looking to optimize the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.

Any help is appreciated!

Edit: I'm running on Windows 7 64-bit with Visual Studio 2008 on a Core 2 Duo T5850 (2.16 GHz).

typedef double real;

class Vector
{
private:
    real vec[3];

public:
    // Operators defined here
};

struct Particle
{
    Vector pos, vel, acc, jerk;
    Vector oldPos, oldVel, oldAcc, oldJerk;
    real mass;
};

real Gravity::interact(Particle *p, size_t numParticles)
{
    PROFILE_FUNC();
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].jerk = 0;
        p[i].acc = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            Vector r = p[j].pos - p[i].pos;
            Vector v = p[j].vel - p[i].vel;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Calculate G / |r|^3
            real r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = v / |r|^3 - 3 * (r . v) * r / |r|^5
            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            real mij = p[i].mass + p[j].mass;
            real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            real tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }

    return sqrt(sqrt(tau_q));
}
Mike Bailey
  • do you know what costs you the most? – Keith Nicholas Apr 27 '10 at 22:48
  • What CPU ? What compiler ? What OS ? – Paul R Apr 27 '10 at 22:51
  • @Keith: Not in particular, my profiler only provides call count and timings. @Paul: I added the information to the post – Mike Bailey Apr 27 '10 at 22:58
  • Try breaking your inner loop down into several small functions so that you can further pinpoint the bottleneck with your profiler. Alternatively, comment-out different chunks of code of your inner loop until you remove the bottleneck. – Emile Cormier Apr 27 '10 at 23:57
  • exactly, is that pow killing you, or is the > < operators, or perhaps something else. What makes a significant difference to the time – Keith Nicholas Apr 28 '10 at 00:46
  • Is there any way you can call it less? If you can call it fewer times, it will still take nearly all the time, but the total time will be reduced. – Mike Dunlavey Apr 29 '10 at 00:16
  • @Mike: There's no way to call it less unfortunately. My simulator requires the calculation of gravitational interactions followed by an integration at every time step. – Mike Bailey Apr 29 '10 at 04:40
  • I was just thinking, what kind of integration algorithm are you using? If you are using Euler (the simplest and most obvious) your accuracy will be poor unless you use extremely small step size (& many calls to this routine). A low-order Runge-Kutta has more calls per step, but can use a larger step size & may result in fewer calls. – Mike Dunlavey Apr 29 '10 at 11:53
  • I'm not using Euler. It is available as an option (The standard, a modified, and a symplectic version) but it's not my primary integrator. I usually use Leapfrog (in the case of periodic orbits) or a 4th order Hermite scheme, which the jerk calculation is required for. I'm trying to avoid Runge-Kutta methods due to the higher number of evaluations with respect to order. – Mike Bailey Apr 29 '10 at 13:05
  • I know this is late, but what about integer math techniques? Instead of real vec[3], use int64 vec[3]. The only time you need to convert from real numbers to your integer vector numbers is when you want to initialize the simulation or display the current iteration. Decide on the # of decimal places you need for accuracy. Simply scale input by 1^d, where d is the # of decimals, when creating a particle. Scale by 1/1^d to display current the particle. – johnnycrash Feb 06 '13 at 21:46
  • @johnnycrash: I have considered integer arithmetic, but that would lose an unacceptably large amount of precision. – Mike Bailey Feb 09 '13 at 03:27

14 Answers

23
  1. Inline the calls to lengthsq().

  2. Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of computing r2^-1.5.

  3. Use scalars (p_i_acc, etc.) inside the innermost loop rather than p[i].acc to collect your results. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily (see the sketch at the end of this answer).

4a. Try replacing the if (...) tau_q = with

    tau_q=minimum(...,...)

Many compilers recognize the minimum function as one they can implement with predicated operations rather than real branches, avoiding pipeline flushes.

4b. [EDIT to split 4a and 4b apart] You might consider storing tau_est_q2 as r2/v2 instead, i.e. comparing against r2/v2 rather than (r2*r2)/(v2*v2). Then you avoid two multiplies per inner-loop iteration, in trade for a single squaring operation to compute tau_est_q2 at the end. To do this, collect the minimums of tau_est_q1 and tau_est_q2 (the latter not squared) separately, and take the minimum of those results in a single scalar operation once the loop completes.

  5. [EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and a large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.

Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand-coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it]), and c) going multicore (using OpenMP).
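To make points 1–4a concrete, here is a rough sketch of the pair loop with the pow() call replaced, local accumulators for particle i, and branchless minimums via std::min. It keeps the initial reset loop as-is, and assumes <algorithm> and <cmath> are included and that Vector supports assignment from 0 as in the question's reset loop; treat it as an illustration to check against your profiler, not a drop-in replacement:

for (size_t i = 0; i < numParticles; i++)
{
    // Local accumulators avoid re-addressing p[i] on every j (suggestion 3)
    Vector iacc, ijerk;
    iacc = 0;
    ijerk = 0;
    const real imass = p[i].mass;

    for (size_t j = i+1; j < numParticles; j++)
    {
        Vector r = p[j].pos - p[i].pos;
        Vector v = p[j].vel - p[i].vel;
        real r2 = lengthsq(r);
        real v2 = lengthsq(v);

        // Suggestion 2: r2^(-1.5) without calling pow()
        real r3i = Constants::G / (r2 * sqrt(r2));

        Vector da = r * r3i;
        Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

        iacc  += da * p[j].mass;
        ijerk += dj * p[j].mass;
        p[j].acc  -= da * imass;
        p[j].jerk -= dj * imass;

        real mij = imass + p[j].mass;
        // Suggestion 4a: min() instead of branches
        tau_q = std::min(tau_q, r2 / (lengthsq(da) * mij * mij));
        tau_q = std::min(tau_q, (r2 * r2) / (v2 * v2));
    }

    // Write particle i's totals back once per outer iteration
    p[i].acc  += iacc;
    p[i].jerk += ijerk;
}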

Ira Baxter
  • Or better, `r2*r2*inverse_sqrt(r2)` (choose your platform's appropriate function) avoids division entirely. Or write `r2*r2/sqrt(r2)`, use your compiler's loose math setting, and check the assembly output or just benchmark. – Potatoswatter Apr 27 '10 at 22:53
  • I should have mentioned earlier the calls to lengthsq() and other Vector operations are all defined as inline. I will try your second and third suggestion though. Could I get around #3 by telling my compiler that the pointers are not aliased, or would it provide a larger speedup? – Mike Bailey Apr 27 '10 at 22:55
  • @Potatoswatter: good idea, but that's `pow(r2, +1.5)` - it needs to be `pow(r2, -1.5)` – Paul R Apr 27 '10 at 22:57
  • @Sagekilla: you can try telling your compiler that the pointers aren't aliased. The amount of recoding to use the scalars is pretty minimal, and it makes you independent of whatever crazy interpretation the compiler gives to that switch. When in doubt, play it safe. – Ira Baxter Apr 27 '10 at 23:05
  • @Paul, d'oh! Well, `inverse_sqrt(r2)/r2` then. – Potatoswatter Apr 27 '10 at 23:11
  • You needn't use assembly to go SIMD; most compilers offer intrinsic functions, which look like C function calls but actually translate to a native opcode. – Crashworks Apr 27 '10 at 23:29
  • +1 for aliasing. It might be (minorly) helpful to save Constants::G as a local variable too. – celion Apr 27 '10 at 23:36
  • @Celion: Constants::G presumably *is* a constant. How will putting a constant into a local variable do anything but slow the code down? – Ira Baxter Apr 27 '10 at 23:46
  • I suppose it depends on the compiler and what Constants::G actually is. The compiler might not be able to assume that &p[i] != &Constants::G, and thus have to reload Constants::G after every memory write. I can't find the code snippet now, but a global variable like this is the specific example I've seen introducing the motivation for the __restrict keyword. – celion Apr 28 '10 at 00:05
  • @Ira: What exactly is loop folding? I did some google searching but I couldn't find what it is you're talking about here. – Mike Bailey Apr 28 '10 at 13:25
  • @Sagekilla: it means merging the bodies of the loops. You can trivially move the initialization out of your first loop, into the head of the second one. – Ira Baxter Apr 28 '10 at 14:02
  • @Ira: I don't think I can actually do that. Inside the second pair of for loops I accumulate the acceleration and jerk on p[i] and p[j]. If I pushed the first loop into the body of the second loop I would get incorrect acceleration and jerk, due to how I calculate them. – Mike Bailey Apr 28 '10 at 14:11
  • @Sagekilla: I think I missed the fact that you are updating p[j].jerk in the loop by misreading it as p[i].jerk. In that case, you're right, you can't do that. You should still use the scalar trick in the loop. – Ira Baxter Apr 28 '10 at 14:51
7

This line, real r3i = Constants::G * pow(r2, -1.5);, is going to hurt. Any kind of sqrt lookup or platform-specific help with the square root would help.

If you have SIMD capabilities, breaking your vector subtracts and squares out into their own loop and computing them all at once will help a bit. Same for your mass/jerk calcs.

Something that comes to mind is: are you keeping enough precision in your calculations? Taking things to the 4th power and 4th root really thrashes your available bits through the under/overflow blender. I'd make sure that your answer is indeed your answer when complete.

Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.

Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius-squared check to fight the O(n^2) behavior of your loop. If you eliminated half your particles, it would run around 4x faster.

One last thing. You could thread your inner loop across multiple processors. You'd have to make a separate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally the acceleration/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)
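A minimal sketch of that per-thread idea using OpenMP (VS2008 supports OpenMP 2.0 with /openmp). The private acc/jerk buffers and the hypothetical interactParallel name are assumptions about how you might structure it, reusing the question's Vector/Particle/lengthsq/dot; it is an outline, not tested code:

#include <omp.h>
#include <vector>
#include <cmath>

real Gravity::interactParallel(Particle *p, size_t numParticles)
{
    real tau_q = 1e300;
    const int nThreads = omp_get_max_threads();

    // One private acc/jerk buffer per thread, so no two threads ever write
    // to the same particle and the pair loop needs no locks.
    std::vector<Vector> accBuf((size_t)nThreads * numParticles);
    std::vector<Vector> jerkBuf((size_t)nThreads * numParticles);

    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        Vector *acc  = &accBuf[(size_t)t * numParticles];
        Vector *jerk = &jerkBuf[(size_t)t * numParticles];
        for (size_t k = 0; k < numParticles; k++) { acc[k] = 0; jerk[k] = 0; }

        real my_tau_q = 1e300;

        // OpenMP 2.0 requires a signed loop counter; dynamic scheduling helps
        // because the inner loop gets shorter as i grows.
        #pragma omp for schedule(dynamic, 16)
        for (long i = 0; i < (long)numParticles; i++)
        {
            for (size_t j = i + 1; j < numParticles; j++)
            {
                Vector r = p[j].pos - p[i].pos;
                Vector v = p[j].vel - p[i].vel;
                real r2 = lengthsq(r);
                real v2 = lengthsq(v);
                real r3i = Constants::G / (r2 * sqrt(r2));

                Vector da = r * r3i;
                Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

                acc[i]  += da * p[j].mass;
                jerk[i] += dj * p[j].mass;
                acc[j]  -= da * p[i].mass;
                jerk[j] -= dj * p[i].mass;

                real mij = p[i].mass + p[j].mass;
                real tq1 = r2 / (lengthsq(da) * mij * mij);
                real tq2 = (r2 * r2) / (v2 * v2);
                if (tq1 < my_tau_q) my_tau_q = tq1;
                if (tq2 < my_tau_q) my_tau_q = tq2;
            }
        }

        #pragma omp critical
        {
            if (my_tau_q < tau_q) tau_q = my_tau_q;
        }
    }

    // Tally the per-thread buffers back into the particles: O(threads * N).
    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].acc = 0;
        p[i].jerk = 0;
        for (int t = 0; t < nThreads; t++)
        {
            p[i].acc  += accBuf[(size_t)t * numParticles + i];
            p[i].jerk += jerkBuf[(size_t)t * numParticles + i];
        }
    }

    return sqrt(sqrt(tau_q));
}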

Michael Dorgan
  • Yes, my calculations are precise. The tau_q calculation is actually necessary the way it is, due to dimensional analysis. The tau_est_q1 and tau_est_q2 calculations are designed that way so I get the proper units of seconds to the fourth power, which would be converted to seconds at the end. – Mike Bailey Apr 27 '10 at 23:07
  • Culling your particles on distance possible? – Michael Dorgan Apr 27 '10 at 23:08
  • Culling unfortunately is not possible, as this is a pure gravity simulation. All pairwise interactions have to be accounted for. I will at some point try to attain O(n logn) by using the Barnes-Hut algorithm, but my main issue will remain: Can my calculations inside the loop be optimized still? – Mike Bailey Apr 27 '10 at 23:12
3
  • Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.

  • You might consider whether you can use floats rather than doubles.

  • If you're using gcc then make sure you're using -O2 or possibly -O3.

  • You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).

  • Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).

Paul R
  • I've looked into the possibility of using floats, which I've decided to take account for by providing the real typedef. In this particular case, I need high speed and accuracy that utilizes full machine precision (to 1E-15) – Mike Bailey Apr 27 '10 at 22:52
  • As a quick test - try `long double`, just to compare performance with `double`. You never know, it could conceivably be faster. – Steve Jessop Apr 28 '10 at 00:39
3

If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.

Emile Cormier
3

Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.


If you move the mass out too, you could store things like this:

struct ParticleData
{
    Vector pos, vel, acc, jerk;
};

ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...

then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
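For example, once the old data lives in its own array, the end-of-step snapshot is (a sketch, assuming ParticleData stays a plain struct with no pointers, so a raw byte copy is safe):

#include <cstring>   // memcpy

// One block copy instead of copying eight Vectors per particle:
std::memcpy(oldParticles, currentParticles,
            numParticles * sizeof(ParticleData));

If "old" only ever means "previous step" and your integrator fully rewrites the current buffer each step anyway, swapping the two pointers would avoid even that copy.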


If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g

struct ParticleData
{
    // data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
    Vector4 data_x;

    // data_y[0] == pos.y, data_y[1] = vel.y, etc.
    Vector4 data_y;

    // data_z[0] == pos.z, data_z[1] == vel.z, etc.
    Vector4 data_z;
};

where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)

Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).

celion
  • Thank you! That's an interesting idea, I'll have to give this a try. Although my code doesn't explicit state it, the oldData is used in my integration functions where I need to keep track of older data. I will try splitting my data and doing that memcpy. – Mike Bailey Apr 28 '10 at 05:09
  • The memcpy is a small benefit and only O(N). Reducing cache misses in the inner loop is where you'll see the main gain (I hope). – celion Apr 28 '10 at 06:42
2

My first advice would be to look at the molecular dynamics literature; people in that field have considered a lot of optimizations for particle systems. Have a look at GROMACS, for example.

With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.
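A rough sketch of the cut-off / neighbour-list idea (rcut and skin are assumed tuning parameters; as the OP notes in the comments above, this doesn't apply to an exact pure-gravity run, only to cases where the approximation is acceptable):

#include <vector>

// Rebuild the lists only every few steps; the "skin" buffer keeps them valid
// in between even as particles move a little.
real rlist2 = (rcut + skin) * (rcut + skin);
std::vector< std::vector<size_t> > neighbours(numParticles);

for (size_t i = 0; i < numParticles; i++)
    for (size_t j = i + 1; j < numParticles; j++)
        if (lengthsq(p[j].pos - p[i].pos) < rlist2)
            neighbours[i].push_back(j);

// The force loop then visits only nearby pairs:
for (size_t i = 0; i < numParticles; i++)
    for (size_t k = 0; k < neighbours[i].size(); k++)
    {
        size_t j = neighbours[i][k];
        // ... same pair calculation as in the question ...
    }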

Adrien
2

All good stuff above. I've been doing similar things with a 2nd order (Leapfrog) integrator. The next two things I did, after considering many of the improvements suggested above, were to start using SSE intrinsics to take advantage of vectorization, and to parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.

SSE example:

http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp

Novel cache algorithm, explanation and example code:

http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/

http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp

You might also find the following deck I gave at Seattle Code Camp interesting:

http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/

Your fourth order integrator is more complex and would be harder to parallelize with limited gains on a two-core system, but I would definitely suggest checking out SSE; I got some reasonable performance improvements there.

Ade Miller
  • Thanks a lot! Those links were extremely useful. I already have another version that's parallelized where I looped n^2 (instead of (n^2 - n) / 2) times so that I could accumulate the ith particle using force of the jth particle. That version had no race conditions and no need for locking because of how I wrote it. – Mike Bailey Apr 29 '10 at 04:36
  • I just took a look at your NBody code (Are you also reading Art of Computational Science?) and I have one small recommendation: Try writing your integrators as a predict and correct step, and pull out your force calculation. Then you can write it as a PEC scheme (Predict, Evaluate, Correct). – Mike Bailey Apr 29 '10 at 04:51
  • That's how mine are written too. The novel algorithm gets you back to (n^2 - n) / 2) for parallel code and also lets you fit blocks of the problem into L1 cache reducing cache misses. This made a significant difference for me - nearly 2x. At some point cache misses become by far your biggest overhead. – Ade Miller Apr 29 '10 at 04:55
  • Yes, I've read chunks of the Art of Comp Sci. I should really get back to it. Currently everything uses a global timestep. I should play around with that some more and move to individual timesteps. I sort of got sidetracked by the whole CUDA thing, which is fast but hard to code for. I have a Hermite integrator in C#, maybe I'll play around with that. Thanks! – Ade Miller Apr 29 '10 at 05:03
1

Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?

What's real? Can it be a float?

Apart from that you'll have to turn to MMX/SSE/assembly optimisations.

AshleysBrain
1

Would you benefit from the famous "fast inverse square root" algorithm?

float InvSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}

It returns a reasonably accurate approximation of 1/sqrt(x) (the first iteration of Newton's method with a clever initial guess). It is widely used in computer graphics and game development.

Escualo
  • This is not an optimization on almost any current architecture. If you need a fast approximate inverse square root, use a dedicated hardware instruction to get it (`rsqrtss` on x86 with SSE, `vrsqrte.f32` on ARM with NEON, `vrsqrtefp` on PowerPC with AltiVec). They're all both faster and more accurate. – Stephen Canon Apr 28 '10 at 17:59
  • I can verify what Stephen claims. I did some timings of the invsqrt and sqrt then divide, and the invsqrt was not only marginally faster but far more inaccurate. For a similar level of accuracy (by adding more iterations of Newton's method -- That last return step) invsqrt was slower. – Mike Bailey Apr 29 '10 at 04:42
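For reference, a minimal sketch of the hardware route from the comment above, assuming SSE is available; note that rsqrtss is single precision, so this only helps if float accuracy is acceptable:

#include <xmmintrin.h>

// ~12-bit estimate of 1/sqrt(x) from the rsqrtss instruction, sharpened with
// one Newton-Raphson step.
inline float rsqrt_sse(float x)
{
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}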
1

Consider also pulling your multiplication by Constants::G out of the loop. If you can change the semantic meaning of the stored vectors so that they effectively store value/G, you can do the gravitational constant multiplication only where it is needed.

Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.

Consider splitting your particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could then be an efficient memset. You would essentially have two arrays (or vectors) where particle 'n' is stored at index 'n' of each array.
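A sketch of that split (the exact field grouping is an assumption; the memset is fine for IEEE doubles, where all-bits-zero is 0.0):

#include <cstring>   // memset

struct ParticleState    // read-only inside the pair loop
{
    Vector pos, vel;
    real mass;
};

struct ParticleDerivs   // written inside the pair loop
{
    Vector acc, jerk;
};

ParticleState*  state  = ...;   // particle n lives at index n of both arrays
ParticleDerivs* derivs = ...;

// The whole reset loop collapses to one call:
std::memset(derivs, 0, numParticles * sizeof(ParticleDerivs));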

StarLight
0

Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.

Now then, always apply algorithmic optimizations first, and only when no faster algorithm is available should you resort to piecemeal optimization in assembly. And then, do inner loops first.

You may want to profile to see if this is really the bottleneck first.

Joshua
  • Err.. doing it wrong? You assume that the average assembly programmer is better than a compiler -- which is usually not true. – Billy ONeal Apr 27 '10 at 22:47
  • I have profiled my code and can verify this code does account for 99% of the execution time. I was just wondering if there are any top level optimizations left before I start using assembly. – Mike Bailey Apr 27 '10 at 22:49
  • If you have profiled, is there a line that takes much more than the others? – Vicente Botet Escriba Apr 27 '10 at 23:14
  • @Billy I don't know about the average developer but I can almost always take the compiler's output and squeeze a few more % out of it. – Joshua Apr 28 '10 at 02:41
  • @Johsua: Really? http://blogs.msdn.com/oldnewthing/archive/2004/12/16/317157.aspx <-- Prime example of why you should leave this to compilers. – Billy ONeal Apr 28 '10 at 03:16
  • Never optimize w/o profiling before and after. With something like this one, I'm likely to optimize registers across function calls, something the compiler cannot do. – Joshua Apr 28 '10 at 03:37
0

The thing I look for is branching; branches tend to be performance killers.

You can use loop unrolling.

Also, remember that multiple loops over smaller parts of the problem:

  for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {

are about the same as having one loop doing everything, and you can get speed-ups through loop unrolling and better use of the cache.

You could thread this to make better use of multiple cores

You have some expensive calculations that you might be able to reduce, especially if they end up calculating the same thing; you can use caching, etc.

But you really need to know where it's costing you the most.

Keith Nicholas
0

You should re-use the reals and Vectors that you use every iteration by declaring them outside the loops. The cost of constructing a Vector or real might be trivial... but not if numParticles is very large, especially with your seemingly O((n^2)/2) loop.

Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            r = p[j].pos - p[i].pos;
            v = p[j].vel - p[i].vel;
            r2 = lengthsq(r);
            v2 = lengthsq(v);

            // Calculate G / |r|^3
            r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = v / |r|^3 - 3 * (r . v) * r / |r|^5
            da = r * r3i;
            dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            mij = p[i].mass + p[j].mass;
            tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }
Puppy
  • As there is no constructor and no members, that vector is a POD and as such, there shouldn't be any initialization code right? One would hope that the compiler would be smart enough to 'register' those values. But you never know... – Michael Dorgan Apr 28 '10 at 00:25
  • yes, I'd hope the compiler would perform this. But, better safe than sorry, in my experience. It's generally good practice to re-use reusable variables anyway - if he wrote another loop that involved, say, strings, it's a good habit to get into. – Puppy Apr 28 '10 at 11:14
  • It's 2010. There's absolutely no need to worry about reusing variables. In fact, if you're using a modern compiler, and reuse a variable, it may even internally represent the reuse as a new variable. There's also no need to do something that makes code less readable and provides no performance benefit. – sigfpe Apr 28 '10 at 18:11
0

You can replace any occurrence of:

a = b/c
d = e/f

with

icf = 1/(c*f)
a = b*f*icf
d = e*c*icf

if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
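Applied to the two tau estimates in the question's inner loop, that would look something like this (a sketch; it assumes the intermediate products stay comfortably in range for a double):

// Two divisions per pair...
//   tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
//   tau_est_q2 = (r2*r2) / (v2*v2);
// ...become one division plus a few multiplies:
real c   = lengthsq(da) * mij * mij;
real f   = v2 * v2;
real icf = 1 / (c * f);
real tau_est_q1 = r2 * f * icf;
real tau_est_q2 = (r2 * r2) * c * icf;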

You'll get away with fewer time steps if you use other integration schemes (e.g. Runge-Kutta), but I suspect you already know that.

sigfpe