14

My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root I use the squared distance throughout. But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster. I now have:

double distance_squared(double *a, double *b)
{
  double dx = a[0] - b[0];
  double dy = a[1] - b[1];
  double dz = a[2] - b[2];

  return dx*dx + dy*dy + dz*dz;
}

I also tried using a macro to avoid the function call but it doesn't help much.

#define DISTANCE_SQUARED(a, b) (((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2]))

I thought about using SIMD instructions but could not find a good example or a complete list of instructions (ideally some multiply+add on two vectors).

GPUs are not an option since only one set of points is known at each function call.

What would be the fastest way to compute the distance squared?

Pim Schellart
  • If your points are stationary, precomputing (all?) distance pairs might be beneficial. – schot Nov 10 '11 at 10:43
  • You have to optimize somewhere else.. this should already be optimal (as David says). Maybe you can provide a broader view of your problem; maybe you don't have to recalculate everything, or can use other tricks. – duedl0r Nov 10 '11 at 10:45
  • What are you trying to compute overall? This is obviously the bottom of some algorithm, but are you trying to find the closest two points in a set of n, or something else? Maybe if you describe what you're actually trying to compute, there could be an alternative algorithm rather than brute force. – Martin York Nov 10 '11 at 18:25
  • The points are not stationary so precomputing is not possible. I am already considering other algorithms. This is not part of a brute force code (kdtree implementation) but still needs to be computed often enough so that optimizing it would help. I was just wondering what would be the fastest way to do it in general since I could not find any information on the subject. – Pim Schellart Nov 13 '11 at 10:49

7 Answers

11

A good compiler will optimize that about as well as you will ever manage. A good compiler will use SIMD instructions if it deems that they are going to be beneficial. Make sure that you turn on all such possible optimizations for your compiler. Unfortunately, vectors of dimension 3 don't tend to sit well with SIMD units.

I suspect that you will simply have to accept that the code produced by the compiler is probably pretty close to optimal and that no significant gains can be made.

David Heffernan
  • Exactly. E.g. with gcc and `-O3 -march=native` this should usually produce code that is not too bad. Check the assembler with `-S` before you invest a lot of time in optimizing. – Jens Gustedt Nov 10 '11 at 11:07
8

The first obvious thing would be to use the restrict keyword.

As it is now, a and b can alias (and thus, from the compiler's point of view, which must assume the worst case, they do alias). No compiler will auto-vectorize this, as it would be wrong to do so.

Worse, not only can the compiler not vectorize such code; if you also stored through one of the pointers (luckily not in your example), it would have to re-load values each time. Always be clear about aliasing, as it greatly impacts the compiler.

Next, if you can live with that, use float instead of double and pad to 4 floats even if one is unused, this is a more "natural" data layout for the majority of CPUs (this is somewhat platform specific, but 4 floats is a good guess for most platforms -- 3 doubles, a.k.a. 1.5 SIMD registers on "typical" CPUs, is not optimal anywhere).
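A minimal sketch of that layout (the type and function names here are illustrative, not from the question):

```c
/* Point padded to 4 floats so it fills one 128-bit SIMD register;
   the fourth component is unused padding. */
typedef struct { float v[4]; } vec4;

/* restrict promises the compiler that a and b do not alias. */
static inline float distance_squared4(const vec4 *restrict a,
                                      const vec4 *restrict b)
{
    float dx = a->v[0] - b->v[0];
    float dy = a->v[1] - b->v[1];
    float dz = a->v[2] - b->v[2];
    return dx * dx + dy * dy + dz * dz;
}
```

With this layout and `restrict`, an auto-vectorizer has everything it needs to use packed instructions if it finds them profitable.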

(For a hand-written SIMD implementation (which is harder than you think), first of all be sure to have aligned data. Next, look into what latencies your instructions have on the target machine and do the longest ones first. For example, on pre-Prescott Intel it makes sense to first shuffle each component into a register and then multiply it with itself, even though that uses 3 multiplies instead of one, because shuffles have a long latency. On the later models, a shuffle takes a single cycle, so that would be a total anti-optimization.
Which again shows that leaving it to the compiler is not such a bad idea.)

Damon
  • Why will aliasing prevent vectorization? – avakar Nov 10 '11 at 11:38
  • I don't see how `restrict` will help here - as far as I can tell, nothing in the function body will benefit from additional aliasing information; if you're aggressive about code annotation, you should mark the parameters `const double *restrict` as they are never modified... – Christoph Nov 10 '11 at 11:56
  • Vectorization means reading and writing several independent values in one go, and processing them in one operation. The compiler can only do that on independent data if there are no other pointers/references, or if it can prove without any doubt that no other pointers/references access any of the data during the same scope (this is sometimes possible), or if the programmer explicitly _promises_ (`restrict` keyword) that he has thought of that and _it won't happen_. In any other case, it's unsafe, and a compiler will generally refuse to do something that _might_ give incorrect results. – Damon Nov 10 '11 at 13:05
  • As Christoph correctly pointed out, the data is never modified in this function, which admittedly a clever compiler might indeed figure out and be able to prove that it's safe to vectorize (if nothing else prevents it), but I would not wager my right hand for that. In any case, declaring a pointer that is not aliased as `restrict` is correct (as is declaring a constant pointer `const`). Being precise on declarations is not just cosmetic, it is a kind of "code correctness", too. – Damon Nov 10 '11 at 13:10
  • You need a clever compiler anyway to benefit from `restrict`. That keyword works by eliminating certain barriers to optimization, barriers that just don't exist here in the first place. – MSalters Nov 10 '11 at 14:00
4

The SIMD code to do this (using SSE3):

movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0

but you need four-value vectors (x,y,z,0) for this to work. If you've only got three values then you'd need to do a bit of fiddling about to get the required format, which would cancel out any benefit of the above.

In general though, due to the superscalar pipelined architecture of the CPU, the best way to get performance is to do the same operation on lots of data; that way you can interleave the various steps and do a bit of loop unrolling to avoid pipeline stalls. The above code will definitely stall on the last three instructions, based on the "can't use a value directly after it's modified" principle - each instruction has to wait for the result of the previous one to complete, which isn't good in a pipelined system.

Doing the calculation on two or more different sets of points at the same time can remove the above bottleneck - whilst waiting for the result of one computation, you can start the computation of the next point:

movaps xmm0,a1
                  movaps xmm2,a2
movaps xmm1,b1
                  movaps xmm3,b2
subps xmm0,xmm1
                  subps xmm2,xmm3
mulps xmm0,xmm0
                  mulps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2
Skizz
  • Could you add C code using SSE to do the first listing, assuming the fourth value of each 3D point is zero? Thank you. – Royi Feb 06 '17 at 19:36
3

If you would like to optimize something, first profile the code and inspect the assembler output.

After compiling it with gcc -O3 (4.6.1) we get this nice disassembled output using SSE instructions:

movsd   (%rdi), %xmm0
movsd   8(%rdi), %xmm2
subsd   (%rsi), %xmm0
movsd   16(%rdi), %xmm1
subsd   8(%rsi), %xmm2
subsd   16(%rsi), %xmm1
mulsd   %xmm0, %xmm0
mulsd   %xmm2, %xmm2
mulsd   %xmm1, %xmm1
addsd   %xmm2, %xmm0
addsd   %xmm1, %xmm0
tensai_cirno
  • Although it uses SSE2 instructions, this is not SIMD code. All the instructions operate on single values: mulsd means "Scalar Double-Precision Floating-Point Multiply"; the multiple-data version is mulpd, "Packed Double-Precision Floating-Point Multiply". – Giacomo Verticale Nov 10 '11 at 17:11
  • I have checked, and this is also what I get. So the question becomes: "How can I write the code such that gcc will compile it into SIMD code, or do I need to write that by hand, and if so how?" – Pim Schellart Nov 13 '11 at 10:50
1

This type of problem often occurs in MD simulations. Usually the number of calculations is reduced by cutoffs and neighbor lists. The actual calculation of the squared distance, however, is done exactly as in your question (with compiler optimizations and a fixed type float[3]).

So if you want to reduce the number of squared-distance calculations, you should tell us more about the problem.
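As an illustration of the cutoff idea (all names here are hypothetical), the cutoff can be squared once and compared against the squared distance directly, so no square root is ever taken:

```c
#include <stddef.h>

/* Count the points in pts within `cutoff` of p, comparing squared
   distances against the squared cutoff. */
static size_t count_within_cutoff(const double (*pts)[3], size_t n,
                                  const double p[3], double cutoff)
{
    const double c2 = cutoff * cutoff;  /* squared once, outside the loop */
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        const double dx = pts[i][0] - p[0];
        const double dy = pts[i][1] - p[1];
        const double dz = pts[i][2] - p[2];
        if (dx * dx + dy * dy + dz * dz <= c2)
            count++;
    }
    return count;
}
```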

Bort
0

Perhaps passing the 6 doubles directly as arguments could make it faster (because it avoids the array dereferences):

inline double distsquare_coord(double xa, double ya, double za,
                               double xb, double yb, double zb)
{
  double dx = xa-xb; double dy = ya-yb; double dz = za-zb;
  return dx*dx + dy*dy + dz*dz;
}

Or perhaps, if you have many points in the vicinity, you might compute a distance (to the same fixed other point) by linear approximation of the distances of other near points.

Basile Starynkevitch
0

If you can rearrange your data to process two pairs of input vectors at once, you may use this code (SSE2 only):

// @brief Computes two squared distances between two pairs of 3D vectors
// @param a
//  Pointer to the first pair of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param b
//  Pointer to the second pairs of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (b + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param c
//  Pointer to the output 2 element array.
//  Must be aligned by 16 (2 doubles).
//  The two distances between a and b vectors will be written to c[0] and c[1] respectively.
void dist_sq_x2(const double * __restrict__ a, const double * __restrict__ b, double * __restrict__ c) {
    // diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
    __m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
    // diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
    __m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
    // diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
    __m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));

    // prod0 = ( d0.y * d0.y, d0.x * d0.x )
    __m128d prod0 = _mm_mul_pd(diff0, diff0);
    // prod1 = ( d1.x * d1.x, d0.z * d0.z )
    __m128d prod1 = _mm_mul_pd(diff1, diff1);
    // prod2 = ( d1.z * d1.z, d1.y * d1.y )
    __m128d prod2 = _mm_mul_pd(diff2, diff2);

    // _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
    // psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
    __m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);

    // _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
    // dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
    __m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);

    _mm_store_pd(c, dotprod);
}
Marat Dukhan