
I have been working on some code and, in order to optimise it, I have been trying to understand the optimisation process by testing how different types of input data affect its performance. A simplified version of my code is as follows.

import java.util.ArrayList;
import java.util.List;

class Pair {
    double x, y;
}

List<Double> res = new ArrayList<>();

void foo(Pair[] pairsData) {
    for (Pair p : pairsData) {
        res.add(bar(p));
    }
}

double bar(Pair p) {
    double minDistance = Double.MAX_VALUE;
    double bestI = -1;                  // index of the smallest d seen so far
    for (int i = 0; i < 4; i++) {
        double d = p.x - p.y - i;
        if (d < minDistance) {          // the branch whose prediction is in question
            minDistance = d;
            bestI = i;
        }
    }
    return bestI;
}

I expected that if all the pairs in pairsData are identical, performance would improve significantly: the CPU's branch predictor could achieve a perfect success rate, so there would be no branch mispredictions. But I have found that, especially for small amounts of data (50-100 pairs), increasing the percentage of identical datapoints has a negligible effect. I have considered that this may be because the branch predictor cannot learn anything useful from so little data, or because there simply isn't enough work in the branch body (just two assignments) for misprediction to have a big effect over only a few iterations.
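
For reference, here is a minimal sketch of the input-generation scheme I describe in my comment below. The helper name makePairs, the use of java.util.Random, and the shuffled placement of the duplicates are illustrative assumptions, not my exact harness:

import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

// Build n pairs where `fraction` of the entries are copies of one randomly
// chosen pair and the rest are independent random pairs.
Pair[] makePairs(int n, double fraction, Random rng) {
    Pair[] data = new Pair[n];
    for (int i = 0; i < n; i++) {
        Pair p = new Pair();
        p.x = rng.nextDouble();
        p.y = rng.nextDouble();
        data[i] = p;
    }
    Pair dup = data[rng.nextInt(n)];      // the pair to duplicate
    int copies = (int) (n * fraction);    // fraction = 0.0, 0.25, 0.5, 0.75, 1.0
    for (int i = 0; i < copies; i++) {
        data[i] = dup;
    }
    // Spread the duplicates randomly; their placement is not specified above.
    Collections.shuffle(Arrays.asList(data), rng);
    return data;
}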

However, for extremely large data sizes (around 100,000 pairs), performance appears to improve and then worsen, peaking when 50% of the pairs are identical. How is this possible? Surely the higher the percentage of identical pairs, the less branch misprediction there is and the fewer times minDistance must be updated.

Amelia
• What compiler? What language is this? What CPU / ISA are you compiling for? How do you generate the input data? Is there some pattern in the non-identical pairs that lets branch prediction still work? Some pairs would run the `if` body multiple times, so with good prediction, throughput can also depend on `d < minDistance` only being true the first time. (Or not at all if either double is `NaN`.) – Peter Cordes Feb 18 '21 at 00:50
• BTW, this would optimize reasonably well with branchless SIMD: broadcast `p.x - p.y` to all 4 elements of an AVX vector, and subtract a `[0, 1, 2, 3]` vector. Shuffle and 2x `_mm256_min_pd` to get a vector with every element = horizontal min. Then `_mm256_cmp_pd(v, mins, _CMP_EQ_OQ)` and `_mm256_movemask_pd` to get a bitmap of which element(s) are equal to the min, and `__builtin_ctz` that to bit-scan for the position of the first match. (A sketch of this idea follows these comments.) – Peter Cordes Feb 18 '21 at 00:54
• @PeterCordes The language is Java using the javac compiler, and the input data is made up of randomly generated points with one of the points (chosen at random) being duplicated so that it fills up 0%, 25%, 50%, 75%, and finally 100% of the data (each benchmarked separately). – Amelia Feb 18 '21 at 01:06
• @PeterCordes I adapted the code to ensure `d < minDistance` is only true the first time for the duplicate values and random among the others. This results in the performance worsening as the percentage of identical points increases. – Amelia Feb 18 '21 at 01:27
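
The following is a sketch of the branchless idea from the SIMD comment above, expressed with Java's incubating Vector API (jdk.incubator.vector, JDK 16+, run with --add-modules jdk.incubator.vector) rather than AVX intrinsics. The translation is an assumption, not code from the thread:

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256;
static final DoubleVector OFFSETS =
        DoubleVector.fromArray(SPECIES, new double[] {0, 1, 2, 3}, 0);

// Branchless bar(): compute all four candidate values of d at once,
// reduce to the minimum, then take the index of the first lane equal
// to that minimum. (NaN handling differs from the scalar version.)
static double barBranchless(Pair p) {
    DoubleVector d = DoubleVector.broadcast(SPECIES, p.x - p.y).sub(OFFSETS);
    double min = d.reduceLanes(VectorOperators.MIN);
    return d.compare(VectorOperators.EQ, min).firstTrue();
}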

0 Answers