3

My macOS audio application uses a for loop to add two float arrays. Is there a more efficient way when size is huge?

const int size = 5;
float array1[] = {0.0, 0.1, 0.2, 0.3, 0.4};
float array2[] = {0.5, 0.6, 0.7, 0.8, 0.9};
float sum[size];

for (int i = 0; i < size; i++)
{
    sum[i] = array1[i] + array2[i];
}
Devin Roth
  • 43
  • 5
  • You have a very strange array indexing, you know? – bipll Sep 29 '19 at 06:23
  • Oops you're completely right. I should have edited it. In my application I'm dealing with pointers and memcpy which is why I have it that way. I'll edit it. – Devin Roth Sep 29 '19 at 06:28
  • 3
    [Here's a discussion on how to optimize this kind of code using SIMD.](https://stackoverflow.com/questions/47405717/dot-product-of-vectors-with-simd) Bottom line: you just need to use the right compiler options, and let the optimizer do the work. – user3386109 Sep 29 '19 at 06:46
  • 2
    Optimizations like this are greatly subject to circumstances. What processor are you targeting? What else is the program doing around this—is it doing other operations on some of this same data? How large are the arrays? Modern processors commonly have features for fast additions of arrays, but how and whether to use them is situation-dependent. Are non-portable solutions acceptable, such as solutions that use Intel AVX instructions? Is using compiler extensions okay? Will your compiler automatically vectorize the loop? – Eric Postpischil Sep 29 '19 at 11:13
  • @EricPostpischil This is specifically for macOS. So Intel Core i5 and i7s. Array sizes are up to 524,288. The only operation I'm performing is the sum. I'm not sure about the other stuff because I've never gone this deep before. – Devin Roth Sep 30 '19 at 00:25

3 Answers

3

The most significant trick: if these arrays are in fact pointers that you pass into a function, be sure to restrict-qualify the sum pointer, assuming it really does point to an array that is independent of the other two:

void do_sum(size_t size,
            float * restrict sum,
            float * array1,
            float * array2);

or, with size hints:

void do_sum(size_t size,
            float sum[restrict static size],
            float array1[static size],
            float array2[static size]);

This lets the compiler generate much more efficient code, because it guarantees that neither array1[n] nor array2[n] can refer to the same memory as sum[k] for any n or k used in the function.

See the difference yourself at Godbolt: with restrict and without
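As a sketch (the body here is just the question's loop dropped into the restrict-qualified signature above; you still need optimization enabled, e.g. -O2 or -O3, for the compiler to actually vectorize it):

#include <stddef.h>

/* Because sum is restrict-qualified, the compiler may assume it does not
   alias array1 or array2, so it is free to vectorize this loop. */
void do_sum(size_t size,
            float * restrict sum,
            float * array1,
            float * array2)
{
    for (size_t i = 0; i < size; i++)
    {
        sum[i] = array1[i] + array2[i];
    }
}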

2

If this runs on hardware drawing more than a few watts (anything other than a ten-year-old phone with no built-in FPU and no compiler that knows how to exploit the CPU's exotic instructions), efficiency will be dominated by memory caching and bus bandwidth, so clever C tricks won't matter. The only meaningful speedup is to overwrite one of the arrays:
... array1[i] += array2[i];
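A sketch of that in-place version as a function (accumulate_into is just a made-up name here, and const/restrict are added on the same reasoning as the other answer):

#include <stddef.h>

/* array1[i] += array2[i]: the result overwrites array1, so only two
   buffers stream through the cache instead of three. */
void accumulate_into(size_t size,
                     float * restrict array1,
                     const float * restrict array2)
{
    for (size_t i = 0; i < size; i++)
    {
        array1[i] += array2[i];
    }
}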

Camille Goudeseune
  • 2,934
  • 2
  • 35
  • 56
1

For macOS (as clarified in a comment), the solution is easy, at least for a single add operation. Insert #include <Accelerate/Accelerate.h> in the code, add the Accelerate framework to your project, and change the loop to a single call to vDSP_vadd(array1, 1, array2, 1, sum, 1, size);. That uses a high-performance vectorized routine that Apple customizes for each platform it supports.

(The 1 parameters are the strides through the arrays, in units of elements. 1 means to process each element and is the best case for performance.)
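A minimal self-contained sketch of the call, assuming the target links against the Accelerate framework (for a command-line test, something like clang main.c -framework Accelerate):

#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void)
{
    float array1[] = {0.0f, 0.1f, 0.2f, 0.3f, 0.4f};
    float array2[] = {0.5f, 0.6f, 0.7f, 0.8f, 0.9f};
    float sum[5];

    /* sum[i] = array1[i] + array2[i]: inputs first, output next, then the element count. */
    vDSP_vadd(array1, 1, array2, 1, sum, 1, 5);

    for (int i = 0; i < 5; i++)
        printf("%f\n", sum[i]);

    return 0;
}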

Since you are working with up to 524,288 elements, you should also consider how your application interacts with cache memory. Designing for high performance cannot be done in isolation, looking only at each routine individually.

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312