For macOS (as clarified in a comment), the solution is easy, at least for a single add operation. Insert #include <Accelerate/Accelerate.h> in the code, add the Accelerate framework to your project, and change the loop to a single call to vDSP_vadd(sum, 1, array1, 1, array2, 1, size). That uses a high-performance vectorized routine that Apple customizes for each platform it supports.
(The 1 parameters are the strides through the arrays, in units of elements. A stride of 1 means to process each element and is the best case for performance.)
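As a minimal sketch of the call in context, the helper below wraps vDSP_vadd with unit strides. The names add_arrays, a, b, and out are hypothetical and not from the question, and the plain loop in the #else branch is only there so the sketch also compiles off macOS. In the documented signature the two inputs come first and the destination is the third pointer, so it is worth checking the argument order against your own loop.

```c
#include <stddef.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
static void add_arrays(const float *a, const float *b, float *out, size_t n)
{
#ifdef __APPLE__
    /* Two inputs first, destination last; every stride is 1 (contiguous). */
    vDSP_vadd(a, 1, b, 1, out, 1, (vDSP_Length)n);
#else
    /* Portable fallback (an assumption for illustration, not vectorized by hand). */
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

On macOS, build with -framework Accelerate (or add the framework in Xcode, as described above).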
Since you are working with up to 524,288 elements, you should also consider how your application interacts with cache memory. Designing for high performance cannot be done in isolation, looking only at each routine individually.
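To make the cache point concrete: 524,288 floats at 4 bytes each is 2 MiB per array, so three such arrays occupy 6 MiB, which may not fit in the caches closest to the core. One common way to account for this is to run several steps on one block of data before moving to the next block, instead of making a full pass over the arrays for each step. The sketch below assumes a hypothetical second step (scaling the sum) and a hypothetical block size; neither comes from the question, and a good block size has to be measured on the target machine.

```c
#include <stddef.h>

/* Hypothetical block size in elements; tune by measurement. */
#define BLOCK ((size_t)4096)

/* Hypothetical two-step pipeline: out[i] = (a[i] + b[i]) * scale.
   Each block is added and then scaled before moving on, so the second
   step reads data the first step has just written. */
static void add_then_scale_blocked(const float *a, const float *b,
                                   float *out, size_t n, float scale)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t remaining = n - start;
        size_t len = remaining < BLOCK ? remaining : BLOCK;

        /* Step 1: add the two inputs for this block. */
        for (size_t i = 0; i < len; ++i)
            out[start + i] = a[start + i] + b[start + i];

        /* Step 2: scale the same block while it is likely still cached. */
        for (size_t i = 0; i < len; ++i)
            out[start + i] *= scale;
    }
}
```

Each inner loop could be replaced by the corresponding vectorized library call on the block; the point of the sketch is only the order in which memory is visited.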