I would like to optimize the following code.
void conv(int n, double* output, const double* input, double p1, double p2, double p3) {
    for (int i = 1; i + 1 < n; ++i) {
        output[i] = input[i - 1] * p1 + input[i] * p2 + input[i + 1] * p3;
    }
}
I am on Red Hat Linux, on Intel x86.
The code looked like a good candidate for vectorization, so I tried the AVX instruction set.
#include <immintrin.h>

void conv(int n, double* output, const double* input, double p1, double p2, double p3) {
    const __m256d v1 = _mm256_set1_pd(p1);
    const __m256d v2 = _mm256_set1_pd(p2);
    const __m256d v3 = _mm256_set1_pd(p3);
    int i = 0;
    /* i + 5 < n keeps the widest load, input[i + 5], in bounds and
       avoids writing output[n - 1], which the scalar version never touches */
    for (; i + 5 < n; i += 4) {
        __m256d acc = _mm256_mul_pd(_mm256_loadu_pd(input + i), v1);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(input + i + 1), v2));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(input + i + 2), v3));
        _mm256_storeu_pd(output + i + 1, acc);
    }
    for (; i + 2 < n; ++i) {
        output[i + 1] = input[i] * p1 + input[i + 1] * p2 + input[i + 2] * p3;
    }
}
According to callgrind, the instruction count went down significantly. However, there doesn't seem to be much improvement in actual runtime. I have also tried AVX-512, but it actually seems to perform even worse.
My data size n is around 300. For reference, here is a godbolt link to my AVX code: https://godbolt.org/z/MGrW6c4cj
I am open to any suggestions on speeding up my code further. P.S. This is my very first time using AVX.