
I would like to optimize the following code.

// 3-tap convolution: each interior output element is a weighted sum
// of the input element and its two neighbors.
void conv(int n, double* output, const double* input, double p1, double p2, double p3) {
    for (int i = 1; i + 1 < n; ++i) {
        output[i] = input[i - 1] * p1 + input[i] * p2 + input[i + 1] * p3;
    }
}

I am on Red Hat Linux, Intel x86.

The code seemed like it would benefit from vectorization, so I tried the AVX instruction set.

#include <immintrin.h>
void conv(int n, double* output, const double* input, double p1, double p2, double p3) {
    // Broadcast each filter tap across a 256-bit vector (4 doubles).
    const __m256d v1 = _mm256_set1_pd(p1);
    const __m256d v2 = _mm256_set1_pd(p2);
    const __m256d v3 = _mm256_set1_pd(p3);
    int i = 0;
    // Each iteration computes output[i + 1 .. i + 4]. The third load
    // reads input[i + 2 .. i + 5], so the bound must keep i + 5 within
    // n - 1 to avoid reading past the end of the input.
    for (; i + 5 < n; i += 4) {
        __m256d acc = _mm256_mul_pd(_mm256_loadu_pd(input + i), v1);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(input + i + 1), v2));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(input + i + 2), v3));
        _mm256_storeu_pd(output + i + 1, acc);
    }

    // Scalar tail loop for the remaining elements.
    for (; i + 2 < n; ++i) {
        output[i + 1] = input[i] * p1 + input[i + 1] * p2 + input[i + 2] * p3;
    }
}

According to callgrind, the instruction count went down significantly. However, there doesn't seem to be much improvement in actual runtime. I have also tried AVX-512, but it actually seems to perform even worse.

My data size n is around 300. For reference, here is a godbolt link to my AVX code: https://godbolt.org/z/MGrW6c4cj
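
In case the benchmarking methodology matters: my timing loop is essentially the following (a minimal sketch rather than my exact harness; the tap values and repetition count here are illustrative). I fill the input with i*i and repeat the call on the same buffer:

#include <chrono>
#include <cstdio>
#include <vector>

void conv(int n, double* output, const double* input, double p1, double p2, double p3);

int main() {
    const int n = 300;                          // matches my data size
    std::vector<double> in(n), out(n, 0.0);
    for (int i = 0; i < n; ++i)
        in[i] = double(i) * i;                  // input = i*i

    const int reps = 1000000;                   // illustrative repetition count
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        conv(n, out.data(), in.data(), 0.25, 0.5, 0.25);  // illustrative tap values
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
    // Print a value derived from the output so the calls are not optimized away.
    std::printf("%.2f ns per call, checksum %f\n", ns, out[n / 2]);
    return 0;
}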

I am open to any suggestions on speeding up my code further. P.S. This is my very first time using AVX.

JEK
  • The godbolt link didn't include an `-O2` flag or something similar; is that just an accident, or is it also representative of how you compiled the code? SIMD code often runs very poorly at `-O0`. – harold Jun 21 '23 at 02:00
  • I have tried `-O2` and `-O3`, but there wasn't much speedup. – JEK Jun 21 '23 at 02:39
  • Semi-related: general material on optimizing 1D convolutions (for FIR filters): https://web.archive.org/web/20230322102700/https://thewolfsound.com/fir-filter-with-simd/ / https://web.archive.org/web/20230322085218/https://thewolfsound.com/data-alignment-in-fir-filter-simd-implementation/ . A case like this, with a filter length of only 3, is maybe a bit different from that guide, which assumes longer filter lengths like 256 or 512. If you weren't getting a speedup from manual vectorization, maybe the compiler was already auto-vectorizing this way with `-O3`? – Peter Cordes Jun 21 '23 at 03:04
  • Do you have FMA available? Instead of `-mavx` I would usually recommend compiling with `-march=native` (or `-march=some_architecture`, naming the minimal architecture you want your binary to support). And of course, always compile with at least `-O2` for performance builds. (A sketch of the FMA version follows these comments.) – chtz Jun 21 '23 at 11:14
  • Also: How did you benchmark your code? Running multiple times on the same data? On different data? Taking the average/minimum/maximum time? How many FLOP/s do you get for the scalar vs. the AVX version? – chtz Jun 21 '23 at 11:17
  • I have used `-march=native`. I tried it on input[i] = i*i, and repeated the run on the same input multiple times. – JEK Jun 21 '23 at 12:59
  • @JEK On my computer with AMD Zen 3 cores, your scalar version takes 221 nanoseconds, your AVX version 74 nanoseconds, which is 3 times faster. Measured with n=300 on random input, best of 10 tests. – Soonts Jul 09 '23 at 09:49
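
For reference, here is a sketch of what chtz's FMA suggestion could look like (an illustrative conv_fma variant, not code from the thread; it assumes a CPU with FMA3 and compiling with at least `-O2` plus `-mfma` or `-march=native`):

#include <immintrin.h>
void conv_fma(int n, double* output, const double* input, double p1, double p2, double p3) {
    const __m256d v1 = _mm256_set1_pd(p1);
    const __m256d v2 = _mm256_set1_pd(p2);
    const __m256d v3 = _mm256_set1_pd(p3);
    int i = 0;
    // Same bounds as the AVX version: the last load touches input[i + 5].
    for (; i + 5 < n; i += 4) {
        __m256d acc = _mm256_mul_pd(_mm256_loadu_pd(input + i), v1);
        // _mm256_fmadd_pd(a, b, c) computes a * b + c in one instruction.
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(input + i + 1), v2, acc);
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(input + i + 2), v3, acc);
        _mm256_storeu_pd(output + i + 1, acc);
    }
    for (; i + 2 < n; ++i) {
        output[i + 1] = input[i] * p1 + input[i + 1] * p2 + input[i + 2] * p3;
    }
}

The fused multiply-adds shave two instructions off each loop iteration; whether that shows up in the measurements at n around 300 is a separate question, since at that size load/store traffic and call overhead can dominate.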

0 Answers