
This is admittedly an open-ended/subjective question but I am looking for different ideas on how to "organize" multiple alternative implementations of the same functions.

I have a set of several functions that each have platform-specific implementations. Specifically, they each have a different implementation for a particular SIMD type: NEON (64-bit), NEON (128-bit), SSE3, AVX2, etc (and one non-SIMD implementation).

All functions have a non-SIMD implementation. Not all functions are specialized for each SIMD type.

Currently, I have one monolithic file that uses a mess of #ifdefs to implement the particular SIMD specializations. It worked when we were only specializing a few of the functions to one or two SIMD types. Now, it's become unwieldy.

Effectively, I need something that functions like a virtual/override. The non-SIMD implementations are implemented in a base class and SIMD specializations (if any) would override them. But I don't want actual runtime polymorphism. This code is performance critical and many of the functions can (and should) be inlined.

Something along these lines would accomplish what I need (which is still a mess of #ifdefs).

// functions.h

void function1();
void function2();

#ifdef __ARM_NEON
#include "functions_neon64.h"
#elif defined(__SSE3__)
#include "functions_sse3.h"
#endif

#include "functions_unoptimized.h"
// functions_neon64.h
#ifndef FUNCTION1_IMPL
#define FUNCTION1_IMPL
void function1() {
  // NEON64 implementation
}
#endif
// functions_sse3.h
#ifndef FUNCTION2_IMPL
#define FUNCTION2_IMPL
void function2() {
  // SSE3 implementation
}
#endif
// functions_unoptimized.h
#ifndef FUNCTION1_IMPL
#define FUNCTION1_IMPL
void function1() {
  // Non-SIMD implementation
}
#endif

#ifndef FUNCTION2_IMPL
#define FUNCTION2_IMPL
void function2() {
  // Non-SIMD implementation
}
#endif

Anyone have any better ideas?

Peter Cordes
Matthew M.
    Your code following *Something along these lines* is what I would do. Have one header file for each SIMD specialization, and then have one header file that dispatches to include the correct SIMD version you need. You shouldn't need the `FUNCTION1_IMPL` include guards using that approach. – NathanOliver Jan 03 '22 at 18:18
  • The reason for FUNCTION1_IMPL is so the non-SIMD implementations get compiled for any functions without SIMD specialization. Essentially, I need to give the SIMD specialization "an opportunity" to implement the functions. If it doesn't, I need to "fall-back" to the non-SIMD implementations. – Matthew M. Jan 03 '22 at 18:28
  • Ah yeah, that makes sense. Another option would be to use tag dispatch. You would have one "main function" for each function, and that one would dispatch to the correct overload by using the appropriate tag. Not sure if that would be any easier to implement than what you already have. – NathanOliver Jan 03 '22 at 18:33

1 Answer

The following are just some ideas I came up with while thinking about it - there might be better solutions that I'm not aware of.


1. Tag-Dispatch

Using Tag-Dispatch you can define an order in which the functions should be considered by the compiler, e.g. in this case it's

AVX2 -> SSE3 -> Neon128 -> Neon64 -> None

The first implementation that's present in this chain will be used: godbolt example

/**********************************
 ** functions.h *******************
 *********************************/
#include <iostream> // for the tracing output in these examples

struct SIMD_None_t {};
struct SIMD_Neon64_t : SIMD_None_t {};
struct SIMD_Neon128_t : SIMD_Neon64_t {};
struct SIMD_SSE3_t : SIMD_Neon128_t {};
struct SIMD_AVX2_t : SIMD_SSE3_t {};
struct SIMD_Any_t : SIMD_AVX2_t  {};

#include "functions_unoptimized.h"

#ifdef __ARM_NEON
#include "functions_neon64.h"
#endif

#ifdef __SSE3__
#include "functions_sse3.h"
#endif

// etc...

#include "functions_stubs.h"



/**********************************
 ** functions_unoptimized.h *******
 *********************************/
inline int add(int a, int b, SIMD_None_t) {
    std::cout << "NONE" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon64.h ************
 *********************************/
inline int add(int a, int b, SIMD_Neon64_t) {
    std::cout << "NEON!" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon128.h ***********
 *********************************/
inline int add(int a, int b, SIMD_Neon128_t) {
    std::cout << "NEON128!" << std::endl;
    return a + b;
}

/**********************************
 ** functions_stubs.h ************* 
 *********************************/
inline int add(int a, int b) {
    return add(a, b, SIMD_Any_t{});
}

/**********************************
 ** main.cpp **********************
 *********************************/
#include "functions.h"

int main() {
    add(1, 2);
}

This would output NEON128!, since that's the best match in this case.

Upsides:

  • no #ifdef's needed in the implementation header files
  • callers don't need to be modified

Downsides:

  • You'll need to add an extra argument to each implementation
  • A dispatch-function is required to supply the extra argument
    (You could theoretically get rid of this function by adding , SIMD_Any_t{} everywhere you call the function, but that's a lot of work)

2. Put the functions into classes and use name lookup to pick the right function

e.g.:

struct None { inline static int add(int a, int b) { return a + b; } };
struct Neon64 : None { inline static int add(int a, int b) { return a + b; } };
struct Neon128 : Neon64 {};

struct SIMD : Neon128 {};

// Usage:
int r = SIMD::add(1, 2);

Because child classes can hide members of their base classes, this is not ambiguous. (It's always the most-derived class that implements the given method that will be called, so you can order your implementations.)

For your example it could look like this: godbolt example


#include <iostream>

/**********************************
 ** functions.h *******************
 *********************************/

#include "functions_unoptimized.h"

#ifdef __ARM_NEON
#include "functions_neon64.h"
#else
  struct SIMD_Neon64 : SIMD_None {};
#endif

#ifdef __ARM_NEON_128
#include "functions_neon128.h"
#else
  struct SIMD_Neon128 : SIMD_Neon64 {};
#endif

// etc...

struct SIMD : SIMD_Neon128 {};


/**********************************
 ** functions_unoptimized.h *******
 *********************************/
struct SIMD_None {
    inline static int sub(int a, int b) {
        std::cout << "NONE" << std::endl;
        return a - b;
    }
};

/**********************************
 ** functions_neon64.h ************
 *********************************/
struct SIMD_Neon64 : SIMD_None {
    inline static int sub(int a, int b) {
        std::cout << "Neon64" << std::endl;
        return a - b;
    }
};

/**********************************
 ** functions_neon128.h ***********
 *********************************/
struct SIMD_Neon128 : SIMD_Neon64 {
    inline static int sub(int a, int b) {
        std::cout << "Neon128" << std::endl;
        return a - b;
    }
};


/**********************************
 ** main.cpp **********************
 *********************************/
#include "functions.h"

int main() {
    SIMD::sub(2, 3);
}

This would output Neon128.

Upsides:

  • No #ifdef's needed in the implementation header files
  • No dispatch function required, the compiler will automatically pick the best one
  • No extra function parameters required

Downsides:

  • You need to change all calls to the functions & prefix them with SIMD::
  • You need to wrap all the functions inside structs & use inheritance, so it's a bit more involved

3. Using template specializations

If you have an enum of all possible SIMD implementations, e.g.:

enum class SIMD_Type {
    Min, // Dummy Value -> No Implementation found

    None,
    Neon64,
    Neon128,
    SSE3,
    AVX2,

    Max // Dummy Value -> Search downwards from here
};

You can use it to (recursively) walk through them until you find one that has been specialized, e.g.:

template<SIMD_Type type = SIMD_Type::Max>
inline int add(int a, int b) {
    constexpr SIMD_Type nextType = static_cast<SIMD_Type>(static_cast<int>(type) - 1);
    return add<nextType>(a, b);
}

template<>
inline int add<SIMD_Type::Neon64>(int a, int b) {
    std::cout << "NEON!" << std::endl;
    return a + b;
}

Here a call to add(1, 2) would first call add<SIMD_Type::Max>, which in turn would call add<SIMD_Type::AVX2>, add<SIMD_Type::SSE3>, add<SIMD_Type::Neon128>, and then the call to add<SIMD_Type::Neon64> would reach the specialization, so the recursion stops there.

If you want to make this a bit safer (and prevent long template instantiation chains) you can additionally add one deleted specialization for each function, which stops the recursion and turns a missing implementation into a clear compile-time error, e.g.: godbolt example

template<>
inline int add<SIMD_Type::Min>(int a, int b) = delete; // error if called: no implementation found

In your case it could look like this:

#include <iostream>

/**********************************
 ** functions.h *******************
 *********************************/
enum class SIMD_Type {
    Min, // Dummy Value -> No Implementation found

    None,
    Neon64,
    Neon128,
    SSE3,
    AVX2,

    Max // Dummy Value -> Search downwards from here
};

#include "functions_stubs.h"

#include "functions_unoptimized.h"

#ifdef __ARM_NEON
#include "functions_neon64.h"
#endif

#ifdef __SSE3__
#include "functions_sse3.h"
#endif

// etc...

/**********************************
 ** functions_stubs.h *************
 *********************************/
template<SIMD_Type type = SIMD_Type::Max>
inline int add(int a, int b) {
    constexpr SIMD_Type nextType = static_cast<SIMD_Type>(static_cast<int>(type) - 1);
    return add<nextType>(a, b);
}

template<>
inline int add<SIMD_Type::Min>(int a, int b) = delete; // error if called: no implementation found

/**********************************
 ** functions_unoptimized.h *******
 *********************************/
template<>
inline int add<SIMD_Type::None>(int a, int b) {
    std::cout << "NONE" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon64.h ************
 *********************************/
template<>
inline int add<SIMD_Type::Neon64>(int a, int b) {
    std::cout << "NEON!" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon128.h ***********
 *********************************/
template<>
inline int add<SIMD_Type::Neon128>(int a, int b) {
    std::cout << "NEON128!" << std::endl;
    return a + b;
}

/**********************************
 ** main.cpp **********************
 *********************************/
#include "functions.h"

int main() {
    add(1, 2);
}

would output NEON128!.

Upsides:

  • no #ifdef's needed in the implementation header files
  • callers don't need to be modified

Downsides:

  • Needs an extra dispatch function that recursively calls itself (until it hits a specialization)
  • The compiler might not optimize away all recursive calls (although most compilers probably will).
    Most compilers also offer a way to force inlining for certain functions (__attribute__((always_inline)) / __forceinline), which you could add to the function base templates to make sure all recursive calls actually get inlined.
  • Optionally needs another function to stop recursive instantiation (not strictly required, compilers will stop recursive instantiation at some point)

4. One file per function

This is by far the easiest option - just put each function (or a collection of similar functions) into its own file and do the #ifdef's there.

That way you have a function and all of its SIMD specializations in a single file, which should also make editing a lot easier.

e.g.:

/**********************************
 ** functions.h *******************
 *********************************/

#include "functions_add.h"
#include "functions_sub.h"
// etc...

/**********************************
 ** functions_add.h ***************
 *********************************/
#ifdef __SSE3__
// SSE3
int add(int a, int b) {
  return a + b;
}
#elif defined(__ARM_NEON)
// NEON
int add(int a, int b) {
  return a + b;
}
#else
// Fallback
int add(int a, int b) {
  return a + b;
}
#endif

/**********************************
 ** functions_sub.h ***************
 *********************************/
#ifdef __SSE3__
// SSE3
int sub(int a, int b) {
  return a - b;
}
#elif defined(__ARM_NEON_128)
// NEON 128
int sub(int a, int b) {
  return a - b;
}
#else
// Fallback
int sub(int a, int b) {
  return a - b;
}
#endif

Upsides:

  • The function & all of its specializations are in a single file, so figuring out which one gets called is a lot easier
  • Easy to implement & maintain as long as you don't stuff too many functions into a single file

Downsides:

  • Potentially lots of header files
  • #ifdef's need to be repeated in each header
Turtlefight
  • Your tag-dispatch example compiles all the functions in one build. But that doesn't work for functions that use target-specific intrinsics / functions. For example, ARM C implementations won't have a `#include ` (Intel intrinsics), so the AVX2 and SSSE3 functions won't build; `__m256i foo` will be an undefined type, and `_mm256_loadu_si256()` / `_mm256_add_epi32()` will be undefined functions. – Peter Cordes Jan 03 '22 at 22:47
  • @PeterCordes there are `#ifdef` guards for the header inclusion, so the relevant headers should not be included unless support for the given SIMD variant is present. (i.e. `#ifdef __ARM_NEON #include "functions_neon64.h" #endif`) – Turtlefight Jan 03 '22 at 22:48
  • Your example `int add(int a, int b, SIMD_Neon64_t)` just prints output using portable code that can compile for x86. But the point of all this is to use it for functions with code that depends on those headers you didn't include, like `int32x4_t av = vld1q_s32(&(a[i]));` as in [ARM Neon intrinsics, addition of two vectors](https://stackoverflow.com/q/69792190). You *need* conditional compilation for the actual manually-vectorized SIMD implementations using intrinsics. (And if they're small, you'd like them to be able to inline, although link-time optimization may be acceptable for that.) – Peter Cordes Jan 03 '22 at 23:00
  • @PeterCordes that's what I was trying to say - the code blob above needs to be read as individual files (I've just put it all into one code block to not completely bloat my way too long post) - you can use `#include ` inside `functions_neon64.h`, since `functions_neon64.h` only gets included from `functions.h` if `__ARM_NEON` is defined. (end users of these functions would only include `functions.h`) - If you're concerned about the arguments, the OP's current solution would have the same problem, so I assume they already use some type for the parameters that is portable. – Turtlefight Jan 03 '22 at 23:17
  • 1
    Ok, that would work. Your `// main.c` is missing a `#include "functions.h"`, and it might be more clear if you made the file-start comments look like more of a separator, like `/********* main.c ************/`. I was thinking that the comments were something to do with where a prototype was already declared. (In hindsight that doesn't really make sense, but I hadn't thought of this code block being multiple separate files.) – Peter Cordes Jan 03 '22 at 23:22
  • 1
    BTW: The SIMD intrinsics are very suitable for emulating on a general platform (with normal instead of vector operations). This helps with comparing (functionality, not performance). You just have to replace the header with one, where you define your own (class) types (with small arrays as member variables) and functions. Those would run on any system: x86, ARM, etc. – Sebastian Jan 03 '22 at 23:24
  • 1
    You could combine all the x86 SIMD versions into one `.h` so they can share the same `` include, and share helper functions if needed. Within one target ISA with different feature-levels of the same SIMD extension, the `#ifdef`s are a lot less messy. – Peter Cordes Jan 03 '22 at 23:27
  • 1
    Excellent answer. Thanks for taking the time. Tag-dispatch looks good to me. – Matthew M. Jan 04 '22 at 14:35
  • All of those solutions are compatible with a tree relationship of the architectures (e.g. root = unoptimized, ARM branch, Intel branch). The template specializations one (3.) would have to be adapted for that logic. Can the tag dispatch solution put the argument type as template parameter instead? – Sebastian Jan 04 '22 at 14:48