2

This is the code I actually had (scalar), which I've replicated (x4), storing the data for SIMD:

waveTable *waveTables[4];
for (int i = 0; i < 4; i++) {
    int waveTableIindex = 0;
    while ((phaseIncrement[i] >= mWaveTables[waveTableIindex].mTopFreq) && (waveTableIindex < kNumWaveTableSlots)) {
        waveTableIindex++;
    }
    waveTables[i] = &mWaveTables[waveTableIindex];
}

It's not any faster at all, of course. How would you do the same with SIMD, saving CPU? Any tips or a starting point? I'm on SSE2.

Here's the context of the computation. The topFreq for each wave table is calculated starting from the maximum harmonic count (x2, due to Nyquist), and is doubled for every subsequent table (while the number of harmonics available for each table is halved):

double topFreq = 1.0 / (maxHarmonic * 2);
while (maxHarmonic) {
    // fill the table in with the needed harmonics
    // ... makeWaveTable() code
    
    // prepare for next table
    topFreq *= 2;
    maxHarmonic >>= 1;
}

Then, during processing, for each sample I need to "catch" the correct wave table to use, based on the osc's freq (i.e. phase increment):

freq = clamp(freq, 20.0f, 22050.0f);
phaseIncrement = freq * vSampleTime;

So, for example (with vSampleTime = 1/44100 and maxHarmonic = 500), 30 Hz is wavetable 0, 50 Hz is wavetable 1, and so on.
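To make that mapping concrete, here is a small scalar sketch (the names and constants are mine, mirroring the numbers above) that reproduces the 30 Hz → table 0, 50 Hz → table 1 example, plus the closed-form index computation that chtz hints at in the comments:

```cpp
#include <algorithm>
#include <cmath>

// Assumed constants, matching the example in the question:
// vSampleTime = 1/44100 and maxHarmonic = 500, so the first table's
// topFreq is 1.0 / (500 * 2) = 0.001 in phase-increment units.
constexpr double kSampleRate  = 44100.0;
constexpr int    kMaxHarmonic = 500;
constexpr int    kNumTables   = 9; // 500 >> 1 repeatedly gives 9 tables

// Scalar reference: walk the doubling topFreq ladder, as in the question.
int tableIndexByScan(double freq) {
    const double phaseIncrement = freq / kSampleRate;
    double topFreq = 1.0 / (kMaxHarmonic * 2.0);
    int idx = 0;
    while (phaseIncrement >= topFreq && idx < kNumTables - 1) {
        topFreq *= 2.0;
        ++idx;
    }
    return idx;
}

// Closed form: since topFreq doubles per table, the index is how many
// doublings fit under the phase increment. May differ from the scan at
// exact power-of-two boundaries due to floating-point rounding.
int tableIndexDirect(double freq) {
    const double phaseIncrement = freq / kSampleRate;
    const double ratio = phaseIncrement * (kMaxHarmonic * 2.0); // pi / topFreq[0]
    const int idx = (ratio < 1.0) ? 0 : (int)std::floor(std::log2(ratio)) + 1;
    return std::min(idx, kNumTables - 1);
}
```

The closed form removes the data-dependent loop entirely, which is often the easiest way to "SIMD" a search like this: four `log2` computations vectorize trivially.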

markzzz
    This is a very small code snippet, and it misses essential details including various relevant definitions. Still it appears that `waveTables` and `phaseIncrement` both have 4 elements. That's unlikely to be big enough to justify SIMD operations. Also, for these kinds of conditional operations you should be looking at AVX-512, which is 5 generations newer than your SSE2 baseline. (skipping SSE3, SSE4, AVX, AVX2) – MSalters Aug 16 '21 at 10:14
  • @MSalters I'm inside an osc (audio), processing 4 voices per sample, 16 voices, 4x osc. It will run that snippet often :) – markzzz Aug 16 '21 at 10:20
  • 1
    You have to show more context (i.e., a [mre]). SIMD optimization often starts with choosing appropriate data structures (e.g., SoA instead of AoS). – chtz Aug 16 '21 at 10:54
  • Looking at it even further, I also note that the `waveTableIindex < kNumWaveTableSlots` check happens **after** it's used as an array index. That is almost certainly the wrong order. But it shows the complexity of porting such algorithms to SIMD; you need to have the whole picture. A simple fix here would be to set `mWaveTables[last].mTopFreq` to the maximum possible value of `phaseIncrement` (i.e. make it a sentinel). – MSalters Aug 16 '21 at 10:59
  • 2
    It depends mostly on what's inside the while loop, and what kind of condition is being checked, so this isn't a very descriptive title. e.g. linear search is easy to SIMD, something with a serial dependency on calculations in the previous iteration is usually much harder (unless you can do 4 independent serial dep chains in parallel, possibly with a different data layout to make that work.) – Peter Cordes Aug 16 '21 at 11:33
  • I've added some details which can help to catch the context. Is it clearer now? Let me know, else I'll edit further... – markzzz Aug 16 '21 at 11:45
  • "30hz is wavetable 0, 50hz is wavetable 1" contradicts the `topFreq *= 2;`-line. You did not post a [mre]. It may be possible to explicitly compute the index from the given frequency – chtz Aug 16 '21 at 16:24
  • @chtz why does it contradict? 30*500=15000 is lower than half the sample rate (22050 in this case), so the first table can be chosen. 50*500=25000 is above, so the second wavetable. And so on, in 2^n steps (i.e. topFreq *= 2, as stated). I must take the first wavetable which doesn't go over the limit dictated by topFreq, for each table. – markzzz Aug 16 '21 at 16:29
  • @PeterCordes is the question more clear now? Should I re-edit the title? Linear search? – markzzz Aug 17 '21 at 06:57
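MSalters' sentinel fix from the comments above can be sketched like this; the names mirror the question's snippet, while the table count and topFreq values are my assumptions:

```cpp
#include <limits>

struct WaveTable { double mTopFreq; /* ... waveform data ... */ };

constexpr int kNumWaveTableSlots = 9; // assumed, from maxHarmonic = 500

// One extra sentinel slot whose mTopFreq can never be reached: the scan
// then stops on the comparison alone, and the bounds check (which the
// original code performed only after the out-of-range access) disappears.
WaveTable mWaveTables[kNumWaveTableSlots + 1];

void initTables() {
    double topFreq = 0.001; // 1.0 / (500 * 2), doubling per table
    for (int i = 0; i < kNumWaveTableSlots; ++i) {
        mWaveTables[i].mTopFreq = topFreq;
        topFreq *= 2.0;
    }
    mWaveTables[kNumWaveTableSlots].mTopFreq =
        std::numeric_limits<double>::infinity(); // sentinel
}

WaveTable* findTable(double phaseIncrement) {
    int idx = 0;
    while (phaseIncrement >= mWaveTables[idx].mTopFreq)
        ++idx; // the sentinel guarantees idx stops at kNumWaveTableSlots
    return &mWaveTables[idx];
}
```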

3 Answers

2

Assuming your values are FP32, I would do it like this. Untested.

const __m128 phaseIncrements = _mm_loadu_ps( phaseIncrement );
__m128i indices = _mm_setzero_si128();
__m128i activeIndices = _mm_set1_epi32( -1 );

for( size_t idx = 0; idx < kNumWaveTableSlots; idx++ )
{
    // Broadcast the mTopFreq value into an FP32 vector. If you build this for AVX1, this becomes 1 very fast instruction.
    const __m128 topFreq = _mm_set1_ps( mWaveTables[ idx ].mTopFreq );
    // Compare for phaseIncrements >= topFreq
    const __m128 cmp_f32 = _mm_cmpge_ps( phaseIncrements, topFreq );
    // The following line compiles into no instruction, it's only to please the type checker
    __m128i cmp = _mm_castps_si128( cmp_f32 );
    // Bitwise AND with activeIndices
    cmp = _mm_and_si128( cmp, activeIndices );
    // The following line increments the indices vector by 1, only the lanes where cmp was TRUE
    indices = _mm_sub_epi32( indices, cmp );
    // Update the set of active lane indices
    activeIndices = cmp;
    // The vector may become completely zero, meaning all 4 lanes have encountered at least 1 value where topFreq < phaseIncrements
    if( 0 == _mm_movemask_epi8( activeIndices ) )
        break;
}

// Indices vector keeps 4 32-bit integers
// Each lane contains index of the first table entry less than the corresponding lane of phaseIncrements
// Or maybe kNumWaveTableSlots if not found
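If the caller then needs the four lane values as scalars (see the comments below about `_mm_cvtsi128_si32`), a plain-SSE2 sketch could look like this; `extractIndices` is a hypothetical helper name:

```cpp
#include <emmintrin.h>

// Extract the four 32-bit lanes of an SSE2 vector without SSE4.1's
// _mm_extract_epi32: shuffle each lane into position 0, then movd.
static inline void extractIndices(__m128i indices, int out[4]) {
    out[0] = _mm_cvtsi128_si32(indices);
    out[1] = _mm_cvtsi128_si32(_mm_shuffle_epi32(indices, _MM_SHUFFLE(1, 1, 1, 1)));
    out[2] = _mm_cvtsi128_si32(_mm_shuffle_epi32(indices, _MM_SHUFFLE(2, 2, 2, 2)));
    out[3] = _mm_cvtsi128_si32(_mm_shuffle_epi32(indices, _MM_SHUFFLE(3, 3, 3, 3)));
}
```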
Soonts
  • First question: I can broadcast topFreq when building the tables. Would this be even faster? i.e., is loading a single float and broadcasting to `__m128` slower than loading a `__m128` directly? – markzzz Aug 17 '21 at 19:19
  • 2
    @markzzz Without AVX, that `_mm_set1_ps` compiles into `_mm_load_ss` + `_mm_shuffle_ps` https://godbolt.org/z/TP4zvMWMr These shuffles are fast at 1 cycle of latency, and the rest of the code in that function doesn’t use any shuffles. AVX with `_mm_broadcast_ss` is better but not by much. When loading 16-byte vectors, you gonna need to read 4x as many cache lines. Not always the case but I think in this particular one, extra RAM bandwidth is more expensive than two saved instructions (two because the load gonna merge into comparison). – Soonts Aug 17 '21 at 21:34
  • Thanks. Tried your code, but I don't see any gain over the "scalar" code I posted. The compiler already did some great stuff, I guess... – markzzz Aug 18 '21 at 10:14
  • @markzzz Assuming the profiler tells you the `for` loop in question is what's taking the time, one optimization you can try is refactoring the data structure so the `mTopFreq` values are contiguous in memory. This way the `for` loop gonna load from sequential addresses, and once you find all 4 indices, use them to index another pointer with the payload you have in these tables. – Soonts Aug 18 '21 at 12:51
  • One more thing, if the next code needs scalar indices to compute these pointers, the right way to extract them from the vector is `_mm_cvtsi128_si32` + for non-zero lanes some other instruction to shuffle the vector like `_mm_srli_si128` or `_mm_shuffle_epi32`. Or if you have SSE 4.1 use `_mm_extract_epi32` – Soonts Aug 18 '21 at 12:52
  • Not sure about the topFreq refactoring suggestion. I can place them contiguous in memory, such as topFreqs[kNumWaveTableSlots], of course, but why will this speed things up? Should I access it in a different way? Can you show me an example? Thanks – markzzz Aug 18 '21 at 13:03
  • 1
    @markzzz RAM is a block device these days. Inside CPUs the block size is 64 bytes (cache line), outside the chip often 16 bytes (DDR3 and DDR4 modules deliver 8 bytes per transaction, and well-built computers have dual channel RAM). If you’ll rework to `float topFreqs[kNumWaveTableSlots];` your code gonna need much fewer blocks of memory to complete the loop. Also improves caching, `topFreqs` table has higher chance to fit and stay in L1D cache. – Soonts Aug 18 '21 at 13:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236137/discussion-between-markzzz-and-soonts). – markzzz Aug 18 '21 at 13:23
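The SoA refactor suggested in the comments above could look roughly like this; the struct layout, names, and values are my assumptions, not the OP's actual code:

```cpp
#include <cstddef>

constexpr std::size_t kNumWaveTableSlots = 9; // assumed

// Hypothetical SoA layout: the hot mTopFreq values live in one small
// contiguous array, so the search touches one or two cache lines
// instead of striding through whole WaveTable structs.
struct WaveTableBank {
    float topFreqs[kNumWaveTableSlots]; // scanned every sample
    // cold payload (sample buffers, etc.) lives elsewhere,
    // indexed once after the slot is found
};

std::size_t findSlot(const WaveTableBank& bank, float phaseIncrement) {
    std::size_t idx = 0;
    while (idx < kNumWaveTableSlots - 1 && phaseIncrement >= bank.topFreqs[idx])
        ++idx;
    return idx;
}
```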
0

There is no standard way to write SIMD instructions in C++. A compiler may produce SIMD instructions when appropriate, as long as you've configured it to target a CPU that supports such instructions and enabled the relevant optimisations. You can use standard algorithms with `std::execution::unsequenced_policy` to help the compiler understand that SIMD is appropriate.
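A minimal sketch of that suggestion, assuming a C++20 standard library that actually implements the unsequenced policy (the function name is mine):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Element-wise multiply; std::execution::unseq tells the compiler the
// iterations may be interleaved on one thread, i.e. vectorized.
std::vector<float> scale(const std::vector<float>& a,
                         const std::vector<float>& b) {
    std::vector<float> out(a.size());
    std::transform(std::execution::unseq,
                   a.begin(), a.end(), b.begin(), out.begin(),
                   [](float x, float y) { return x * y; });
    return out;
}
```

Note that this is a hint, not a guarantee: whether SIMD code is actually emitted still depends on the compiler and optimisation flags.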

eerorika
  • [Inline assembly](https://en.cppreference.com/w/c/language/asm) is another option, although it is generally better to rely on compiler optimizations and profiling. – Yun Aug 16 '21 at 09:14
  • 5
    @Yun: That is almost certainly excessive. SSE intrinsics are a much better choice. Let the compiler deal with register allocations etcetera. – MSalters Aug 16 '21 at 10:10
-1

If you are using GCC/G++ or Clang, there is a non-standard language extension for vectors, using `__attribute__ ((vector_size (xx)))`. See the GCC manual for details:

https://gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc/Vector-Extensions.html#Vector-Extensions
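For instance, the question's `phaseIncrement >= topFreq` comparison could be sketched with this extension like so (assuming GCC or Clang; the typedefs and function name are mine):

```cpp
typedef float v4sf __attribute__((vector_size(16))); // 4 x float
typedef int   v4si __attribute__((vector_size(16))); // 4 x int

// Element-wise compare: each lane of the result is -1 (all bits set)
// where phaseIncrement >= topFreq, else 0 -- the same mask the SSE2
// intrinsics produce, but written with plain operators; the compiler
// picks the instructions for the target it's built for.
v4si ge_mask(v4sf phaseIncrement, v4sf topFreq) {
    return phaseIncrement >= topFreq;
}
```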

doron