
I am wondering how to make sure I take advantage of cpu pipelining in the following audio code:

int sample_count = 100;
// volume array - value to multiply audio sample by
double volume[4][4];
// fill volume array with values here

// audio sample array - really this is 125 samples by 16 channels but smaller here for clarity
double samples[sample_count][4];
// fill samples array with audio samples here
double tmp[4];

for (int x = 0; x < sample_count; x++) {
    tmp[0] = samples[x][0]*volume[0][0] + samples[x][1]*volume[1][0] + samples[x][2]*volume[2][0] + samples[x][3]*volume[3][0]; 
    tmp[1] = samples[x][0]*volume[0][1] + samples[x][1]*volume[1][1] + samples[x][2]*volume[2][1] + samples[x][3]*volume[3][1]; 
    tmp[2] = samples[x][0]*volume[0][2] + samples[x][1]*volume[1][2] + samples[x][2]*volume[2][2] + samples[x][3]*volume[3][2]; 
    tmp[3] = samples[x][0]*volume[0][3] + samples[x][1]*volume[1][3] + samples[x][2]*volume[2][3] + samples[x][3]*volume[3][3]; 

    samples[x][0] = tmp[0];
    samples[x][1] = tmp[1];
    samples[x][2] = tmp[2];
    samples[x][3] = tmp[3];
}

// write sample array out to hardware here.

In case it's not immediately clear, this mixes the 4 input channels into 4 output channels via a 4x4 matrix of volume controls.

I'm actually executing this far more intensively than the example above, and I'm not sure how to tailor my code to take advantage of pipelining (which this seems well suited for). Should I perhaps work on one 'channel' of the samples array at a time, so that the same value can be operated on several times (for sequential samples of the same channel)? That way, however, I would have to check x against sample_count four times as often. I could make tmp two-dimensional and large enough to hold the full buffer, if working through it that way would make the CPU pipeline efficiently. Or will the above code already pipeline efficiently? Is there an easy way to check whether pipelining is happening? TIA.
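To make the alternative concrete, the channel-at-a-time restructuring I have in mind would look roughly like this (untested sketch; chan_tmp is just an illustrative name for the two-dimensional tmp):

double chan_tmp[4][sample_count];   // tmp made 2-D, big enough for the whole buffer

for (int j = 0; j < 4; j++) {                  // one output channel at a time
    for (int x = 0; x < sample_count; x++) {   // loop bound now checked 4x as often
        chan_tmp[j][x] = samples[x][0] * volume[0][j]
                       + samples[x][1] * volume[1][j]
                       + samples[x][2] * volume[2][j]
                       + samples[x][3] * volume[3][j];
    }
}

// copy back so the buffer can be written out to hardware as before
for (int x = 0; x < sample_count; x++)
    for (int j = 0; j < 4; j++)
        samples[x][j] = chan_tmp[j][x];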

Pete
  • Did you consider SIMD intrinsics? – dtech Sep 28 '14 at 16:49
  • There are no dependencies between iterations, so you're already "taking advantage" of pipelining, in the sense that you won't be introducing any stalls. – Oliver Charlesworth Sep 28 '14 at 16:50
  • Keep in mind that when using SIMD explicitly, you are also obligated to follow certain data alignment requirements, which are conveniently also the most efficient way for the particular platform to move the large quantities of data needed to feed the SIMD units, whose throughput is much higher than the ALUs'. – dtech Sep 28 '14 at 21:20
  • Had to read up on SIMD, and it looks like you're right, this is the way to go for the above question. Thanks, I didn't know about this specifically before. I'm still interested in understanding how to tune code to permit efficient pipelining, or I guess maybe it's better to say 'not introducing stalls', as Oli said. If anyone can point me towards a beginner's intro to this subject it would be good. Thanks for the above also. – Pete Oct 04 '14 at 10:58
  • Like another commenter said, your loop has no loop-carried dependency, so your code is already amenable to ILP (instruction-level parallelism) extraction mechanisms like out-of-order execution and branch prediction. You should also take care of your data layout so that, first, it is aligned to a certain granularity for SIMD instructions like SSE/AVX to be effective (see the sketch after these comments), and second, you minimize the number of cache lines you have to read for each iteration of the loop - i.e. improve cache locality. Use -march=native -O3 to enable aggressive optimizations like loop unrolling and vectorization. – fjs Dec 13 '22 at 02:40
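For illustration of the SIMD and alignment suggestions in the comments, here is a minimal AVX sketch of the 4x4 mix (not from the question; it assumes AVX support and C11, keeps the buffers 32-byte aligned, and needs -mavx or -march=native to compile):

#include <immintrin.h>
#include <stdalign.h>

#define SAMPLE_COUNT 100

alignas(32) double volume[4][4];
alignas(32) double samples[SAMPLE_COUNT][4];   /* each 4-double row is 32 bytes, so stays aligned */

void mix_avx(void)
{
    for (int x = 0; x < SAMPLE_COUNT; x++) {
        __m256d out = _mm256_setzero_pd();
        for (int i = 0; i < 4; i++) {
            /* broadcast input channel i and multiply by row i of the volume matrix */
            __m256d s    = _mm256_set1_pd(samples[x][i]);
            __m256d vrow = _mm256_load_pd(volume[i]);
            out = _mm256_add_pd(out, _mm256_mul_pd(s, vrow));
        }
        _mm256_store_pd(samples[x], out);   /* write all 4 output channels at once */
    }
}

Note that with -O3 -march=native a compiler will often auto-vectorize the original scalar loop into much the same code, so it is worth checking the generated assembly before hand-writing intrinsics.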

0 Answers