
Say I have to combine two arrays such as A[MAX_BUFFER] and B[MAX_BUFFER] (with MAX_BUFFER = 256).

For some reason, the values of B are calculated at a fixed control rate (8, for example), since each value is heavily processed.

Later, I need to combine them into C[MAX_BUFFER], taking the (introduced) different spacing into account. So with A holding 256 values, B effectively holds fewer (32 in this example, since the control rate is 8).

Here's an example code:

#include <iostream>
#include <cmath>

#define MAX_BUFFER 256

double HeavyFunction(double value) {
    if (value == 0) return 0.0;

    return std::pow(10.0, value); // heavy operations on value...
}

int main()
{    
    int blockSize = 256;
    int controlRate = 8;

    double A[MAX_BUFFER];
    double B[MAX_BUFFER];
    double C[MAX_BUFFER];

    // fill A
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        A[sampleIndex] = sampleIndex;
    }

    // fill B (control rated)
    int index = 0;
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += controlRate, index++) {
        B[index] = HeavyFunction(index);
    }

    // calculate C
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {     
        C[sampleIndex] = A[sampleIndex] + B[sampleIndex / 8];

        std::cout << C[sampleIndex] << std::endl;
    }
}

I need performance, since I'll run lots of these operations in parallel, processing a lot of data every second (something like 44100 samples split into blocks of blockSize <= MAX_BUFFER).

I'd like to avoid branches (i.e. if) and divisions (as in the example above), which are expensive when processing a large amount of data.

In the example above, this introduces a "futile" division for every sample; consider what that costs if I call this procedure for millions of samples...

How would you refactor this code in an elegant, CPU-friendly way?

markzzz
  • Any decent optimizing compiler will replace the `/8` by a right shift. Look at the assembly generated before worrying about array indexing "overhead", there might very well be none at all. – Mat Oct 26 '18 at 07:21
  • @Mat: I'd like to learn how the compiler can be smarter than me :) An example of this right shift? So I can implement it manually? – markzzz Oct 26 '18 at 07:23
  • 1
    It doesn't sound like you need this level of optimization (naively performing 44100 multiplications in 1 second is trivial unless you have severe hardware constraints), but you'll probably benefit more from using SSE/AVX instructions to do the operation than you will worrying about branching. – Collin Dauphinee Oct 26 '18 at 07:36
  • @markzzz you don't need to do the right shift yourself, the compiler is most likely smart enough to do it for you. – Jabberwocky Oct 26 '18 at 07:36
  • Again: I know the compiler is smart! I'd like to see how the code could look without "division" and "if" :) – markzzz Oct 26 '18 at 07:42
  • @CollinDauphinee: it is if I call that on `44100 * nVoices * nParam * nFilter`. – markzzz Oct 26 '18 at 07:43
  •
    @markzzz look at the [generated assembly code](https://www.godbolt.org/z/sHuXI1), there is no division by 8, the division is done by a shift, look at the `sar eax, 3` instruction. – Jabberwocky Oct 26 '18 at 07:47
  •
    why is HeavyFunction(0) = 0 ? – UmNyobe Oct 26 '18 at 07:55
  •
    That's just an example, don't worry about it :) – markzzz Oct 26 '18 at 07:58
  • @Jabberwocky: it depends. If controlRate can be marked as "const", yes; but if it's dynamic, it won't use sar but division: https://www.godbolt.org/z/CJrsQr – markzzz Oct 26 '18 at 08:31
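To illustrate the point in the last two comments, here is a minimal sketch (not from the original post): if the control rate is a compile-time constant, for instance a template parameter, the compiler can strength-reduce the division to a shift even though the source still says `/`. The function name and signature are illustrative assumptions.

```cpp
#include <cstddef>

// Hypothetical sketch: with ControlRate known at compile time, a decent
// optimizer compiles i / ControlRate as a right shift when ControlRate is
// a power of two, so no runtime division remains in the loop.
template <int ControlRate>
void addControlRated(const double* A, const double* B, double* C, int blockSize) {
    for (int i = 0; i < blockSize; ++i) {
        C[i] = A[i] + B[i / ControlRate]; // emitted as i >> 3 for ControlRate == 8
    }
}

// usage: addControlRated<8>(A, B, C, 256);
```

With a runtime `controlRate` variable the compiler must emit a real `div`/`idiv` (or equivalent), which is the difference visible between the two godbolt links above.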

2 Answers


I think the optimizer might do the job alone, but you can unroll the loop to avoid the division:

// calculate C
const int max = blockSize / 8;
int j = 0;
for (int i = 0; i != max; ++i) {
    const auto b = B[i];
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
    C[j] = A[j] + b; std::cout << C[j] << std::endl; ++j;
}
Jarod42
  •
    Two issues: 1 - the integer division, which is going to become a shift, is the last thing you want to optimize here. 2 - The compiler will not optimize reads from A and B, or writes to C. Given we are in a multithreaded paradigm, doing so might cause side effects. So you really want to save B[i] in a local const variable if you need it 8 times. – UmNyobe Oct 26 '18 at 08:29
  • @UmNyobe: **(1)** is what OP wants, I don't think it is a problem either ;-). **(2)** For multithreading, we should have synchronization with mutex or use atomic variable. Now I use local variable for `B[i]` for possible aliasing if used in function. – Jarod42 Oct 26 '18 at 08:41

How do you iterate simultaneously over two arrays that are not equally spaced, in an optimized way?

Short answer : Focus on HeavyFunction, avoid sharing unnecessary stuff between threads.

Unfortunately your example doesn't fit the question. The arrays

double A[MAX_BUFFER];
double B[MAX_BUFFER];
double C[MAX_BUFFER];

are allocated on the stack by merely moving the stack pointer, so you could say they are very similar to one contiguous array.

Even if they weren't, modern caches are so complex that by trying to micro-optimize you can end up degrading performance.

Assuming you have

const size_t BUFFER_SIZE = 1024 * 1024 * 1024;
std::vector<double> A(BUFFER_SIZE);
std::vector<double> B(BUFFER_SIZE);

A good improvement is

std::vector<double> C{A};
for (int i = 0; i < blockSize / controlRate; i++) {
    const double b = B[i];
    int indexStart = i * controlRate;
    for (int j = 0; j < controlRate; ++j) {
        C[indexStart + j] += b;
    }
}

You read A once (in blocks), read B once (one double at a time), and write C the same number of times.
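A hypothetical variant of the loop above (names and signature are illustrative, not from the answer) that also handles a blockSize that is not a multiple of controlRate, e.g. blockSize 250 with control rate 8: full control blocks are processed first, then a remainder pass reuses the last B value for the leftover samples.

```cpp
#include <vector>

// Sketch: same block-wise traversal as above, plus an explicit remainder
// pass so blockSize need not divide evenly by controlRate.
void addControlRated(const std::vector<double>& A, const std::vector<double>& B,
                     std::vector<double>& C, int blockSize, int controlRate) {
    const int nBlocks = blockSize / controlRate;
    int i = 0;
    for (int block = 0; block < nBlocks; ++block) {
        const double b = B[block];          // read B once per control block
        for (int j = 0; j < controlRate; ++j, ++i)
            C[i] = A[i] + b;
    }
    if (i < blockSize) {                    // partial final block, if any
        const double b = B[nBlocks];
        for (; i < blockSize; ++i)
            C[i] = A[i] + b;
    }
}
```

The remainder pass costs one extra branch per call (not per sample), which is negligible next to the per-sample work.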

UmNyobe
  • `controlRate` and `blockSize` can vary. I can't guarantee they are always exactly divisible. For example, what if I have `blockSize` 250 and control rate 8? It will miss some steps... – markzzz Oct 29 '18 at 18:21