
I thought a memory lookup would be faster than the multiplication and division (even when compiler-optimized) done in alpha blending, but it wasn't as fast as I expected.

The 16 megabytes used for the table is not an issue in this case, but it is a problem if the table lookup turns out to be no faster, or even slower, than doing all the CPU calculations.

Can anyone explain what is happening here and why? Would the table lookup win out on a slower CPU?

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>

#define COLOR_MAX UCHAR_MAX

typedef unsigned char color;

color (*blending_table)[COLOR_MAX + 1][COLOR_MAX + 1];

static color blend(unsigned int destination, unsigned int source, unsigned int a) {
    return (source * a + destination * (COLOR_MAX - a)) / COLOR_MAX;
}

void initialize_blending_table(void) {
    int destination, source, a;

    blending_table = malloc((COLOR_MAX + 1) * sizeof *blending_table);
    for (destination = 0; destination <= COLOR_MAX; ++destination) {
        for (source = 0; source <= COLOR_MAX; ++source) {
            for (a = 0; a <= COLOR_MAX; ++a) {
                blending_table[destination][source][a] = blend(destination, source, a);
            }
        }
    }
}

struct timer {
    double start;
    double end;
};

void timer_start(struct timer *self) {
    self->start = clock();
}

void timer_end(struct timer *self) {
    self->end = clock();
}

double timer_measure_in_seconds(struct timer *self) {
    return (self->end - self->start) / CLOCKS_PER_SEC;
}

#define n 300

int main(void) {
    struct timer timer;
    volatile int i, j, k, l, m;

    timer_start(&timer);
    initialize_blending_table();
    timer_end(&timer);
    printf("init %f\n", timer_measure_in_seconds(&timer));

    timer_start(&timer);
    for (i = 0; i <= n; ++i) {
        for (j = 0; j <= COLOR_MAX; ++j) {
            for (k = 0; k <= COLOR_MAX; ++k) {
                for (l = 0; l <= COLOR_MAX; ++l) {
                    m = blending_table[j][k][l];
                }
            }
        }
    }
    timer_end(&timer);
    printf("table %f\n", timer_measure_in_seconds(&timer));

    timer_start(&timer);
    for (i = 0; i <= n; ++i) {
        for (j = 0; j <= COLOR_MAX; ++j) {
            for (k = 0; k <= COLOR_MAX; ++k) {
                for (l = 0; l <= COLOR_MAX; ++l) {
                    m = blend(j, k, l);
                }
            }
        }
    }
    timer_end(&timer);
    printf("function %f\n", timer_measure_in_seconds(&timer));

    return EXIT_SUCCESS;
}

Result:

$ gcc test.c -O3
$ ./a.out
init 0.034328
table 14.176643
function 14.183924
  • Why are your loop variables declared as `volatile`? It may very well prevent the compiler from performing some optimizations. – Frédéric Hamidi Jun 15 '15 at 15:40
  • @FrédéricHamidi It is necessary for the benchmark. –  Jun 15 '15 at 15:42
  • It's not clear why you deem it necessary, but it will force memory accesses on every iteration instead of, say, going through registers. I'm not sure it's a good idea in your situation. – Frédéric Hamidi Jun 15 '15 at 15:44
  • @FrédéricHamidi But without the `volatile` I cannot guarantee the loop is alive after optimization. It actually disappears by the way. –  Jun 15 '15 at 15:46
  • That's because your compiler thinks it stupid to write to `m` that many times. Allocate an array for `m` and get rid of the volatile. – user3528438 Jun 15 '15 at 15:48
  • `volatile m` is sensible, but none of the other variables should be volatile (see the sketch after these comments). – Ben Voigt Jun 15 '15 at 16:08
  • Using a look-up table for such basic computations is generally going to be considerably slower than just performing the computations outright, and almost certainly so when your data is way too big to fit into L1. About the cheapest computation I've seen improved with a LUT of that size is computing `log` for double-precision numbers with additional multiplications and divisions thrown in, and that's considerably more expensive than what you are doing for alpha blending. –  Jun 15 '15 at 21:37
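Following Ben Voigt's comment, here is a minimal sketch of the table loop with only the result marked volatile, so the loop counters can stay in registers while the volatile store still keeps the loop alive under -O3. It is meant as a drop-in replacement for the declarations and the table loop in the question's main() and reuses the question's blending_table, COLOR_MAX and n; it is my illustration, not code from the original post.

    /* Only the result is volatile; the loop counters are ordinary ints. */
    int i, j, k, l;
    volatile int m;

    for (i = 0; i <= n; ++i) {
        for (j = 0; j <= COLOR_MAX; ++j) {
            for (k = 0; k <= COLOR_MAX; ++k) {
                for (l = 0; l <= COLOR_MAX; ++l) {
                    /* The store to the volatile m cannot be removed, so the
                       table read is performed on every iteration. */
                    m = blending_table[j][k][l];
                }
            }
        }
    }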

2 Answers


Table lookup is not a panacea. It helps when the table is small enough, but in your case the table is very big. You write

16 megabytes used for the table is not an issue in this case

which I think is very wrong, and is possibly the source of the problem you are seeing. 16 megabytes is far too big for the L1 cache (256 × 256 × 256 one-byte entries is 16 MiB, while a typical L1 data cache is on the order of 32 KiB), so reading data from random indices in the table will involve the slower caches (L2, L3, etc.). The penalty for cache misses is typically large; your blending computation would have to be very complex for the LUT solution to be faster.

Read the Wikipedia article for more info.

anatolyg

Your benchmark is hopelessly broken: it makes the LUT look a lot better than it actually is, because it reads the table in order.

If your performance results show the LUT as worse than direct calculation even with this sequential access, then with real-world random access patterns and the cache misses they cause, the LUT is going to be much worse.
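To see this directly, you can pre-generate random indices and time lookups through them instead of sweeping the table in order. Below is a rough sketch that reuses the question's blending_table, struct timer, COLOR_MAX and timer functions; the function name benchmark_random_lookups, the idx array and the LOOKUPS count are illustrative choices of mine, not from the original post.

#define LOOKUPS (1 << 22)

/* Time LOOKUPS table reads at pre-generated random (destination, source,
   alpha) triples, so the access pattern looks like real blending work
   rather than a sequential sweep. */
void benchmark_random_lookups(void) {
    static unsigned char idx[LOOKUPS][3]; /* random index triples */
    struct timer timer;
    volatile int m;
    long i;

    srand(42);
    for (i = 0; i < LOOKUPS; ++i) {
        idx[i][0] = rand() & COLOR_MAX;
        idx[i][1] = rand() & COLOR_MAX;
        idx[i][2] = rand() & COLOR_MAX;
    }

    timer_start(&timer);
    for (i = 0; i < LOOKUPS; ++i) {
        m = blending_table[idx[i][0]][idx[i][1]][idx[i][2]];
    }
    timer_end(&timer);
    printf("random table %f\n", timer_measure_in_seconds(&timer));
    (void)m;
}

Calling it from main() after initialize_blending_table(), the per-lookup time would be expected to rise well above the sequential figure once most accesses miss the caches.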

Focus on improving the computation, and enabling vectorization. It's likely to pay off far better than a table-based approach.

(source * a + destination * (COLOR_MAX - a)) / COLOR_MAX

with rearrangement becomes

(source * a + destination * COLOR_MAX - destination * a) / COLOR_MAX

which simplifies to

destination + (source - destination) * a / COLOR_MAX

which has one multiply and one division by a constant, both of which are very efficient. And it is easily vectorized.

You should also mark your helper function as inline, although a good optimizing compiler is probably inlining it anyway.
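Here is a sketch of the rearranged blend as an inline helper, using the question's color and COLOR_MAX definitions. One caveat that is mine, not from this answer: the intermediate (source - destination) must be computed in a signed type, because with the question's unsigned int parameters the subtraction would wrap when source < destination; and since C truncates the negative quotient toward zero, the result can differ by one from the original formula for a few inputs.

/* Rearranged blend: one multiply and one division by the constant 255.
   Signed parameters so (source - destination) may legitimately go negative. */
static inline color blend_fast(int destination, int source, int a) {
    return (color)(destination + (source - destination) * a / COLOR_MAX);
}

With -O3, GCC and Clang typically replace the division by the constant 255 with a multiply-and-shift sequence, and a loop applying this per pixel is a good candidate for auto-vectorization.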

Ben Voigt