3

I have a large uint8_t array (size = 1824 * 942). I want to apply the same operation to each element. Specifically, I need to subtract 15 from each element.

This array is refreshed 20 times per second, so time is an issue and I'm avoiding loops over the array.

Is there an easy way to do this?

Maxim Egorushkin
Ivan
  • `std::transform ` might be the right tool to do this. – ichherzcplusplus Oct 17 '19 at 10:13
  • 2
    Performing "the same operation to each element" means doing the same operation over and over again, once for each element. That "repeating" is basically "looping", so doing it without "looping" is logically impossible. – Galik Oct 17 '19 at 10:13
  • Unroll your loop. Create 1824*942 statements, that each change one element. That will avoid a loop. Whether it performs better (by any measure) than a loop is something you can only determine by testing. More generally, though, this is an XY problem. – Peter Oct 17 '19 at 10:16
  • 2
    Unless for display purposes (then there could be other options on the side of the graphics card), subtracting 15 must not be the only operation you apply to the pixels, and the total cost must be much higher. But maybe you needn't process all pixels in fact ? As usual, no good answer can be given without more context. –  Oct 17 '19 at 10:38
  • I'm receiving an image from an RGGB sensor, and I need to compensate for the data pedestal (15) to get the true black value. That's pretty much the context. After this I perform debayering + CCM + white balance + gamma, which are operations performed on 3 channels. I want to compensate for the data pedestal in 1 channel to reduce CPU consumption. I'll try some of the solutions here, and if it's too slow I will try compensating in each of the 3 channels (where I'm already doing processing which loops). – Ivan Oct 17 '19 at 12:25

4 Answers

5

You can just write a function with a plain loop:

void add(uint8_t* a, size_t a_len, uint8_t b) {
    for(uint8_t* ae = a + a_len; a < ae; ++a)
        *a += b;
}

Then hope that the compiler vectorizes it for you, which it does here (see the generated assembly).

Solutions with std::for_each and std::transform such as:

void add(uint8_t* a, size_t a_len, uint8_t b) {
    std::transform(a, a + a_len, a, [b](auto value) { return value + b; });
}

should generate exactly the same code, but sometimes they don't.
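Since the question asks for subtraction, here is a minimal sketch of how the same kind of function can cover it. Unsigned 8-bit arithmetic is modulo 256, so adding static_cast<uint8_t>(-15), that is 241, subtracts 15 from every element. The name add_wrapping and the explicit cast inside the lambda are my additions for illustration. Note that elements below 15 wrap around rather than clamp to 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Adds b to every element, modulo 256 (unsigned 8-bit wrap-around).
void add_wrapping(std::uint8_t* a, std::size_t a_len, std::uint8_t b) {
    std::transform(a, a + a_len, a, [b](std::uint8_t value) {
        // value + b is computed as int; the cast makes the truncation explicit.
        return static_cast<std::uint8_t>(value + b);
    });
}
```

A call such as add_wrapping(data, n, static_cast<std::uint8_t>(-15)) then subtracts 15 from each of the n elements.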


[Updated]

Out of curiosity, I benchmarked the following solutions:

#include <benchmark/benchmark.h>

#include <cstdint>
#include <array>
#include <algorithm>

#include <immintrin.h>

constexpr size_t SIZE = 1824 * 942;
alignas(32) std::array<uint8_t, SIZE> A;

__attribute__((noinline)) void add_loop(uint8_t* a, size_t a_len, uint8_t b) {
    for(uint8_t* ae = a + a_len; a < ae; ++a)
        *a += b;
}

__attribute__((noinline)) void add_loop_4way(uint8_t* a, size_t a_len, uint8_t b) {
    a_len /= 4;
    for(uint8_t* ae = a + a_len; a < ae; ++a) {
        a[a_len * 0] += b;
        a[a_len * 1] += b;
        a[a_len * 2] += b;
        a[a_len * 3] += b;
    }
}

__attribute__((noinline)) void add_transform(uint8_t* a, size_t a_len, uint8_t b) {
    std::transform(a, a + a_len, a, [b](auto value) { return value + b; });
}

inline void add_sse_(__m128i* sse_a, size_t a_len, uint8_t b) {
    __m128i sse_b = _mm_set1_epi8(b);
    for(__m128i* ae = sse_a + a_len / (sizeof *sse_a / sizeof b); sse_a < ae; ++sse_a)
        *sse_a = _mm_add_epi8(*sse_a, sse_b);
}

__attribute__((noinline)) void add_sse(uint8_t* a, size_t a_len, uint8_t b) {
    add_sse_(reinterpret_cast<__m128i*>(a), a_len, b);
}

inline void add_avx_(__m256i* avx_a, size_t a_len, uint8_t b) {
    __m256i avx_b = _mm256_set1_epi8(b);
    for(__m256i* ae = avx_a + a_len / (sizeof *avx_a / sizeof b); avx_a < ae; ++avx_a)
        *avx_a = _mm256_add_epi8(*avx_a, avx_b);
}

__attribute__((noinline)) void add_avx(uint8_t* a, size_t a_len, uint8_t b) {
    add_avx_(reinterpret_cast<__m256i*>(a), a_len, b);
}

template<decltype(&add_loop) F>
void B(benchmark::State& state) {
    for(auto _ : state)
        F(A.data(), A.size(), 15);
}

BENCHMARK_TEMPLATE(B, add_loop);
BENCHMARK_TEMPLATE(B, add_loop_4way);
BENCHMARK_TEMPLATE(B, add_transform);
BENCHMARK_TEMPLATE(B, add_sse);
BENCHMARK_TEMPLATE(B, add_avx);

BENCHMARK_MAIN();

Results on i7-7700k CPU and g++-8.3 -DNDEBUG -O3 -march=native -mtune=native:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
B<add_loop>                  31589 ns        31589 ns        21981
B<add_loop_4way>             30030 ns        30030 ns        23265
B<add_transform>             31590 ns        31589 ns        22159
B<add_sse>                   39993 ns        39992 ns        17403
B<add_avx>                   31588 ns        31587 ns        22161

Times for the loop, transform and AVX2 versions are practically identical.

The hand-written SSE version is slower because it is limited to 128-bit vectors, whereas the compiler auto-vectorizes the plain loop with faster 256-bit AVX2 code.

perf report shows a ~50% L1d-cache miss rate, which indicates that the algorithm is bottlenecked by memory access. Modern CPUs can handle multiple memory accesses simultaneously, so you can squeeze out an extra ~5% of performance here by accessing 4 regions of memory in parallel, which is what the 4-way loop does (for your particular array size, 4 ways is the fastest). See Memory-level parallelism: Intel Skylake versus Intel Cannonlake for more details.
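The 4-way loop in the benchmark relies on the array length being a multiple of 4, which holds for 1824 * 942. As a sketch only, a variant that also handles other lengths could look like the following; the name add_loop_4way_any is hypothetical, and I have not benchmarked whether it keeps the same ~5% gain.

```cpp
#include <cstddef>
#include <cstdint>

// Processes four equally sized regions per iteration, then any leftover tail.
void add_loop_4way_any(std::uint8_t* a, std::size_t a_len, std::uint8_t b) {
    const std::size_t quarter = a_len / 4;
    for (std::size_t i = 0; i < quarter; ++i) {
        a[quarter * 0 + i] += b;
        a[quarter * 1 + i] += b;
        a[quarter * 2 + i] += b;
        a[quarter * 3 + i] += b;
    }
    // Up to three elements remain when a_len is not a multiple of 4.
    for (std::size_t i = quarter * 4; i < a_len; ++i)
        a[i] += b;
}
```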

Maxim Egorushkin
2

You could use std::for_each:

uint8_t value = 15;
std::for_each(std::begin(nums), std::end(nums), [value](uint8_t& num) { num -= value; });

where nums is an array of uint8_t.
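One caveat: num -= value wraps around for elements smaller than 15 (for example, 10 becomes 251). If the goal is pedestal compensation, as the asker's comment suggests, clamping at 0 may be what is wanted; that is an assumption about the use case, not something the question states. A minimal sketch, with the hypothetical name subtract_saturating:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Subtracts pedestal from each element, clamping at 0 instead of wrapping.
void subtract_saturating(std::uint8_t* a, std::size_t a_len, std::uint8_t pedestal) {
    std::for_each(a, a + a_len, [pedestal](std::uint8_t& num) {
        num = num > pedestal ? static_cast<std::uint8_t>(num - pedestal)
                             : std::uint8_t{0};
    });
}
```

Whether the compiler vectorizes this as well as the wrapping version would need to be measured.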

jignatius
0

This should be the fastest way to do it:

#include <iostream>
#include <cstdint>
#include <array>
#include <algorithm>
#include <execution>


int main() {
    constexpr size_t  size = 1824 * 942;
    uint16_t input{};
    std::cout << "Initialize with: ";
    std::cin >> input;
    std::array<uint8_t, size> array{};
    std::fill(std::execution::par_unseq, array.begin(), array.end(), input);

    std::transform(std::execution::par_unseq,array.begin(), array.end(), array.begin(), [] (const auto& value) { return value + 15; });

    std::for_each(array.begin(),array.end(), [] (auto value) {
        std::cout << static_cast<uint16_t>(value) << ",";
    });
    std::cout << "\n";
}

Note that the significant line is std::transform(std::execution::par_unseq,array.begin(), array.end(), array.begin(), [] (const auto& value) { return value + 15; }); the rest is for example's sake.

Also note that, since you did not specify which array type you use, you can convert built-in arrays like uint8_t array[1824*942]; to a std::array with std::to_array.
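As an alternative to converting, the standard algorithms also accept a built-in array directly through std::begin and std::end. The sketch below uses a small hypothetical array named pixels in place of the full-size buffer and omits the execution policy, so it compiles without parallel-algorithm support.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// A small built-in array stands in for the full-size image buffer.
std::uint8_t pixels[6] = {20, 30, 40, 50, 60, 70};

// Adds 15 to every element of the built-in array in place.
void add_15_to_pixels() {
    std::transform(std::begin(pixels), std::end(pixels), std::begin(pixels),
                   [](std::uint8_t value) {
                       return static_cast<std::uint8_t>(value + 15);
                   });
}
```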

Superlokkus
  • What compilers do actually implement `std::execution::par_unseq` today? – Maxim Egorushkin Oct 17 '19 at 11:14
  • Don't know, don't care, since it's much better to actually measure the different solutions than to speculate about the performance of one assembly. – Superlokkus Oct 17 '19 at 11:21
  • 1
    What have you measured then? – Maxim Egorushkin Oct 17 '19 at 11:23
  • With given size: Started operation Array Operations Array step Fill took 4 ms Operations Array step Add took 0 ms Operations Array step std::transform took 1 ms (1ms probably digit error) With 182400 * 942 Operations Array step Fill took 91 ms Operations Array step Add took 19 ms Operations Array step std::transform took 49 ms – Superlokkus Oct 17 '19 at 11:37
  • I am a bit disappointed (btw, for the larger size I used a vector, of course) – Superlokkus Oct 17 '19 at 11:38
  • At least they both scale very well and yours is consistently 2.5x as fast: 1824000 * 942 Started operation Array Operations Array step Fill took 873 ms Operations Array step std::transform took 506 ms Operations Array step Add took 194 ms – Superlokkus Oct 17 '19 at 11:42
  • But @jugnatius answer is as fast as yours (<10% diff) but is a lot cleaner Operations Array step Fill took 885 ms Operations Array step Add took 197 ms Operations Array step for_each took 207 ms Operations Array step std::transform took 510 ms – Superlokkus Oct 17 '19 at 11:50
  • "Cleaner" is rather subjective. Adding levels of indirection makes the code harder to read and may introduce unexpected performance regressions, as you just discovered. – Maxim Egorushkin Oct 17 '19 at 12:00
  • 1
    That's true, but @jugnatius answer is commonly perceived as cleaner, as it is only one standard function call, with the same performance. Also no chance to use a wrong size/len. That's why I upvoted him. – Superlokkus Oct 17 '19 at 12:29
-1

You can create a struct (or class) that holds the parameter common to all of the elements in your array.

#include <cstdint>

struct nameIt
{
    uint8_t* arr;   // pointer to the array data
    uint8_t delta;  // value common to all elements, e.g. 15
};
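The answer does not say how the struct would be used. One possible reading, which is my assumption and not stated by the author, is to leave the buffer untouched and apply the offset only when an element is read, for instance inside a later processing step that already loops over the data. A minimal sketch with the hypothetical name OffsetView:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical view type: keeps the raw buffer untouched and applies the
// offset only when an element is read.
struct OffsetView {
    const std::uint8_t* arr;
    std::uint8_t delta;

    // Returns arr[i] - delta, modulo 256.
    std::uint8_t at(std::size_t i) const {
        return static_cast<std::uint8_t>(arr[i] - delta);
    }
};
```

With OffsetView view{data, 15}, a call to view.at(i) yields the compensated value without a separate pass over the array.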