How to perform fast software binning of an image in C++?

Question

I am trying to perform fast binning of a 2D image (stored in row-major order in a 1D array). The image is 12-bit, so I have used the uint16_t data type.

Binning means that, for 2x2, 2 pixels in each direction (x and y) become one pixel; that is, each superpixel contains the mean of those 4 pixels.

I have written the following code to calculate the mean of the pixels and store it in the superpixel while avoiding integer overflow. However, dividing every pixel by "DIVISIONFACTOR" to avoid the overflow slows the function down, as the code has to perform more divisions (which are expensive compared to other arithmetic operations). The code runs significantly faster if I add the values of all pixels (so overflow can happen if the sum of the pixels exceeds 65535) and then divide the sum for a superpixel by "DIVISIONFACTOR" only once.

To be more precise, the code runs in 0.75 ms with the overflow-avoiding code and in 0.13 ms with the code that has the potential for overflow (owing to fewer division operations).

Please suggest ways to optimize further. I am using the Intel C++ compiler, so please don't shy away from suggesting any Intel-specific technique.

Thank you in advance.

Regards,

Harsh

Image Definitions: WIDTH and HEIGHT are actual image dimensions received from Camera and NX and NY are final dimensions.

#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

Code without overflow:

int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j< NY; j++) {
        for (i=0;i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k<i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l<j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * HEIGHT + k] / DIVISIONFACTOR);
                }
            }
        }
    }

    return 0;
}

Code with potential of overflow:

int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j< NY; j++) {
        for (i=0;i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k<i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l<j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += image[l * HEIGHT + k];
                }
            }
            binnedImage[j * NX + i] /= DIVISIONFACTOR;
        }
    }

    return 0;
}

Full executable code:

#include <iostream>
#include <chrono>
#include <cstdint>  // uint16_t, uint32_t, uint64_t
#include <cstdlib>  // malloc, calloc
#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

using namespace std;

int binning(uint16_t* image, uint16_t* binnedImage);


int main() {
    chrono::high_resolution_clock::time_point t0, t1;
    chrono::duration<double> dt;

    double totalTime = 0;
    uint64_t count = 0;
    uint32_t i;
    uint16_t *image = (uint16_t*) malloc(sizeof(uint16_t) * WIDTH * HEIGHT);

    for (i=0; i < WIDTH * HEIGHT; i++) {
        image[i] = 4095;
    }

    uint16_t *binnedIimage = (uint16_t*) calloc(sizeof(uint16_t), NX * NY);

    while(1) {
        t0 = chrono::high_resolution_clock::now();

        binning(image, binnedIimage);

        t1 = chrono::high_resolution_clock::now();

        dt = chrono::duration_cast<chrono::duration<double>>(t1 - t0);

        totalTime += dt.count();

        count += 1;

        if (count % 3000 == 0) {
            cout<<"Time (ms): "<< totalTime * 1000 / count<<"\r";
            cout.flush();
        }
    }

    return 0;
}


int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j< NY; j++) {
        for (i=0;i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k<i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l<j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * HEIGHT + k]/ DIVISIONFACTOR);
                }
            }
        }
    }

    return 0;
}

_"Please suggest ways to optimize further."_ It looks like your question is better placed at https://codereview.stackexchange.com then. — πάντα ῥεῖ, Apr 22 '23 at 11:38
Jumping all over the image like this isn't very cache friendly. Just bin each scanline and accumulate as you go. Do the division at the very end in a single linear pass over `binnedImage`, which will likely help the compiler optimize with vectorized division. — paddy, Apr 22 '23 at 11:58
@paddy that only works if you have a memory region to store the intermediate reduction – if you can do it in-place of the original image, that's fine, but I'm not sure going through a larger array twice is better. But you're definitely right, keeping L1 hot does the trick here. At these sizes, though, that'll probably happen anyways, and things will just start to go sideways once the image size exceeds cache sizes. — Marcus Müller, Apr 22 '23 at 12:04
They already have memory for it: `binnedImage`. It can be used as an accumulator. In-place is also fine. I'm talking about once through the large image, then once through the small one. The divide will just be a shift, if bin sizes are a power of 2. — paddy, Apr 22 '23 at 12:06
@MarcusMüller and paddy, thanks for the suggestion. I understand that if an array is accessed sequentially it results in fewer cache misses. But can you suggest how, in my case, I should accumulate sequentially and divide? Essentially, element (0, 0) of the final image will have the mean of the (0, 0), (0, 1), (1, 0), (1, 1) elements of the original (larger) image, which by nature makes it non-sequential. Please elaborate more. — Harsh M, Apr 22 '23 at 12:38
I already told you. You do it one scanline at a time, calculating partial sums on one row. In the case of 2x2, you'd then just do one more row, now your accumulator has the sum of each superpixel. — paddy, Apr 22 '23 at 12:51
Adding 4 12-bit numbers will never overflow. You can add 16 of them without overflow. — Cris Luengo, Apr 22 '23 at 13:16
Your version without overflow may result in inaccurate results because of truncation of the division result. For 3x3 binning this could be off by 8! E.g. binned values: `(8, 8, 8) (8, 8, 8) (8, 8, 8)` divided and truncated individually this results in `0`, but the expected result would be `8`... — fabian, Apr 22 '23 at 14:22
Let's keep square size a power of 2. Then before division through right shift, add precalculated constant `(1<<(2*n-1))` to create the round to nearest effect. — Red.Wave, Apr 22 '23 at 18:35
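
The comments above can be combined into one routine. What follows is only a minimal sketch under stated assumptions, not code posted by any commenter: the name `binScanlines` and its parameters are hypothetical, the input is assumed to be 12-bit, and the output buffer is reused as the accumulator as paddy suggests. Each source scanline is read once from left to right, the division happens in a single linear pass at the end, and half the divisor is added first to round to nearest (the general form of the shift-based rounding constant mentioned for power-of-two bins).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): bins a row-major image by
// accumulating one source scanline at a time, then divides in one pass.
// Assumes 12-bit pixels, so at most 16 of them fit in a uint16_t sum.
std::vector<std::uint16_t> binScanlines(const std::vector<std::uint16_t>& image,
                                        std::size_t width, std::size_t height,
                                        std::size_t binX, std::size_t binY)
{
    assert(binX > 0 && binY > 0 && width % binX == 0 && height % binY == 0);
    assert(image.size() == width * height);
    assert(binX * binY <= 16); // 16 * 4095 = 65520 still fits in uint16_t

    const std::size_t nx = width / binX;
    const std::size_t ny = height / binY;
    std::vector<std::uint16_t> binned(nx * ny, 0); // doubles as accumulator

    for (std::size_t y = 0; y < height; ++y) {
        const std::uint16_t* src = image.data() + y * width;       // sequential reads
        std::uint16_t* acc = binned.data() + (y / binY) * nx;      // superpixel row
        for (std::size_t x = 0; x < nx; ++x) {
            unsigned sum = 0; // partial sum of binX neighbours on this scanline
            for (std::size_t k = 0; k < binX; ++k) {
                sum += src[x * binX + k];
            }
            acc[x] = static_cast<std::uint16_t>(acc[x] + sum);
        }
    }

    // One linear pass over the small image: divide with round-to-nearest.
    const unsigned divisor = static_cast<unsigned>(binX * binY);
    for (auto& v : binned) {
        v = static_cast<std::uint16_t>((v + divisor / 2) / divisor);
    }
    return binned;
}
```

With nine pixels of value 8 and 3x3 binning, this sums to 72 and then divides once, so the truncation problem fabian describes does not occur.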

Marcus Müller · Answer 1 · 2023-04-22T12:00:37.757

This binning is just a bad 1/2-scaling operator. You can do better, image-wise. This will alias like hell. Essentially, unless you have done the math to prove that a 2×2 average is appropriate for the spectrum of your image, you're probably not doing the right scaling.

Anyways.

Your PC is able to do short multiplications just as fast as int32_t multiplications, and you could get rid of your numerical overflow potential just by doing the summing into an int, then saving the result of the scaling into the target image. This would also have the advantage of giving the compiler a local variable to optimize – which potentially increases its ability to emit
I'm not willing to bet on this too strongly, but % 3000 is an actual integer modulo, and is much more work than multiplying a few integers. So, you're skewing your benchmark by doing that. Maybe just check whether the lowest 12 bits of count are all zero, which happens once every 4096 iterations: if (!(count & ((1<<12)-1))) {…}.
It's a bit stupid to get the current time after every binning, add it to the elapsed time, and then do the next binning. Getting the current time before the thing you want to benchmark, getting it after, subtracting the two and summing up the results of that takes way more time than just having a loop over 3000 invocations of binning, and doing one time-getting before and after. At this point, this might contribute very significantly to your time measurement. I wouldn't trust your measurement at this point.
- I'd strongly suggest you just use Google Benchmark or some other micro-benchmarking tooling instead.
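
As a rough illustration of the first and third points, here is a minimal sketch, not the answerer's code: the names `binningLocalSum` and `meanMillisPerCall` are hypothetical, and the image dimensions are copied from the question. The sum for each superpixel goes into a 32-bit local variable and is divided once, and the clock is read only before and after a whole batch of calls. Note that the row stride used here is `WIDTH`; the question indexes with `HEIGHT`, which only gives the same result because the image is square.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

constexpr std::size_t WIDTH = 768, HEIGHT = 768; // values from the question
constexpr std::size_t NX = 256, NY = 256;
constexpr std::size_t BIN_W = WIDTH / NX, BIN_H = HEIGHT / NY;
constexpr std::uint32_t DIVISION_FACTOR = BIN_W * BIN_H;
static_assert(WIDTH % NX == 0 && HEIGHT % NY == 0, "bins must tile the image");

// Sums each superpixel into a 32-bit local, then divides exactly once.
void binningLocalSum(const std::uint16_t* image, std::uint16_t* binnedImage)
{
    for (std::size_t j = 0; j < NY; ++j) {
        for (std::size_t i = 0; i < NX; ++i) {
            std::uint32_t sum = 0; // cannot overflow for 9 pixels of 12 bits
            for (std::size_t l = j * BIN_H; l < (j + 1) * BIN_H; ++l) {
                for (std::size_t k = i * BIN_W; k < (i + 1) * BIN_W; ++k) {
                    sum += image[l * WIDTH + k];
                }
            }
            binnedImage[j * NX + i] =
                static_cast<std::uint16_t>(sum / DIVISION_FACTOR);
        }
    }
}

// Reads the clock once before and once after `runs` calls and returns the
// mean time per call in milliseconds.
double meanMillisPerCall(const std::uint16_t* image, std::uint16_t* binnedImage,
                         int runs)
{
    assert(runs > 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < runs; ++r) {
        binningLocalSum(image, binnedImage);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
}
```

A hand-rolled loop like this can still be distorted by the optimizer (for example, repeated calls whose result is never read may be elided), which is one reason a micro-benchmarking framework remains the safer choice.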

Stylistically and code-quality-wise, there is lots to improve:

Your binning functions return an int, without actually returning anything useful. Make them void, not int, and remove the return at the end of the function.
Let your code-formatter run on your code. The arbitrarily distributed spacing around operators, especially comparison operators, makes it easy to hide bugs from a human reader.
You don't need WIDTH, HEIGHT, NX or NY to be constants. There's little difference between looping up to a compile-time constant and a value that gets calculated. So, don't use constants for these, just pass them as arguments. In particular, NX can be computed from WIDTH knowing BINNINGFACTORWIDTH, or vice versa. Your compiler can do constant folding. Keeping constants around that can contradict each other is just a recipe for bugs. DIVISIONFACTOR really doesn't need to exist as a constant at all.

Whoever taught you C++ didn't write C++. This is all how you'd write a C89 program, not a post-C++11 program, as you're otherwise doing.

Instead of #define CONSTANT use constexpr int CONSTANT = 42; to actually get foot-gun-free constants. Expert programmers that decide these algorithms need to run on a variety of scaling factors will instead probably opt to use template parameters, so that they don't have to have different code files with different constant definitions for one binning that does scale-by-3 and one that does scale-by-2.
Don't use malloc/calloc and pointer casts; your code, leaking that memory, is a good illustration of why you shouldn't do that. Instead, a simple std::vector<uint16_t> image(WIDTH*HEIGHT); is easier to read, does the same memory allocation, but also deallocates memory on destruction. You'd have a function void binning(const std::vector<uint16_t>& image, std::vector<uint16_t>& binnedImage) instead of int binning(uint16_t* image, uint16_t* binnedImage).
int i, j, k, l; just... don't. Make loop variables local to their loop. C99 and every version of C++ support that: for (int j=0; j< NY; j++) {… is better, since it shows you where j lives, and in more complex code, it also allows the compiler to reason more easily about optimization of local variables (since it doesn't need to keep them around after the loop).
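
Taken together, these suggestions might look roughly like the sketch below. It is only an illustration under assumptions: the extra `width` parameter is hypothetical (the signature proposed above passes only the two vectors), the input is assumed to be 12-bit, and the bin factors are template parameters so that scale-by-2 and scale-by-3 share one definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical template (not from the answer): BinX and BinY are the bin
// factors; `width` is the row length of the row-major source image.
template <std::size_t BinX, std::size_t BinY>
void binning(const std::vector<std::uint16_t>& image, std::size_t width,
             std::vector<std::uint16_t>& binnedImage)
{
    static_assert(BinX > 0 && BinY > 0, "bin factors must be positive");
    // 12-bit input assumed: the sum of one bin must fit in 32 bits.
    static_assert(BinX * BinY <= 0xFFFFFFFFu / 4095u, "accumulator too small");

    assert(width > 0 && width % BinX == 0 && image.size() % width == 0);
    const std::size_t height = image.size() / width;
    assert(height % BinY == 0);

    const std::size_t nx = width / BinX;  // derived, so it cannot
    const std::size_t ny = height / BinY; // contradict width/height
    binnedImage.assign(nx * ny, 0);       // vector owns and frees the memory

    for (std::size_t j = 0; j < ny; ++j) {
        for (std::size_t i = 0; i < nx; ++i) {
            std::uint32_t sum = 0; // local accumulator, scoped to this bin
            for (std::size_t l = 0; l < BinY; ++l) {
                for (std::size_t k = 0; k < BinX; ++k) {
                    sum += image[(j * BinY + l) * width + (i * BinX + k)];
                }
            }
            binnedImage[j * nx + i] =
                static_cast<std::uint16_t>(sum / (BinX * BinY));
        }
    }
}
```

A call such as `binning<3, 3>(image, 768, binnedImage);` would then cover the question's case without any preprocessor constants.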

edited Apr 22 '23 at 12:00

answered Apr 22 '23 at 11:45

Marcus Müller


Thanks for the suggestions. Yes, you are right, I am a novice in C++; my domain is different (solar physics). We are trying to acquire images from a fast sCMOS camera and do processing. But unlike older CCDs, fast cameras do not have binning support in hardware, and the pixel size is also small. We need software binning because the FFT is faster with shorter arrays and we can live with the degraded resolution. What is a better (and fast) 1/2 scaling operator? – Harsh M Apr 22 '23 at 12:49
What are your actual image sizes? What library are you using to do FFTs of image data? I ask because 768×768 is so small that good scaling down would be pretty efficiently implemented by actually doing fast convolution in the Fourier domain. And, honestly, doing a 2D 768×768 FFT is pretty quick; off the top of my head, it's probably about as fast as your scaling operation. – Marcus Müller Apr 22 '23 at 12:55
We have a camera frame rate of 655 fps, thus my processing must finish in 1.5 ms. For 128x128, the Intel MKL FFT takes 0.3-0.4 ms and for 256x256, it takes 0.9-1 ms. Thus we are binning to 256x256. I also need 0.5 ms to actually perform the action (voltage to actuator to correct the computed shift) after the image shift is computed. Thus binning has to take no more than 0.1-0.3 ms, else I will not keep up with the frame rate. The link in the next comment is my solution using the Intel FFT. – Harsh M Apr 22 '23 at 13:45
https://stackoverflow.com/questions/76063052/how-to-calculate-fft-forward-and-backward-multiple-times-in-fast-frame-rate-a – Harsh M Apr 22 '23 at 13:46
so, this looks like you're doing this all in a single thread, which seems unwise, since you then only make use of a single core of your CPU, while it's perfectly fine to work on multiple images on different CPU cores at the same time. – Marcus Müller Apr 22 '23 at 13:51
It is a feedback loop system, though I have created a separate thread for voltage to actuator, in the hope that while an image is being acquired/processed, the voltage will be pushed to the actuator in parallel. I get a uint16_t pointer from hardware, and I do not know how to give the same pointer to another thread to have my FFT calculation run in parallel. – Harsh M Apr 22 '23 at 14:45
Sorry for not getting back to you earlier! So, indeed, do the optimizations that fabian recommends. Seeing you need to shift around 773 MB of image data per second (assuming your raw input images are 768×768 px), I'm not comfortable enough recommending something without knowing your system in much more depth – in the end, you'd want to get the single-core bottleneck out of the way as quickly as possible, and I'd say you'd distribute with a lockless single-writer-many-readers queue but the time your binning takes just seems unreasonably long, even with the slightly inelegant memory acces – Marcus Müller Apr 22 '23 at 18:34
…access. I'll be honest, scaling, FFT'ing and processing images would be something a GPU would be much better at than a CPU; have you considered offloading parts of your processing to a GPU? Then again, as said, your binning shouldn't take much longer than just moving the full picture worth of memory around (which you would still need to do to get the images onto the GPU); but right now, binning takes much longer. – Marcus Müller Apr 22 '23 at 18:36
Binning is definitely not the highest quality resizing algorithm out there, but it's not as bad as you make it sound. It's essentially the same as using a sensor with bigger pixels. If it causes aliasing that's a problem that should be solved with optical band limiting (blurring) before the image reaches the sensor. – Mark Ransom Apr 23 '23 at 03:29
@MarkRansom there is a difference though, the older CCDs which supported binning up to 4x4 or 8x8 did binning before digitization. In our case, we are using fast sCMOS cameras (each pixel has its own ADU), and do not support binning. Thus we are doing software binning after digitization. – Harsh M Apr 24 '23 at 03:24
@HarshM Mark still has a point there: the fact that the image needs to be sufficiently bandlimited before decimation (i.e., ignoring N-1 out of N pixels in each direction) can be achieved in different ways. It can be done analog adding up of image pixels prior to digitization, it can be done by taking the sum of pixels, and doing that for every Nth position in each direction only, like you do here. Or it can be done by blurring the optical phenomenon. Important is that prior to the step where you "throw away" image signal, the bandwidth of the signal is lower than half of the resulting ... – Marcus Müller Apr 24 '23 at 07:26
... sampling rate (which in image terms is "pixels per dimension"). Why I said binning is not a great scaler: the job of binning is to bandlimit and decimate. The bandlimitation is done with a *moving average*, and that is just a convolution with a rectangle of width N in the spatial domain. Then only every Nth pixel in both directions of that convolution is kept. (Of course, you just don't calculate the values of the convolved image other than the ones you keep.) But we know that convolution with a rectangle in the spatial domain is equivalent to point-wise multiplication with the Fourier transform in the… – Marcus Müller Apr 24 '23 at 07:32
… frequency domain. Now, the rectangle's shape in frequency domain happens to be a sin(x)/x "star" with amplitude maxima every 1/(2N)+k/N, k being every integer. What you set it to do was bandlimitation to the lowest ±1/(2N) of frequencies, and you kind of did that, but you also get all these *sidelobes* from the unwanted maxima which lead to *aliasing* of signal from higher frequencies into your decimated image! – Marcus Müller Apr 24 '23 at 07:39
Blurring done optically *can* have a different shape than such a rectangular cutout filter as this binning. But you can also do better in software, by weighting pixels differently for different relative positions. – Marcus Müller Apr 24 '23 at 07:44
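
To make the sidelobe argument in the comments above concrete, here is a short worked form. It is a sketch under one assumption: the N-pixel moving average is treated as a discrete filter along one axis, with f in cycles per input pixel, which gives the periodic counterpart of the sin(x)/x shape mentioned above.

```latex
H(f) = \frac{1}{N}\sum_{n=0}^{N-1} e^{-j 2\pi f n}
     = e^{-j\pi f (N-1)}\,\frac{\sin(\pi N f)}{N \sin(\pi f)},
\qquad
|H(f)| = \left|\frac{\sin(\pi N f)}{N \sin(\pi f)}\right| .

% Zeros at f = k/N for k = 1, \dots, N-1.
% Keeping every N-th pixel moves the Nyquist limit to f = 1/(2N).
% Example for N = 3 (768 -> 256):
|H(1/6)| = \frac{\sin(\pi/2)}{3\sin(\pi/6)} = \frac{2}{3},
\qquad
|H(1/2)| = \frac{|\sin(3\pi/2)|}{3\sin(\pi/2)} = \frac{1}{3}.
```

So for 3x binning along one axis, content at the new Nyquist limit is only reduced to 2/3 of its amplitude, and content at f = 1/2 still passes at 1/3 before being folded into the decimated image.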
@MarcusMüller Thank you for the insightful comment. It made me read more about it. So as I understand it, we need to blur the image, filtering out the high frequencies which cannot be represented by the final binned dimensions (either optically or in software). Now taking an FFT and doing the filtering will be expensive for me (because, as I have mentioned, I can only spare 0.3 ms at most). Do you think putting the camera slightly off-focus can reduce the aliasing? I can experiment on how much defocus I need. Can you also suggest ways to deduce the different weighting coefficients for binning, as you suggested? – Harsh M Apr 24 '23 at 09:51
Again, I'm a bit surprised a 768×768 FFT should take much longer than doing this in the spatial domain; 0.3 ms is a long time! Yes, off-focusing can work to reduce the bandwidth – but you might also reduce the quality of your observation (the 2D impulse response of out-of-focus imaging, its *point spread function*, depends on the Fourier transform of the optical aperture). So that might make your signal better, or worse. – Marcus Müller Apr 24 '23 at 09:59
@HarshM I did realize that your goal is to do this in software. I was merely pointing out my opinion that doing this in software shouldn't deliver worse results than doing it in hardware. In particular, it should be the same as using a sensor with fewer but larger pixels. – Mark Ransom Apr 25 '23 at 00:17

fabian · Accepted Answer · 2023-04-22T14:24:39.160

If you don't need to keep the original image content, doing one pass of horizontal binning followed by another pass of vertical binning can provide a performance boost of > 50%. (Even if the source cannot hold sufficiently large values, you could use a separately allocated array to store intermediate results; likely you can determine the maximum array size before starting image conversions, so you don't need to allocate memory during the run of the function itself.)

Source

 1  2  3 | 10 11 12 | ...
 4  5  6 | 13 14 15 | ...
 7  8  9 | 16 17 18 | ...
--------------------------- ...
...

Step 1

 6 | 33 | ...
15 | 42 | ...
24 | 51 | ...
--------------------------- ...
...

Step 2

45 | 126 | ...
--------------------------- ...
...

Additional recommendations

Use constexpr variables instead of preprocessor defines.
Add static_asserts, to make sure overflow doesn't happen. Also add static_asserts for stuff that could go wrong when modifying the code without much of a thought like choosing a WIDTH that isn't divisible by NX.
Use a benchmarking framework instead of manually doing the timing. Otherwise aggressive optimization could get rid of the binning call altogether, since the result is never read. Also, doing the timing yourself is hard, since the compiler is allowed to reorder your logic as long as the observable outcome is the same.
Use the fast versions of integral types for intermediate results.

Here is an implementation using Google Benchmark.

#include <benchmark/benchmark.h>

#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <limits>       // std::numeric_limits
#include <type_traits>  // std::remove_cvref_t (C++20)
#include <memory>

namespace
{
struct Deallocator
{
    void operator()(void* mem) const
    {
        std::free(mem);
    }
};

constexpr unsigned WIDTH = 768;
constexpr unsigned HEIGHT = 768;

constexpr unsigned NX = 256;
constexpr unsigned NY = 256;

static_assert(WIDTH % NX == 0);
static_assert(HEIGHT % NY == 0);

constexpr auto BINNINGFACTORWIDTH = WIDTH / NX;
constexpr auto BINNINGFACTORHEIGHT = HEIGHT / NY;

constexpr auto DIVISIONFACTOR = BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT;

class ImageAllocationFixture : public benchmark::Fixture
{
protected:
    std::unique_ptr<uint16_t[], Deallocator> image;
    std::unique_ptr<uint16_t[], Deallocator> binnedImage;
public:
    void SetUp(const ::benchmark::State& state)
    {
        image.reset(static_cast<uint16_t*>(std::malloc(sizeof(uint16_t) * WIDTH * HEIGHT)));

        for (unsigned i = 0; i < WIDTH * HEIGHT; ++i) {
            image[i] = 4095;
        }

        binnedImage.reset(static_cast<uint16_t*>(std::calloc(NX * NY, sizeof(uint16_t))));
    }

    void TearDown(const ::benchmark::State& state)
    {
        image.reset();
        binnedImage.reset();
    }
};

}

namespace with_overflow
{

void binning(uint16_t* image, uint16_t* binnedImage) {
    int i, j, k, l;

    for (j = 0; j < NY; j++) {
        for (i = 0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k = i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l = j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += image[l * HEIGHT + k];
                }
            }
            binnedImage[j * NX + i] /= DIVISIONFACTOR;
        }
    }

}

}

namespace without_overflow
{

void binning(uint16_t* image, uint16_t* binnedImage) {
    int i, j, k, l;

    for (j = 0; j < NY; j++) {
        for (i = 0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k = i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l = j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * HEIGHT + k] / DIVISIONFACTOR);
                }
            }
        }
    }
}

}

namespace bin_separately
{


void binning(uint16_t* const image, uint16_t* const binnedImage)
{
    // just some stuff for static_assert
    using PixelValueType = std::remove_cvref_t<decltype(*image)>;

    constexpr PixelValueType AllOnes = ~static_cast<PixelValueType>(0);
    constexpr unsigned BitCount = 12;
    constexpr uint64_t PixelInValueMax = static_cast<PixelValueType>(~(AllOnes << BitCount)); // use 64 bit to prevent overflow issues
    constexpr uint64_t PixelTypeMax = (std::numeric_limits<PixelValueType>::max)();
    // end static_assert stuff


    {
        // compress horizontally
        static_assert(PixelInValueMax * BINNINGFACTORWIDTH <= PixelTypeMax,
            "cannot compress horizontally without risking overflow");

        auto out = image;
        for (auto inPos = image, end = image + WIDTH * HEIGHT; inPos != end;)
        {
            uint_fast16_t sum = 0;
            for (unsigned i = 0; i != BINNINGFACTORWIDTH; ++i)
            {
                sum += *(inPos++);
            }
            *(out++) = sum;
        }
    }

    {
        // compress vertically, divide and write to out

        //read pointers
        uint16_t* inPoss[BINNINGFACTORHEIGHT];
        for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
        {
            inPoss[i] = image + (NX * i);
        }

        for (auto out = binnedImage, end = binnedImage + NX * NY; out != end;) // for all output rows
        {
            for (auto const rowEnd = out + NX; out != rowEnd;)
            {
                uint_fast16_t sum = 0;

                static_assert(PixelInValueMax * BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT <= (std::numeric_limits<decltype(sum)>::max)(),
                    "type of sum needs replacement, since it cannot hold the result of adding up all source pixels for one target pixel");

                for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
                {
                    sum += *(inPoss[i]++);
                }
                *(out++) = sum / DIVISIONFACTOR;
            }

            // we advanced each pointer by one row -> advance by (BINNINGFACTORHEIGHT - 1) more
            for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
            {
                inPoss[i] += NX * (BINNINGFACTORHEIGHT - 1);
            }
        }
    }

}

}

BENCHMARK_F(ImageAllocationFixture, WithOverflow)(benchmark::State& st)
{
    for (auto _ : st)
    {
        with_overflow::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_F(ImageAllocationFixture, WithoutOverflow)(benchmark::State& st)
{
    for (auto _ : st)
    {
        without_overflow::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_F(ImageAllocationFixture, BinSeparately)(benchmark::State& st)
{
    for (auto _ : st)
    {
        bin_separately::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_MAIN();

Output for my compiler/machine (MSVC 19.34.31937 x86_64, /O2)

Run on (12 X 3593 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 512 KiB (x6)
  L3 Unified 16384 KiB (x2)
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
ImageAllocationFixture/WithOverflow        996287 ns       941685 ns          896
ImageAllocationFixture/WithoutOverflow    1171365 ns      1098633 ns          640
ImageAllocationFixture/BinSeparately       350364 ns       353021 ns         2036
