Fast vectorized pixel-wise operations on images

Question

I want to measure the similarity between two grayscale images of the same size using mean squared error. I can't use any framework that is not part of the macOS SDK (e.g. OpenCV, Eigen). A simple implementation of this algorithm without vectorization looks like this:

vImage_Buffer imgA;
vImage_Buffer imgB;

NSUInteger mse = 0;

unsigned char *pxlsA = (unsigned char *)imgA.data;
unsigned char *pxlsB = (unsigned char *)imgB.data;

for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
    NSInteger d = (NSInteger)pxlsA[i] - (NSInteger)pxlsB[i];
    mse += d * d;
}

Is there a way to do this without a loop, in a more vectorized way? Maybe something like:

mse = ((imgA - imgB) ^ 2).sum();

I don't speak objective-c, but isn't `sqrt(d * d)` just `abs(d)`? Also, I am pretty sure MSE doesn't need `sqrt` — Vlad Feinstein, Oct 28 '20 at 00:25
What do you expect such vectorization to do? Loops are not slow in compiled languages. A statement like `imgA - imgB` requires an overloaded minus operator that contains a loop just like yours. In interpreted languages this can be a lot faster, because the loop then is compiled, but that doesn’t play a role here. In compiled languages “vectorization” often refers to using SIMD processor instructions. If you turn on the machine-specific optimizations in your compiler, it will use those if available. — Cris Luengo, Oct 28 '20 at 01:25
@CrisLuengo I hope that vectorization will increase the speed of execution. I don't understand what is going on under the hood of vectorization, but I have a feeling that it has to increase the speed. I will read about SIMD, thank you. — borista, Oct 28 '20 at 11:47
If I was going to do this, and needed high performance, I’d do it using a Core Image filter - but this is not an easy path to take. Even with Accelerate, which has a C interface, it’s going to require some effort on your part. — David H, Oct 28 '20 at 11:56
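
For reference, the point about compiler auto-vectorization can be tried with a plain C loop. The sketch below is an illustration and not code from the thread: the function name `mse_u8` and its parameters are made up here, it takes raw pointers plus row strides (like `rowBytes` in `vImage_Buffer`) so that any row padding is skipped, and it divides by the pixel count to return the mean rather than the raw sum. Compilers such as Clang can usually vectorize a loop of this shape when optimization is enabled (for example `-O2` or `-O3`), though whether that happens depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean squared error of two 8-bit grayscale images of equal size.
   a_row_bytes / b_row_bytes are the distances in bytes between rows
   (hypothetical parameters mirroring vImage_Buffer.rowBytes). */
static double mse_u8(const uint8_t *a, size_t a_row_bytes,
                     const uint8_t *b, size_t b_row_bytes,
                     size_t width, size_t height)
{
    assert(a != NULL && b != NULL && width > 0 && height > 0);

    uint64_t total = 0; /* integer sum of squared differences, exact */
    for (size_t y = 0; y < height; ++y) {
        const uint8_t *row_a = a + y * a_row_bytes;
        const uint8_t *row_b = b + y * b_row_bytes;
        for (size_t x = 0; x < width; ++x) {
            /* Widen to signed int before subtracting so negative
               differences are handled correctly. */
            int d = (int)row_a[x] - (int)row_b[x];
            total += (uint64_t)(d * d);
        }
    }
    return (double)total / ((double)width * (double)height);
}
```

With the buffers from the question this could be called as `mse_u8(imgA.data, imgA.rowBytes, imgB.data, imgB.rowBytes, imgA.width, imgA.height)`.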

score 1 · Answer 1 · answered Oct 29 '20 at 17:39

The answer to this question lies in the vDSP library, which is part of the macOS SDK: https://developer.apple.com/documentation/accelerate/vdsp

vDSP - Perform basic arithmetic operations and common digital signal processing routines on large vectors.

In my case the vectors aren't really that large, but still.

First, you need to convert the unsigned char * data to float *. This is a significant step, and I don't know how to do it without a loop. Then you need two vDSP functions: vDSP_vsbsbm and vDSP_sve.
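
One way the conversion might be done without a hand-written loop is `vDSP_vfltu8`, which vDSP provides for converting unsigned 8-bit integers to single-precision floats. A minimal sketch follows; the wrapper name `pixels_to_float` is hypothetical, and the `#else` branch is only a stand-in so that the snippet also compiles where Accelerate is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Convert n 8-bit pixels to float (hypothetical helper).
   dst must have room for n floats. */
static void pixels_to_float(const unsigned char *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
#if defined(__APPLE__)
    /* Unit stride on both input and output. */
    vDSP_vfltu8(src, 1, dst, 1, (vDSP_Length)n);
#else
    /* Fallback for platforms without Accelerate (assumption: only
       here so the sketch builds elsewhere). */
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (float)src[i];
    }
#endif
}
```

On macOS the file needs to be linked against the Accelerate framework (`-framework Accelerate`).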

vDSP_vsbsbm - Multiplies the difference of two single-precision vectors by a second difference of two single-precision vectors.

vDSP_sve - Calculates the sum of values in a single-precision vector.
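
To make the two descriptions concrete, here is a scalar reference of what the calls compute when every stride is 1. It only illustrates the arithmetic under that assumption; the names `ref_vsbsbm` and `ref_sve` are invented for this sketch and say nothing about how vDSP implements the routines internally.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = (a[i] - b[i]) * (c[i] - d[i]), unit strides assumed. */
static void ref_vsbsbm(const float *a, const float *b,
                       const float *c, const float *d,
                       float *out, size_t n)
{
    assert(a && b && c && d && out);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (a[i] - b[i]) * (c[i] - d[i]);
    }
}

/* Returns v[0] + v[1] + ... + v[n-1], unit stride assumed. */
static float ref_sve(const float *v, size_t n)
{
    assert(v != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += v[i];
    }
    return sum;
}
```

Passing the same two vectors as both pairs, `ref_vsbsbm(a, b, a, b, out, n)`, turns the product of differences into a squared difference, so the sum of `out` is the sum of squared errors.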

So the final code looks like this:

float *fpxlsA = (float *)malloc(imgA.height * imgA.width * sizeof(float));
float *fpxlsB = (float *)malloc(imgB.height * imgB.width * sizeof(float));
float *output = (float *)malloc(imgB.height * imgB.width * sizeof(float));

for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
    fpxlsA[i] = (float)(pxlsA[i]);
    fpxlsB[i] = (float)(pxlsB[i]);
}    

vDSP_vsbsbm(fpxlsA, 1, fpxlsB, 1, fpxlsA, 1, fpxlsB, 1, output, 1, imgA.height * imgA.width);
float sum;
vDSP_sve(output, 1, &sum, imgA.height * imgA.width);

free(output);
free(fpxlsA);
free(fpxlsB);

This code does exactly what I wanted, in a more vectorized form, but the result isn't good enough. Comparing the performance of the two approaches, vDSP is about two times faster than the loop if no additional memory allocation is needed. In reality, where the additional allocation does take place, the loop approach is slightly faster.
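
One possible way to avoid the large temporary allocations, sketched here under assumptions rather than measured: convert and compare the pixels in fixed-size chunks that live on the stack, and let `vDSP_distancesq` (the sum of squared differences of two float vectors) replace the `vDSP_vsbsbm` + `vDSP_sve` pair so that no output buffer is needed. The function name `mse_chunked`, the chunk size of 4096 and the non-Apple fallback branch are all illustrative choices, and the pixel data is assumed to be contiguous (`rowBytes == width`). No speedup is claimed; it would need to be benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

enum { CHUNK = 4096 }; /* illustrative size: two float buffers = 32 KB of stack */

/* Mean squared error over n contiguous 8-bit pixels (hypothetical helper). */
static double mse_chunked(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(a != NULL && b != NULL && n > 0);

    float fa[CHUNK];
    float fb[CHUNK];
    double total = 0.0; /* chunk results are accumulated in double */

    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        float partial = 0.0f;
#if defined(__APPLE__)
        vDSP_vfltu8(a + off, 1, fa, 1, (vDSP_Length)len);
        vDSP_vfltu8(b + off, 1, fb, 1, (vDSP_Length)len);
        /* partial = sum over i of (fa[i] - fb[i])^2 */
        vDSP_distancesq(fa, 1, fb, 1, &partial, (vDSP_Length)len);
#else
        /* Fallback so the sketch builds without Accelerate (assumption). */
        for (size_t i = 0; i < len; ++i) {
            fa[i] = (float)a[off + i];
            fb[i] = (float)b[off + i];
            float d = fa[i] - fb[i];
            partial += d * d;
        }
#endif
        total += (double)partial;
    }
    return total / (double)n;
}
```

Accumulating the per-chunk results in a `double` keeps single-precision rounding confined to one chunk at a time, which matters because a `float` running sum over a whole image can lose low-order bits.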

score 0 · Answer 2 · answered Oct 28 '20 at 00:37

This appears to be part of macOS: https://developer.apple.com/documentation/accelerate

Vlad Feinstein


skaak · Answer 3 · 2020-10-28T13:05:11.740

A nice, fast way to loop using pointer arithmetic would be as follows:

int d;

size_t i = imgA.height * imgA.width;

while ( i -- )
{
  d = ( int )(*pxlsA++) - ( int )(*pxlsB++);
  mse += d * d;
}

EDIT

Oops, since those are unsigned chars and we calculate the difference, we need to use signed integers to do so.

And another edit: must use pxls... here, I don't know what img... is.

edited Oct 28 '20 at 13:05

answered Oct 28 '20 at 12:44


AFAIK C array indexing is pointer arithmetic under the hood, so the speed will be the same – borista Oct 28 '20 at 13:00
Yes this will be close to the fastest ... I was thinking maybe with bitwise operators you can go faster but I doubt it. – skaak Oct 28 '20 at 13:01
Seems like the approach I was looking for is SIMD instructions, as Cris Luengo mentioned. I will read about them and write an answer here. – borista Oct 28 '20 at 15:30
