The answer to this question is stored in vDSP library, which is part of macOS SDK.
https://developer.apple.com/documentation/accelerate/vdsp
vDSP - Perform basic arithmetic operations and common digital signal processing routines on large vectors.
In my situation I have not really big vectors, but still.
Firstly, you need to convert unsigned char *
to float *
, and btw it is a significant moment, I don't know how to do this not in loop. Then you need two vDSP function: vDSP_vsbsbm
and vDSP_sve
.
vDSP_vsbsm - Multiplies the difference of two single-precision vectors by a second difference of two single-precision vectors.
vDSP_sve - Calculates the sum of values in a single-precision vector.
So the final code looks like that:
float *fpxlsA = (float *)malloc(imgA.height * imgA.width * sizeof(float));
float *fpxlsB = (float *)malloc(imgB.height * imgB.width * sizeof(float));
float *output = (float *)malloc(imgB.height * imgB.width * sizeof(float));
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
fpxlsA[i] = (float)(pxlsA[i]);
fpxlsB[i] = (float)(pxlsB[i]);
}
vDSP_vsbsbm(fpxlsA, 1, fpxlsB, 1, fpxlsA, 1, fpxlsB, 1, output, 1, imgA.height * imgB.width);
float sum;
vDSP_sve(output, 1, &sum, imgA.height * imgB.width);
free(output);
free(fpxlsA);
free(fpxlsB);
So, this code did exactly what I wanted and in a more vectorized form. But the result isn't good enough. Comparing performances of the loop approach and vDSP approach, vDSP is two times faster if there isn't any additional memory allocation. But in reality, where additional memory allocation takes place, loop approach is slightly faster.