7

I use vImageConvert_RGB888toPlanar8 and vImageConvert_Planar8toRGB888 from Accelerate.framework to convert RGB24 to BGR24, but when the data to convert is very large, such as 3M or 4M, the conversion takes about 10ms. Does anyone know a faster approach? My code looks like this:

- (void)transformRGBToBGR:(const UInt8 *)pict {
    rgb.data = (void *)pict;

    // Split the interleaved RGB data into three planar buffers.
    vImage_Error error = vImageConvert_RGB888toPlanar8(&rgb, &red, &green, &blue, kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageConvert_RGB888toPlanar8 error");
    }

    // Re-interleave the planes with red and blue swapped.
    error = vImageConvert_Planar8toRGB888(&blue, &green, &red, &bgr, kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageConvert_Planar8toRGB888 error");
    }

    free((void *)pict);
}
Cameron Lowell Palmer
zhzhy
  • When I saw your title I instantly thought of Accelerate.framework. But since you already use it, I think there is no better way to do such a thing on iOS. – iGranDav Jul 27 '12 at 08:15
  • @iGranDav: using this framework the way the OP does is no guarantee of high speed. There's too much copying of the data. See my answer; it has detailed explanations and a link to the ARM site for this exact task. – Viktor Latypov Jul 27 '12 at 09:08
  • @iGranDav this planar8 call is totally wrong. If you want to swap bytes you should use permute. – Cameron Lowell Palmer Feb 25 '15 at 14:56

2 Answers

8

With an RGB888ToPlanar8 call you scatter the data and then gather it again, so every byte is copied twice. This is very inefficient. If a memory overhead of 33% is affordable, try using the RGBA format and permute the B/R bytes in place.
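As a rough illustration of the in-place RGBA idea, here is a minimal plain-C sketch. The function name `swap_rb_rgba_inplace` and the tightly packed 4-bytes-per-pixel layout are assumptions made for this example, not part of any framework API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swaps the first and third byte of every 4-byte pixel
   (R <-> B) in place. Assumes a tightly packed buffer with no row padding. */
void swap_rb_rgba_inplace(uint8_t *pixels, size_t num_pixels)
{
    for (size_t i = 0; i < num_pixels; i++) {
        uint8_t *p = pixels + 4 * i;
        uint8_t tmp = p[0];   /* keep R */
        p[0] = p[2];          /* B moves to byte 0 */
        p[2] = tmp;           /* R moves to byte 2 */
        /* p[1] (G) and p[3] (A) are left untouched */
    }
}
```

Because each pixel occupies exactly one aligned 32-bit word, a loop like this is easy for a compiler or a vector unit to process several pixels at a time, which is the point of accepting the extra byte per pixel.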

If you want to save that 33%, then I suggest the following. Iterate over all the pixels, but read them in multiples of 4 bytes (since lcm(3,4) is 12, that is 3 dwords, or 4 pixels, at a time).

uint8_t* src_image;
uint8_t* dst_image;

uint32_t* src = (uint32_t*)src_image;
uint32_t* dst = (uint32_t*)dst_image;

uint32_t v1, v2, v3;
uint32_t nv1, nv2, nv3;
// each iteration handles 12 bytes, i.e. 4 pixels
for(int i = 0 ; i < num_pixels / 4 ; i++)
{
     // read 12 bytes
     v1 = *src++;
     v2 = *src++;
     v3 = *src++;
     // shuffle bits in the pixels (the shifts assume big-endian word loads)
     // [R1 G1 B1 R2 | G2 B2 R3 G3 | B3 R4 G4 B4]
     nv1 = // [B1 G1 R1 B2]
      ((v1 & 0x0000FF00) << 16) | (v1 & 0x00FF0000) | ((v1 >> 16) & 0x0000FF00) | ((v2 >> 16) & 0xFF);
     nv2 = // [G2 R2 B3 G3]
       ...
     nv3 = // [R3 B4 G4 R4]
       ...
     // write 12 bytes
     *dst++ = nv1;
     *dst++ = nv2;
     *dst++ = nv3;
}
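For a little-endian CPU (the usual default on ARM), a complete version of this 12-byte loop might look like the sketch below. The function name `rgb24_to_bgr24_words` is hypothetical, the masks are worked out for little-endian word loads only, and `memcpy` is used for the loads and stores so the code does not depend on pointer alignment. Leftover pixels that do not fill a 12-byte block are handled byte by byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: RGB24 -> BGR24, 4 pixels (12 bytes) per iteration.
   Assumes a little-endian CPU. dst may equal src (in-place conversion),
   but the buffers must not partially overlap. */
void rgb24_to_bgr24_words(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    size_t blocks = num_pixels / 4;
    for (size_t i = 0; i < blocks; i++) {
        uint32_t v1, v2, v3;
        /* memory: [R1 G1 B1 R2 | G2 B2 R3 G3 | B3 R4 G4 B4] */
        memcpy(&v1, src, 4);
        memcpy(&v2, src + 4, 4);
        memcpy(&v3, src + 8, 4);

        /* [B1 G1 R1 B2] */
        uint32_t nv1 = ((v1 >> 16) & 0x000000FFu) | (v1 & 0x0000FF00u)
                     | ((v1 & 0x000000FFu) << 16) | ((v2 & 0x0000FF00u) << 16);
        /* [G2 R2 B3 G3] */
        uint32_t nv2 = (v2 & 0x000000FFu) | ((v1 >> 16) & 0x0000FF00u)
                     | ((v3 & 0x000000FFu) << 16) | (v2 & 0xFF000000u);
        /* [R3 B4 G4 R4] */
        uint32_t nv3 = ((v2 >> 16) & 0x000000FFu) | ((v3 >> 16) & 0x0000FF00u)
                     | (v3 & 0x00FF0000u) | ((v3 & 0x0000FF00u) << 16);

        memcpy(dst, &nv1, 4);
        memcpy(dst + 4, &nv2, 4);
        memcpy(dst + 8, &nv3, 4);
        src += 12;
        dst += 12;
    }
    /* tail: the remaining 0..3 pixels, one byte triple at a time */
    for (size_t i = 0; i < num_pixels % 4; i++) {
        uint8_t r = src[0], g = src[1], b = src[2];
        dst[0] = b;
        dst[1] = g;
        dst[2] = r;
        src += 3;
        dst += 3;
    }
}
```

Note that the count passed in is a pixel count, not a byte count; the function derives the number of 12-byte blocks itself.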

NEON intrinsics can do even better.

See this link from ARM's website to see how the 24-bit swapping is done.

The BGR-to-RGB conversion can be done in place like this:

void neon_asm_convert_BGR_TO_RGB(uint8_t* img, int numPixels24)
{
    // numPixels24 is the number of 24-byte blocks (8 pixels each),
    // i.e. width * height * 3 / 24; it must be greater than zero
    __asm__ volatile(
        "0:                \n"
        "# load 3 64-bit regs with interleave: \n"
        "vld3.8      {d0,d1,d2}, [%0]   \n"
        "# swap d0 and d2 - R and B\n"
        "vswp d0, d2   \n"
        "# store 3 64-bit regs: \n"
        "vst3.8      {d0,d1,d2}, [%0]!      \n"
        "subs        %1, %1, #1       \n"
        "bne         0b            \n"
        // both operands are modified by the loop, so they are read-write
        : "+r"(img), "+r"(numPixels24)
        :
        : "d0", "d1", "d2", "cc", "memory"
     );
}
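The same swap can also be written with NEON intrinsics instead of inline assembly. The following is only a sketch: the function name `swap_rb_rgb24_inplace` is made up for this example, it takes a pixel count (width * height) rather than a block count, and it falls back to a plain byte loop for the leftover pixels or when NEON is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: swaps R and B in a packed 24-bit image, in place.
   num_pixels is width * height (pixels, not bytes). */
void swap_rb_rgb24_inplace(uint8_t *img, size_t num_pixels)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* 8 pixels (24 bytes) per iteration, like the vld3.8 / vst3.8 loop */
    for (; i + 8 <= num_pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(img + 3 * i);  /* de-interleave 3 channels */
        uint8x8_t tmp = px.val[0];
        px.val[0] = px.val[2];                  /* swap first and third channel */
        px.val[2] = tmp;
        vst3_u8(img + 3 * i, px);               /* re-interleave and store */
    }
#endif
    /* scalar tail (or the whole image when NEON is unavailable) */
    for (; i < num_pixels; i++) {
        uint8_t *p = img + 3 * i;
        uint8_t tmp = p[0];
        p[0] = p[2];
        p[2] = tmp;
    }
}
```

Taking the pixel count and deriving the block size inside the function avoids having to pass width * height * 3 / 24 by hand, and the tail loop covers images whose pixel count is not a multiple of 8.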
Viktor Latypov
  • Sadly, in my test the approach of iterating over all the pixels and reading in multiples of 4 bytes (3 dwords at a time) is slower than `RGB888ToPlanar8`; it takes twice as long. In my experiment `RGB888ToPlanar8` takes 10ms and the other approach needs 20ms. – zhzhy Jul 27 '12 at 11:03
  • Then use the assembly version; the C version is _not_ a competitor to the Accelerate framework – Viktor Latypov Jul 27 '12 at 11:09
  • Sorry for the errors in the bit-shifting. ARM supports both BE and LE, but by default it is usually little-endian, and my C code is definitely big-endian. – Viktor Latypov Jul 27 '12 at 12:04
  • `nv1 = ((v1 >> 16) & 0x0000FF) | (v1 & 0x0000FF00) | ((v1 << 16) & 0x00FF0000) | ((v2 << 16) & 0xFF000000);` `nv2 = ((v1 >> 16) & 0x0000FF00) | (v2 & 0xFF) | (v2 & 0xFF000000)| (v3 << 16 & 0x00FF0000);` `nv3 = ((v2 >> 16) & 0x000000FF) | (v3 >> 16 & 0x0000FF00) | (v3 << 16 & 0xFF000000) | (v3 | 0x00FF0000);` Code like this doesn't work. Why? The output color is wrong. – zhzhy Jul 28 '12 at 06:36
  • I don't know ARM assembly, so I call the function like this: `[self neon_asm_convert_BGR_TO_RGB:(UInt8 *)pict numPixel:videoWidth * videoHeight / 24 ];` but the color is not right. – zhzhy Jul 28 '12 at 06:48
  • Yes, I double-increment the destination pointer. I'll fix that. See the edit. – Viktor Latypov Jul 28 '12 at 10:32
  • Just a "!" sign at the end of vld3.8 instruction – Viktor Latypov Jul 28 '12 at 10:32
  • It greatly improves performance; now it needs just 2ms. But only half of the image is converted correctly, and the color of the rest is not converted. Why? – zhzhy Jul 30 '12 at 01:08
  • It works. The assembly function should be called like this: `[self neon_asm_convert_BGR_TO_RGB:(UInt8 *)pict numPixel:videoWidth * videoHeight * 3/ 24 ];`, not `[self neon_asm_convert_BGR_TO_RGB:(UInt8 *)pict numPixel:videoWidth * videoHeight / 24 ];` But the time needed is 8ms, while calling Accelerate.framework costs 10ms. Obviously the performance has improved! – zhzhy Jul 30 '12 at 02:08
0

Just swap the channels - BGRA to RGBA

- (void)convertBGRAFrame:(const CLPBasicVideoFrame &)bgraFrame toRGBA:(CLPBasicVideoFrame &)rgbaFrame
{
    vImage_Buffer bgraImageBuffer = {
        .width = bgraFrame.width,
        .height = bgraFrame.height,
        .rowBytes = bgraFrame.bytesPerRow,
        .data = bgraFrame.rawPixelData
    };

    vImage_Buffer rgbaImageBuffer = {
        .width = rgbaFrame.width,
        .height = rgbaFrame.height,
        .rowBytes = rgbaFrame.bytesPerRow,
        .data = rgbaFrame.rawPixelData
    };

    const uint8_t byteSwapMap[4] = { 2, 1, 0, 3 };

    vImage_Error error;
    error = vImagePermuteChannels_ARGB8888(&bgraImageBuffer, &rgbaImageBuffer, byteSwapMap, kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"%s, vImage error %zd", __PRETTY_FUNCTION__, error);
    }
}
Cameron Lowell Palmer