How to enhance this YUV420P to RGB conversion in C/C++?

Question

I'm trying to write an YUV420P to RGB888 for when I have the entire thing as one giant buffer with Y (of size width*height) then Cr (of size width*height/4) then Cb (of size width*height/4). The output should be an RGB buffer with size width*height*3.

I think my function below is very inefficient. For example, I use the ceiling function (shouldn't it return an int? In my case it's returning a double, why?) and I've never seen any color conversion function use this function. But this is the way I found to get the corresponding Cr and Cb to each Y.

JNIEXPORT void JNICALL Java_com_example_mediacodecdecoderexample_YuvToRgb_YUVtoRBGA2(JNIEnv * env, jobject obj, jbyteArray yuv420sp, jint width, jint height, jbyteArray rgbOut)
{
    //ITU-R BT.601 conversion
    //
    //     R = 1.164*(Y-16)+1.596*(Cr-128)
    //     G = 1.164*(Y-16)-0.392*(Cb-128)-0.813*(Cr-128)
    //     B = 1.164*(Y-16)+2.017*(Cb-128)
    //
    int Y;
    int Cr;
    int Cb;
    int R;
    int G;
    int B;
    int size = width * height;
    //After width*height luminance values we have the Cr values
    size_t CrBase = size;
    //After width*height luminance values + width*height/4 we have the Cb values
    size_t CbBase = size + width*height/4;
    jbyte *rgbData = (jbyte*) ((*env)->GetPrimitiveArrayCritical(env, rgbOut, 0));
    jbyte* yuv = (jbyte*) (*env)->GetPrimitiveArrayCritical(env, yuv420sp, 0);

    for (int i=0; i<size; i++) {
        Y  = rgbData[i] - 16;
        Cr = rgbData[CrBase + ceil(i/4)]  - 128;
        Cb = rgbData[CbBase + ceil(i/4)]  - 128;
        R = 1.164*Y+1.596*Cr;
        G = 1.164*Y-0.392*Cb-0.813*Cr;
        B = 1.164*Y+2.017*Cb;
        yuv[i*3] = R;
        yuv[i*3+1] = G;
        yuv[i*3+2] = B;
    }

    (*env)->ReleasePrimitiveArrayCritical(env, rgbOut, rgbData, 0);
    (*env)->ReleasePrimitiveArrayCritical(env, yuv420sp, yuv, 0);
}

I'm doing this because I haven't found a function that does exactly this and I need one for a MediaCodec decoded buffer. But even if there's one, I'd like to know what can be done to improve my function, just to learn.

UPDATE:

I modified the code based on the answer below in order for it to work with ByteBuffer:

JNIEXPORT void JNICALL Java_com_lucaszanella_mediacodecdecoderexample_YuvToRgb_YUVtoRBGA2(JNIEnv * env, jobject obj, jobject yuv420sp, jint width, jint height, jobject rgbOut)
{
    //ITU-R BT.601 conversion
    //
    //     R = 1.164*(Y-16)+1.596*(Cr-128)
    //     G = 1.164*(Y-16)-0.392*(Cb-128)-0.813*(Cr-128)
    //     B = 1.164*(Y-16)+2.017*(Cb-128)
    //

    char *rgbData = (char*)(*env)->GetDirectBufferAddress(env, rgbOut);
    char *yuv = (char*)(*env)->GetDirectBufferAddress(env, yuv420sp);

    const int size = width * height;

    //After width*height luminance values we have the Cr values
    const size_t CrBase = size;

    //After width*height luminance values + width*height/4 we have the Cb values
    const size_t CbBase = size + width*height/4;

    for (int i=0; i<size; i++) {
        int Y  = yuv[i] - 16;
        int Cr = yuv[CrBase + i/4]  - 128;
        int Cb = yuv[CbBase + i/4]  - 128;

        double R = 1.164*Y+1.596*Cr;
        double G = 1.164*Y-0.392*Cb-0.813*Cr;
        double B = 1.164*Y+2.017*Cb;

        rgbData[i*3] = (R > 255) ? 255 : ((R < 0) ? 0 : R);
        rgbData[i*3+1] = (G > 255) ? 255 : ((G < 0) ? 0 : G);
        rgbData[i*3+2] = (B > 255) ? 255 : ((B < 0) ? 0 : B);
    }
}

however it's crashing. I don't see anything being written outside of boundary. Anyone have any idea?

UPDATE:

Code above works if we call it with a direct byte buffer. Won't work if the buffer is not direct.

Added

    if (rgbData==NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "TRACKERS", "%s", "RGB data null");
    }

    if (yuv==NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "TRACKERS", "%s", "yuv data null");
    }
    if (rgbData==NULL || yuv==NULL) {
        return;
    }

for safety.

Anyways, color is not correct:

Is it just me, but but shouldn't you be reading from the yuv array and writing to the rgbData array? Your code does the opposite. Otherwise, I think you might be exceeding an array size. Your code is writing back RGB bytes to the `yuv` array that backs the java array`yuv240sp`. — selbie, Jul 12 '20 at 00:57
Since you're interested in speed/efficiency ... Never use floating point for YCbCr conversion to RGB. Scaled integer arithmetic is faster and is what commercial products use (e.g.) `int R = ((1164 * Y) + (1596 * Cr)) / 1000;` — Craig Estey, Jul 12 '20 at 02:01
As this is for education, you may want to look into OpenGL ES shaders, or perhaps Renderscript, [Converting camera YUV-data to ARGB with Renderscript](https://stackoverflow.com/q/13509360/295004). — Morrison Chang, Jul 12 '20 at 05:20
@MorrisonChang thanks, I already know how to do shader color conversion, but I need a C version just to test some things — Guerlando OCs, Jul 12 '20 at 05:34
You color issue is likely because you have the byte order of your input or output array reversed. Some codecs actually swap RGB to be BGR. Or it's possibly you got Cr and Cb reversed. — selbie, Jul 12 '20 at 07:16

selbie · Accepted Answer · 2020-07-12T00:57:04.990

2

Is it just me, but but shouldn't you be reading from the yuv array and writing to the rgbData array? You actually have it reversed in your implementation.

There's not need to invoke ceil on an integer expression such as i/4. And when you implement an image processing route, invoking a function call on every pixel is just going to kill performance (been there, done that). Maybe the compiler can optimize it out, but why take that chance.

So change this:

    Cr = rgbData[CrBase + ceil(i/4)]  - 128;
    Cb = rgbData[CbBase + ceil(i/4)]  - 128;

To this:

    Cr = rgbData[CrBase + i/4]  - 128;
    Cb = rgbData[CbBase + i/4]  - 128;

The only other thing to be wary of is that you may want to clamp R, G, and B to be in the 8-bit byte range before assigning back to the yuv array. Those math equations can produce results < 0 and > 255.

Another micro-optimization is to declare all your variables within the for-loop block so the compiler has more hints about optimizing on it as temporaries. And declaring some of your other constants as const May I suggest:

JNIEXPORT void JNICALL Java_com_example_mediacodecdecoderexample_YuvToRgb_YUVtoRBGA2(JNIEnv * env, jobject obj, jbyteArray yuv420sp, jint width, jint height, jbyteArray rgbOut)
{
    //ITU-R BT.601 conversion
    //
    //     R = 1.164*(Y-16)+1.596*(Cr-128)
    //     G = 1.164*(Y-16)-0.392*(Cb-128)-0.813*(Cr-128)
    //     B = 1.164*(Y-16)+2.017*(Cb-128)
    //
    const int size = width * height;
    //After width*height luminance values we have the Cr values

    const size_t CrBase = size;
    //After width*height luminance values + width*height/4 we have the Cb values

    const size_t CbBase = size + width*height/4;

    jbyte *rgbData = (jbyte*) ((*env)->GetPrimitiveArrayCritical(env, rgbOut, 0));
    jbyte* yuv= (jbyte*) (*env)->GetPrimitiveArrayCritical(env, yuv420sp, 0);

    for (int i=0; i<size; i++) {
        int Y  = yuv[i] - 16;
        int Cr = yuv[CrBase + i/4]  - 128;
        int Cb = yuv[CbBase + i/4]  - 128;

        int R = 1.164*Y+1.596*Cr;
        int G = 1.164*Y-0.392*Cb-0.813*Cr;
        int B = 1.164*Y+2.017*Cb;

        rgbData[i*3] = (R > 255) ? 255 : ((R < 0) ? 0 : R);
        rgbData[i*3+1] = (G > 255) ? 255 : ((G < 0) ? 0 : G);
        rgbData[i*3+2] = (B > 255) ? 255 : ((B < 0) ? 0 : B);
    }

    (*env)->ReleasePrimitiveArrayCritical(env, rgbOut, rgbData, 0);
    (*env)->ReleasePrimitiveArrayCritical(env, yuv420sp, yuv, 0);
}

Then the only left to do is just to compile with max optimizations on. The compiler will take care of the rest.

After that, investigating SIMD optimizations, which some compilers offer as a compiler switch (or enabled via pragma).

edited Jul 12 '20 at 00:57

answered Jul 12 '20 at 00:44

selbie

100,020
15
103
173

One thing I missed: `const size_t CbBase = size + width*height/4;`, please change so no one copies the wrong code :) – Guerlando OCs Jul 12 '20 at 00:47
Ok I made a modification as a new answer which makes everything work with ByteBuffer instead of `byte[]` however it's crashing somewhere. If you have an idea I'd happily open a bounty in 2 days to reward you, otherwise someone might come up with a solution. I'm thinking but we're not acessing things out of bound here – Guerlando OCs Jul 12 '20 at 01:08
Where is it crashing? What line and what does the exception record say? – selbie Jul 12 '20 at 01:38
Can you validate the length of `rgbOut` has >= `3*width*height elements` in it before doing the main loop? And that your pointers aren't null? – selbie Jul 12 '20 at 01:40
indeed `rgbData` is null but it does not make sense because I do `ByteBuffer frame = ByteBuffer.allocate(outputFrame.width*outputFrame.height*3); YuvToRgb.YUVtoRBGA2(outputFrame.byteBuffer, outputFrame.width, outputFrame.height, frame);` – Guerlando OCs Jul 12 '20 at 02:09
Ok, changing to `allocateDirect` worked, however image is not really colorful, gonna post the output – Guerlando OCs Jul 12 '20 at 02:13
Please take a look at the image. Also, it's not in realtime, and it's an snapdragon 815, so slower phones will be really slow – Guerlando OCs Jul 12 '20 at 02:16
The average time for color conversion is 77000 microseconds and `1 second /77000 microseconds = 12.99` so it looks like it's not sufficient and much less sufficient in slower phones – Guerlando OCs Jul 12 '20 at 02:24
The image looks like you got U and V mixed up. Either that, or you RGB backwards. Some codecs expect BGR. – selbie Jul 12 '20 at 04:56
Allocating a buffer per frame will also be a performance killer. I suspect your input arrays are also getting re-allocated per frame. – selbie Jul 12 '20 at 04:58
`rgba.rgb = texture(samplerX, TexCoord).rgb; FragColor = rgba` and I'm filling with `GLES20.glTexSubImage2D(GLES20.GL_TEXTURE_2D, 0, 0, 0, outputFrame.width, outputFrame.height, GLES20.GL_RGB, GLES20.GL_UNSIGNED_BYTE, frame);` – Guerlando OCs Jul 12 '20 at 05:52
I don't know what that means. But what I'm trying to suggest is that something about your code, outside of the core yuv to rgb loop, is capping your frames/sec. I suspect you're allocating new buffers, arrays, or texture objects in between each frame instead of re-using the ones you have. Because the core loop of iterating over each pixel isn't going to get much faster. – selbie Jul 12 '20 at 07:14
Gonna check on that, but the 77000 microseconds for color conversion aren't clearly too much? With that, in the best scenario I'd get 13 fps – Guerlando OCs Jul 12 '20 at 07:44
How did you benchmark 77000 microseconds per frame? Did you measure it over the span of multiple calls? I wouldn't take any benchmark allocation seriously unless if was compiled with `-O3` and measured with at least 100 frames to account for variance. – selbie Jul 12 '20 at 07:47
It was not a really accurate but I left it running and it wont ever get below 70000 or above 120000, I think its enough just to have an idea of the magnitude – Guerlando OCs Jul 12 '20 at 09:01
I'm sure that this would be way more suitable for https://codereview.stackexchange.com/ – Kamiccolo Jul 17 '20 at 01:05
Thanks for accepting. I had been thinking about the performance issues you mentioned. One thing I realized is that `Cb` and `Cr` are getting recomputed mostly redundantly between loop iterations. You only need to compute `1.596*Cr`, `392*Cb-0.813*Cr`, and `2.017*Cb` on every 4th pixel. I would modify the loop to compute 4 pixels per loop iteration and advance `i` by 4 each interation. Make sense? – selbie Jul 22 '20 at 08:43

score 0 · Answer 2 · answered Jul 12 '20 at 01:06

A little modification to selbie's answer which uses ByteBuffer which is more useful since it's what Java produces when it decodes.

JNIEXPORT void JNICALL Java_com_example_mediacodecdecoderexample_YuvToRgb_YUVtoRBGA2(JNIEnv * env, jobject obj, jobject yuv420sp, jint width, jint height, jobject rgbOut)
{
    //ITU-R BT.601 conversion
    //
    //     R = 1.164*(Y-16)+1.596*(Cr-128)
    //     G = 1.164*(Y-16)-0.392*(Cb-128)-0.813*(Cr-128)
    //     B = 1.164*(Y-16)+2.017*(Cb-128)
    //
    const int size = width * height;
    
    //After width*height luminance values we have the Cr values
    const size_t CrBase = size;
    
    //After width*height luminance values + width*height/4 we have the Cb values
    const size_t CbBase = size + width*height/4;

    jbyte *rgbData = (*env)->GetDirectBufferAddress(env, rgbOut);
    jbyte *yuv = (*env)->GetDirectBufferAddress(env, yuv420sp);

    for (int i=0; i<size; i++) {
        int Y  = yuv[i] - 16;
        int Cr = yuv[CrBase + i/4]  - 128;
        int Cb = yuv[CbBase + i/4]  - 128;

        int R = 1.164*Y+1.596*Cr;
        int G = 1.164*Y-0.392*Cb-0.813*Cr;
        int B = 1.164*Y+2.017*Cb;

        rgbData[i*3] = (R > 255) ? 255 : ((R < 0) ? 0 : R);
        rgbData[i*3+1] = (G > 255) ? 255 : ((G < 0) ? 0 : G);
        rgbData[i*3+2] = (B > 255) ? 255 : ((B < 0) ? 0 : B);
    }
}

score 0 · Answer 3 · answered Jul 12 '20 at 23:09

0

regarding:

*I use the ceiling function (shouldn't it return an int? In my case it's returning a double, why?)*

here is the syntax:

double ceil(double x);

Notice the returned type is double

MAN page for ceil()

answered Jul 12 '20 at 23:09

user3629249

16,402
1
16
17

but why is it double if it should return an integer always? – Guerlando OCs Jul 14 '20 at 02:11
the function: ceil() returns the integer part of the double value +1.0. I.E. it discards the fraction. It NEVER returns a 'int' – user3629249 Jul 14 '20 at 05:00
I see, but if it always returns the integer part of a double value, why the return type is double? – Guerlando OCs Jul 15 '20 at 14:14
because that is how the function: `ceil()` is designed. – user3629249 Jul 15 '20 at 14:24

Marek R · Answer 4 · 2020-07-16T11:30:06.250

Do not do it yourself! Do not do that directly in C++! The only proper approach is to use hardware acceleration for that. You will save lots of battery.

Basically you can utilize OpenGL to that and it will use hardware in your behalf.

Long long time ago I did this for iOS and I'm sure solution for Android will be quite similar. Sadly I left code behind (in old company) so I can't provide you example code. If I find something useful then I will update this answer. In my code YUV (and couple others color formats) was rendered directly on openGL view and OpenGL did required conversion.

Now I'm just pointing finger on OpenGL since other answers are doing this directly on CPU what is a bad choice since it will consume battery a lot and you will never achieve desired performance this way.

Edit: I've found similar question on SO with some example: https://stackoverflow.com/a/17110754/1387438

Disclaimer: didn't verified that this example is best approach, but this is a good way to start to look for better solutions.

If for some reason you need do this in C++ code anyway then drop floating point operations in favor of operations on integer types.

thank you very much my friend, I already know and have a shader color conversion working on open gl for lots of color formats, but I wanted to use C++ to test some pure RGB888 rendering on OpenGL. I wanted C++ conversion most for testing and possibly for images in the future — Guerlando OCs, Jul 16 '20 at 20:23

How to enhance this YUV420P to RGB conversion in C/C++?

4 Answers4