How to parallelize this for loop for rapidly converting YUV422 to RGB888?

Question

I am using the v4l2 API to grab images from a Microsoft Lifecam and then transferring these images over TCP to a remote computer. I am also encoding the video frames into MPEG2VIDEO using the ffmpeg API. The recorded videos play too fast, probably because not enough frames are captured and the FPS settings are incorrect.

The following is the code which converts a YUV422 source to a RGB888 image. This code fragment is the bottleneck in my code as it takes nearly 100 - 150 ms to execute which means I can't log more than 6 - 10 FPS at 1280 x 720 resolution. The CPU usage is 100% as well.

for (int line = 0; line < image_height; line++) {
    for (int column = 0; column < image_width; column++) {
        *dst++ = CLAMP((double)*py + 1.402*((double)*pv - 128.0));                                                  // R - first byte           
        *dst++ = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0));    // G - next byte
        *dst++ = CLAMP((double)*py + 1.772*((double)*pu - 128.0));                                                            // B - next byte

        vid_frame->data[0][line * frame->linesize[0] + column] = *py; 

        // increment py, pu, pv here

    }
}

'dst' is then compressed as jpeg and sent over TCP and 'vid_frame' is saved to the disk.

How can I make this code fragment faster so that I can get at least 30 FPS at 1280x720 resolution as compared to the present 5-6 FPS?

I've tried parallelizing the for loop across three threads using pthreads, processing one third of the rows in each thread.

for (int line = 0; line < image_height/3; line++) // thread 1
for (int line = image_height/3; line < 2*image_height/3; line++) // thread 2
for (int line = 2*image_height/3; line < image_height; line++) // thread 3

This gave me only a minor improvement of 20-30 milliseconds per frame. What would be the best way to parallelize such loops? Can I use GPU computing or something like OpenMP, say, spawning some 100 threads to do the calculations?

I also noticed higher frame rates with my laptop webcam as compared to the Microsoft USB Lifecam.

Here are other details:

Ubuntu 12.04, ffmpeg 2.6
AMD A8 quad-core processor with 6GB RAM
Encoder settings:
- codec: AV_CODEC_ID_MPEG2VIDEO
- bitrate: 4000000
- time_base: (AVRational){1, 20}
- pix_fmt: AV_PIX_FMT_YUV420P
- gop: 10
- max_b_frames: 1

If you can afford some more bandwidth/memory to use RGBA8888 rather than RGB888, then it'll be a lot easier. — user3528438, Mar 30 '15 at 12:54

szatmary · Answer 1 · 2015-03-31T03:29:18.283

If all you care about is fps and not ms per frame (latency), another option would be a separate thread per frame.

Threading is not the only option for speed improvements. You could also perform integer operations as opposed to floating point. And SIMD is an option. Using an existing library like sws_scale will probably give you the best performance.

Make sure you are compiling with -O3 (or -Os).

Make sure debug symbols are disabled.

Move repeated operations outside the loop, e.g.:

// the compiler may not be able to hoist this itself, because the byte stores inside the loop could alias frame->linesize[0]
int row = line * frame->linesize[0];
for (int column = 0; column < image_width; column++) {
    ...
    vid_frame->data[0][row + column] = *py;

You can precompute tables, so there is no math in the loop:

void init() {
    // table for the red channel, indexed by the V and Y sample values
    for (int pv = 0; pv <= 255; ++pv)
        for (int py = 0; py <= 255; ++py)
            ytable[pv][py] = CLAMP(py + 1.402*(pv - 128.0));
}

for (int column = 0; column < image_width; column++) {
    *dst++ = ytable[*pv][*py];

Just to name a few options.
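
As a follow-up to the lookup-table idea above: a full 256x256 table costs 64 KB per channel, which may not fit comfortably in L1 cache. Because each chroma term in the question's formula depends on only one of U or V, four 256-entry tables are enough. The following is a minimal sketch under that assumption; the names (`init_tables`, `yuv_to_rgb_lut`, `clamp_u8`, and the table names) are hypothetical and not from the original post, and the table entries are truncated rather than rounded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-channel chroma tables; coefficients are the ones
 * used in the question's floating-point code. */
static int r_from_v[256], g_from_u[256], g_from_v[256], b_from_u[256];
static int tables_ready = 0;

static inline uint8_t clamp_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Call once at startup, before converting any frame. */
static void init_tables(void)
{
    for (int c = 0; c < 256; ++c) {
        /* The cast truncates toward zero. */
        r_from_v[c] = (int)( 1.402 * (c - 128));
        g_from_u[c] = (int)(-0.344 * (c - 128));
        g_from_v[c] = (int)(-0.714 * (c - 128));
        b_from_u[c] = (int)( 1.772 * (c - 128));
    }
    tables_ready = 1;
}

/* Converts one pixel using only table lookups, integer adds and a clamp. */
static void yuv_to_rgb_lut(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    assert(tables_ready); /* init_tables() must have run first */
    rgb[0] = clamp_u8(y + r_from_v[v]);
    rgb[1] = clamp_u8(y + g_from_u[u] + g_from_v[v]);
    rgb[2] = clamp_u8(y + b_from_u[u]);
}
```

The four tables together take 4 KB when `int` is 32 bits wide, so they are likely to stay cache-resident while a frame is processed.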

score 1 · Answer 2 · answered Apr 15 '15 at 12:59

I think unless you want to reinvent the painful wheel, using pre-existing options (ffmpeg's libswscale or ffmpeg's scale filter, gstreamer's scale plugin, etc.) is a much better option.

But if you want to reinvent the wheel for whatever reason, show the code you used. For example, thread startup is expensive, so you'd want to create the threads before measuring your loop time and reuse them from frame to frame. Better yet is frame threading, but that adds latency; this is usually OK but depends on your use case. More importantly, don't write plain C code; learn to write x86 SIMD assembly. All of the previously mentioned libraries use SIMD for such conversions, and that will give you a 3-4x speedup (since it lets you process 4-8 pixels per iteration instead of 1).

score 0 · Answer 3 · answered Mar 30 '15 at 12:55

You could build blocks of x lines and convert each block in a separate thread

bazz-dee


Even one block takes 100% of the CPU core so say if I build 100 blocks across 100 threads, is it actually going to help? – vineet Mar 30 '15 at 12:59
I would create two times the number of threads as CPUs in your system. And do not create new threads for each new image frame as this takes time. Use a thread pool. – bazz-dee Mar 30 '15 at 13:26
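
The block-per-thread suggestion in this answer needs a way to assign rows to a fixed set of long-lived workers. Below is a minimal sketch of just that partitioning step; thread creation, the pool itself and synchronization are deliberately left out, and the names `row_range` and `rows_for_worker` are hypothetical.

```c
#include <assert.h>

/* Half-open row range [begin, end) handled by one worker. */
struct row_range {
    int begin;
    int end;
};

/* Splits image_height rows as evenly as possible among num_workers.
 * Consecutive workers get adjacent ranges that cover every row once. */
static struct row_range rows_for_worker(int image_height, int worker,
                                        int num_workers)
{
    assert(num_workers > 0 && worker >= 0 && worker < num_workers);
    struct row_range r;
    r.begin = (int)((long long)image_height * worker / num_workers);
    r.end   = (int)((long long)image_height * (worker + 1) / num_workers);
    return r;
}
```

For a 720-row frame and 4 workers this yields 180 rows per worker. Each worker would compute its range once and reuse it for every frame, so no per-frame thread startup is needed.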

score 0 · Answer 4 · answered Apr 14 '15 at 10:24

Do not mix integer and floating-point arithmetic!

char x;
char y=((double)x*1.5); /* ouch casting double<->int is slow! */
char z=(x*3)>>1;        /* fixed point arithmetic rulez */

Use SIMD (though this would be easier if both input and output data were properly aligned, e.g. by using RGBA8888 as output).
Use OpenMP.

An alternative that does not require coding the processing yourself is to do the entire processing in a framework that does proper timestamping throughout the pipeline (starting at image acquisition time) and is hopefully optimized enough to deal with large amounts of data, e.g. gstreamer.
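
To make the fixed-point advice concrete, here is a minimal sketch that applies it to the question's own coefficients, scaled by 2^16. It assumes the camera delivers packed YUYV (Y0 U Y1 V), which the question does not state, and all names (`yuyv_row_to_rgb_fixed`, `fix_to_u8`, the constants) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FIX_SHIFT 16
#define FIX_HALF  (1 << (FIX_SHIFT - 1)) /* rounding term */

/* The question's coefficients multiplied by 2^16 and rounded. */
enum {
    R_V = 91881,  /* 1.402 */
    G_U = 22544,  /* 0.344 */
    G_V = 46793,  /* 0.714 */
    B_U = 116130  /* 1.772 */
};

/* Clamps before shifting, so a negative value is never right-shifted. */
static inline uint8_t fix_to_u8(int32_t x)
{
    if (x < 0)
        return 0;
    x >>= FIX_SHIFT;
    return (uint8_t)(x > 255 ? 255 : x);
}

/* Converts one row of packed YUYV (assumed layout) to RGB888. */
static void yuyv_row_to_rgb_fixed(const uint8_t *src, uint8_t *dst, int width)
{
    assert(width % 2 == 0); /* YUYV stores pixels in pairs */
    for (int x = 0; x < width; x += 2) {
        int32_t cu = (int32_t)src[1] - 128;
        int32_t cv = (int32_t)src[3] - 128;
        /* Chroma terms are shared by both pixels of the pair. */
        int32_t r = R_V * cv;
        int32_t g = -G_U * cu - G_V * cv;
        int32_t b = B_U * cu;
        for (int k = 0; k < 2; ++k) {
            int32_t yy = ((int32_t)src[2 * k] << FIX_SHIFT) + FIX_HALF;
            *dst++ = fix_to_u8(yy + r);
            *dst++ = fix_to_u8(yy + g);
            *dst++ = fix_to_u8(yy + b);
        }
        src += 4;
    }
}
```

Computing the chroma terms once per pixel pair roughly halves the multiplications compared with converting each pixel independently.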

MVTC · Answer 5 · 2015-04-15T06:26:02.970

Would something like this not work?

#pragma omp parallel for
for (int line = 0; line < image_height; line++) {
    for (int column = 0; column < image_width; column++) {
        dst[ ( image_width*line + column )*3    ] = CLAMP((double)*py + 1.402*((double)*pv - 128.0));                                                  // R - first byte           
        dst[ ( image_width*line + column )*3 + 1] = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0));    // G - next byte
        dst[ ( image_width*line + column )*3 + 2] = CLAMP((double)*py + 1.772*((double)*pu - 128.0));                                                            // B - next byte

        vid_frame->data[0][line * frame->linesize[0] + column] = *py; 

        // increment py, pu, pv here

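    }
}

Of course, you also have to handle the py, pu, pv increments accordingly: in a parallel loop, each iteration must compute its own source pointers from line and column instead of advancing shared ones.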
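
One way to handle the source pointers in the OpenMP version is to derive them from `line` and `column`, so that no iteration depends on state left by another. The sketch below assumes packed YUYV input (Y0 U Y1 V) with a known row stride and tightly packed RGB888 output; neither is confirmed by the question, the function name `yuyv_to_rgb_omp` is hypothetical, and the luma copy into `vid_frame` is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline uint8_t clamp_u8(double x)
{
    return (uint8_t)(x < 0.0 ? 0.0 : (x > 255.0 ? 255.0 : x));
}

/* src: packed YUYV (assumed), src_stride bytes per row.
 * dst: RGB888, width * 3 bytes per row. */
static void yuyv_to_rgb_omp(const uint8_t *src, int src_stride,
                            uint8_t *dst, int width, int height)
{
    assert(width % 2 == 0); /* YUYV stores pixels in pairs */

    /* Rows are independent, so they can be split among threads.
     * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int line = 0; line < height; line++) {
        const uint8_t *in = src + (size_t)line * (size_t)src_stride;
        uint8_t *out = dst + (size_t)line * (size_t)width * 3;

        for (int column = 0; column < width; column++) {
            /* Each 4-byte group holds two luma samples and one U/V pair. */
            const uint8_t *pair = in + (column / 2) * 4;
            double y = pair[(column & 1) * 2];
            double u = pair[1] - 128.0;
            double v = pair[3] - 128.0;

            out[column * 3]     = clamp_u8(y + 1.402 * v);             /* R */
            out[column * 3 + 1] = clamp_u8(y - 0.344 * u - 0.714 * v); /* G */
            out[column * 3 + 2] = clamp_u8(y + 1.772 * u);             /* B */
        }
    }
}
```

With GCC or Clang, OpenMP is enabled by the `-fopenmp` flag. The arithmetic here is still the question's floating-point formula, so the threading gain would come on top of, not instead of, the integer or SIMD changes suggested in the other answers.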

score 0 · Answer 6 · answered Apr 16 '15 at 07:18

Pixel format conversion is usually done using only integer arithmetic. This avoids conversions between floating-point and integer values, and it also makes it possible to use the SIMD extensions of modern CPUs more effectively. For example, this is code for converting YUV to BGR:

const int Y_ADJUST = 16; 
const int UV_ADJUST = 128;
const int YUV_TO_BGR_AVERAGING_SHIFT = 13;
const int YUV_TO_BGR_ROUND_TERM = 1 << (YUV_TO_BGR_AVERAGING_SHIFT - 1); 
const int Y_TO_RGB_WEIGHT = int(1.164*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_BLUE_WEIGHT = int(2.018*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_GREEN_WEIGHT = -int(0.391*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_GREEN_WEIGHT = -int(0.813*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_RED_WEIGHT = int(1.596*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);

inline int RestrictRange(int value, int min = 0, int max = 255)
{
    return value < min ? min : (value > max ?  max : value);
}

inline int YuvToBlue(int y, int u)
{
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) + 
        U_TO_BLUE_WEIGHT*(u - UV_ADJUST) + 
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}

inline int YuvToGreen(int y, int u, int v)
{
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) + 
        U_TO_GREEN_WEIGHT*(u - UV_ADJUST) + 
        V_TO_GREEN_WEIGHT*(v - UV_ADJUST) + 
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}

inline int YuvToRed(int y, int v)
{
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) + 
        V_TO_RED_WEIGHT*(v - UV_ADJUST) + 
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}

This code is taken from here (http://simd.sourceforge.net/). The library also contains versions of this code optimized for different SIMD instruction sets.
