OpenCV based programs optimization on minimal linux embedded systems

Question

I'm building my own embedded Linux OS for the Raspberry Pi 3 using Buildroot. This OS will run several applications, one of which performs object detection based on OpenCV (v3.3.0).

I started with Raspbian Jessie + Python, but even a simple example took a long time to execute, so I decided to design my own RTOS with optimized features and to develop in C++ instead of Python.

I thought that with these optimizations the 4 cores of the RPi and its 1 GB of RAM would handle such applications. The problem is that even with these changes, the simplest computer vision programs take a lot of time.

PC vs. Raspberry Pi 3 Comparison

This is a simple program I wrote to get an idea of the order of magnitude of the execution time of each part of the program.

#include <stdio.h>
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"

#include <time.h>       /* clock_t, clock, CLOCKS_PER_SEC */

using namespace cv;
using namespace std;

int main()
{
    setUseOptimized(true);
    clock_t t_access, t_proc, t_save, t_total;

    // Access time.
    t_access = clock();
    Mat img0 = imread("img0.jpg", IMREAD_COLOR);// takes ~90ms
    t_access = clock() - t_access;

    // Processing time
    t_proc = clock();
    cvtColor(img0, img0, COLOR_BGR2GRAY);
    blur(img0, img0, Size(9,9));// takes ~18ms
    t_proc = clock() - t_proc;

    // Saving time
    t_save = clock();
    imwrite("img1.jpg", img0);
    t_save = clock() - t_save;

    t_total = t_access + t_proc + t_save;

    //printf("CLOCKS_PER_SEC = %d\n\n", CLOCKS_PER_SEC);

    // clock_t is not an int on Linux, so cast to long and print with %ld.
    printf("(TEST 0) Total execution time\t %ld cycles \t= %f ms!\n", (long)t_total, ((float)t_total)*1000./CLOCKS_PER_SEC);
    printf("---->> Accessing  in\t %ld cycles \t= %f ms.\n", (long)t_access, ((float)t_access)*1000./CLOCKS_PER_SEC);
    printf("---->> Processing in\t %ld cycles \t= %f ms.\n", (long)t_proc, ((float)t_proc)*1000./CLOCKS_PER_SEC);
    printf("---->> Saving     in\t %ld cycles \t= %f ms.\n", (long)t_save, ((float)t_save)*1000./CLOCKS_PER_SEC);

    return 0;
}

Results of Execution on an i7 PC

Results of Execution on Raspberry Pi (Generated OS from Buildroot)

As you can see, there is a huge difference. What I need is to optimize every single detail so that the processing step in this example runs in "near" real time, in at most 15 ms instead of 44 ms. So these are my questions:

How can I optimize my OS so that it can handle computation-intensive applications, and how can I control the priority of each part?
How can I fully use the 4 cores of the RPi 3 to meet these requirements?
Are there any alternatives to OpenCV?
Should I use C instead of C++?
Are there any hardware improvements you recommend?

I think you have unrealistic expectations. Just consider that the i7 (if it's a low-end one) costs about an order of magnitude more than the whole Raspberry -- there must be some reason for that. (Similarly with power usage). You haven't specified which particular i7 you have, but I bet the clock speed is higher, it has more cache, wider and faster memory interface, wider SIMD registers... Do the research, run some standard benchmarks... — Dan Mašek, Nov 30 '17 at 17:20
It's difficult enough to develop a functional vision system. It's more difficult to make it also perform well, even on beefy hardware. Additional constraints increase that difficulty further. — Dan Mašek, Nov 30 '17 at 17:24
@DanMašek I see your point of view. I said that I would like to reach **15ms** instead of **44ms** (considering only the processing part). Is that unrealistic? All I'm trying to do is reach the maximum capabilities of the Raspberry Pi. I will stop once I have used all the weapons we have (all 4 cores, the VideoCore GPU, maybe overclocking, additional HW support ...). It's my way of learning how to exploit all of these. — noureddine-as, Nov 30 '17 at 19:21
[Looking at `blur`](https://github.com/opencv/opencv/blob/3.3.0/modules/imgproc/src/smooth.cpp#L1860), the existing optimizations in the code can take advantage of either OpenCL (not sure what the state is on the RPi), OpenVX (which I'm not familiar with), or IPP (probably not on ARM). OpenCL may be viable if you can build OpenCV with it. The fallback implementation using `FilterEngine` doesn't look parallelized. So perhaps you could blur 4 smaller overlapping ROIs in parallel with threads. But there's some overhead to pay. — Dan Mašek, Nov 30 '17 at 19:53
BTW, when you're timing functions like this, make multiple measurements in a loop. The first time you call some functions can result in some overhead (e.g. libraries being loaded on demand). — Dan Mašek, Nov 30 '17 at 19:55
Your code is single-threaded, so it is not going to take advantage of the 4 cores. Try acquiring in one thread and immediately passing the image to a second thread, then the next image to a third thread and the one after that to a fourth thread in a round-robin fashion. Then your 44ms could become 1/3 of that, i.e. the 15ms you seek. — Mark Setchell, Nov 30 '17 at 21:02
@DanMašek Searching about OpenCL, I found out that OpenCV 3 has the T-API, so just by using UMat instead of Mat, OpenCV automatically handles all the operations behind the scenes, doesn't it? Concerning the threading, yes, I used this in the old Python applications and I am planning to use it here too. Regarding the timing, is there a more accurate method to determine how much time it takes? — noureddine-as, Nov 30 '17 at 21:09
@MarkSetchell Thanks for the idea. One constraint is that not all operations can be parallelized. Take object detection, for example. We can use 3 tasks (threads), each of which detects a particular object. They all operate on the same shared object, which is our image, and there is no problem; afterwards we join all the threads' results in the same image. But when we talk about pre-processing, say (grayscaling -> resizing -> histogram equalization -> blurring), these steps need to happen in sequence and therefore need more optimization work. — noureddine-as, Nov 30 '17 at 21:25
You could consider to look at something optimized for ARM architecture: [FastCV](https://developer.qualcomm.com/software/fastcv-sdk) (at least on Qualcomm device), or [ComputeLibrary](https://github.com/ARM-software/ComputeLibrary)? — Catree, Nov 30 '17 at 23:56
@Catree Thank you very much for the suggestions. Very interesting. Have you ever worked with these? — noureddine-as, Dec 01 '17 at 00:13
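
On the timing question raised in the comments above: `clock()` reports CPU time used by the process rather than elapsed time, so once several threads are involved it can overstate how long a step took. Below is a minimal sketch of a wall-clock measurement that also follows the advice to repeat the measurement in a loop. It assumes a C++11 compiler; `median_ms` is a hypothetical helper name, not part of OpenCV, and the warm-up and run counts are arbitrary choices.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper (not an OpenCV function): runs `work` a few times
// untimed as a warm-up, then times `runs` iterations with a monotonic
// wall clock and returns the median duration in milliseconds.
template <typename Work>
double median_ms(Work&& work, int warmup, int runs)
{
    using clock = std::chrono::steady_clock;  // monotonic, measures elapsed time

    if (runs <= 0)
        return 0.0;  // nothing to measure

    for (int i = 0; i < warmup; ++i)
        work();  // absorbs one-off costs such as libraries loaded on demand

    std::vector<double> samples;
    samples.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
        samples.push_back(elapsed.count());
    }

    // The median is less sensitive than the mean to an occasional slow run.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // upper median when `runs` is even
}
```

For the program in the question, the processing step could be measured with something like `median_ms([&] { blur(src, dst, Size(9, 9)); }, 3, 50)`, where `src` and `dst` are separate `Mat` objects so that every iteration works on the same input. OpenCV also offers `cv::getTickCount()` together with `cv::getTickFrequency()` for the same purpose.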

score 2 · Answer 1 · answered Dec 01 '17 at 06:38

Well, as I understand it, you want to get about 30-40 fps. In the case of your i7: it is fast and has a ton of acceleration techniques enabled by default by Intel. In the case of the Raspberry Pi: well, we love it, but it is slow, especially for image processing programs.

How can I optimize my OS so that it can handle computation-intensive applications, and how can I control the priority of each part?

You should include an acceleration library for ARM and recompile OpenCV with those features enabled.

How can I fully use the 4 cores of the RPi 3 to meet these requirements?

Parallelize your code so that it can run on all 4 cores.

Are there any alternatives to OpenCV?

Ask yourself first which features you need from OpenCV.

Should I use C instead of C++?

Changing the language will not help you at all; stay with C++ and love it. It is a beautiful language and very fast.

Are there any hardware improvements you recommend?

How about another board with a supported Mali GPU? Then you could run OpenCV code directly on the GPU, which would boost your speed a lot.
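
One way to parallelize a single filtering step, along the lines of the "overlapping ROIs" idea from the comments, is to split the image into horizontal strips and give each strip to its own thread. The sketch below is a dependency-free illustration of that technique, not OpenCV's implementation: `GrayImage`, `blur_rows` and `parallel_box_blur` are hypothetical names, the image type is a stand-in for `cv::Mat`, and the border is handled by replicating edge pixels, which is an assumption that differs from OpenCV's default border mode. A radius of 4 corresponds to the 9x9 kernel used in the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Minimal grayscale image, a stand-in for cv::Mat (hypothetical type).
struct GrayImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // row-major, width * height bytes

    GrayImage(int w, int h, std::uint8_t fill = 0)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, fill) {}

    // Read a pixel, clamping coordinates so the border is replicated.
    std::uint8_t at(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Box-blur output rows [row_begin, row_end). Reads reach `radius` rows past
// the strip on each side; that is the overlap between neighbouring strips.
void blur_rows(const GrayImage& src, GrayImage& dst, int radius,
               int row_begin, int row_end)
{
    const int side = 2 * radius + 1;
    const int area = side * side;
    for (int y = row_begin; y < row_end; ++y) {
        for (int x = 0; x < src.width; ++x) {
            int sum = 0;
            for (int dy = -radius; dy <= radius; ++dy)
                for (int dx = -radius; dx <= radius; ++dx)
                    sum += src.at(x + dx, y + dy);
            dst.pixels[static_cast<std::size_t>(y) * dst.width + x] =
                static_cast<std::uint8_t>((sum + area / 2) / area);  // rounded mean
        }
    }
}

// Split the image into horizontal strips and blur them concurrently.
// The source is only read and each thread writes a disjoint set of rows,
// so no locking is needed.
GrayImage parallel_box_blur(const GrayImage& src, int radius, int num_threads)
{
    GrayImage dst(src.width, src.height);
    num_threads = std::max(1, std::min(num_threads, src.height));

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int row_begin = src.height * t / num_threads;
        const int row_end = src.height * (t + 1) / num_threads;
        workers.emplace_back(blur_rows, std::cref(src), std::ref(dst),
                             radius, row_begin, row_end);
    }
    for (auto& worker : workers)
        worker.join();
    return dst;
}
```

Using separate source and destination buffers is the design choice that makes this safe: an in-place blur, as in the question's `blur(img0, img0, ...)`, would let one thread overwrite rows that a neighbouring strip still needs to read. Whether four threads actually approach a 4x speed-up on the RPi 3 depends on memory bandwidth and thread start-up cost, so it is worth measuring.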

Thank you @gachiemchiep. I found OpenVX and ComputeLibrary very interesting. However, I'm still reading up on how to use them. Concerning the GPUs, these are very expensive, which is why I'm trying to push what I have in my hands to the limits ^^ — noureddine-as, Dec 01 '17 at 22:48
Hmm, I see. In that case, you should try enabling some hardware optimizations first. For example, if you use OpenCV 2.4, see this link: https://docs.opencv.org/2.4/doc/tutorials/introduction/crosscompilation/arm_crosscompile_with_cmake.html — Vu Gia Truong, Dec 05 '17 at 00:03
