
I'm using a machine with 8 cores and 32 GB of RAM. On it, I'm running C++ code built with VS2010 on Windows x64, and it takes 3 days to complete 8 trees (8 is the number of outer threads). I searched for the bottleneck and found that the crossCorrelate method takes around 75-80% of the time. Now I'm trying to make that method more efficient; the code is as follows:

int main(){
    int numThread = 8;
    //create threads, run build_tree method for each of them
    //and join after running all of them
}

// I'm creating 8 tree

void build_tree(int i){  //called millions of times 
    for(some_value to another_val){
        //do some stuff
        read_corresponding_matrices
        crossCorrelate(mat1,mat2);
    }
    //write the results to a file 
}

//each tree is working with its own data, no dependency between trees.

Mat crossCorrelate(Mat mat1_real, Mat mat2_real){
    Mat mat1, mat2,result;

    //1st multi-threading part  // around 20 ms
    Scalar mean1 = mean(mat1_real);
    subtract(mat1_real,(float)mean1[0],mat1);

    Scalar mean2 = mean(mat2_real);
    subtract(mat2_real,(float)mean2[0],mat2);
    //1st part ends

    Mat tilted_mat2 = flip_cross(mat2);

    Mat planes[] = {Mat_<float>(mat1), Mat::zeros(mat1.size(), CV_32F)};
    Mat planes2[] = {Mat_<float>(tilted_mat2), Mat::zeros(mat1.size(), CV_32F)};

    Mat complexI;

    //2nd multi-threaded part   //around 150 ms
    merge(planes, 2, complexI);                     
    dft(complexI, complexI);                        
    split(complexI, planes);                        

    merge(planes2, 2, complexI);            
    dft(complexI, complexI);                        
    split(complexI, planes2);
    //2nd m-t part ends 

    // do some operations with mat1, mat2, planes etc.
    clock_t s11 = clock();
    cout << "total time diff " << s11 - s1 << endl;  // s1 = clock() is taken at the start of the method

    return result;
}

This is the method I want to make more efficient; it takes around 600 ms per call. My idea was to multi-thread the independent parts of the method, and I found two places that can run in parallel.

For this aim, I wrote a simple piece of code for each part (the 1st and 2nd m-t parts) and ran those methods:

t1 = boost::thread(subtract_mean, boost::cref(mat1_real), boost::ref(mat1));
// boost::ref is needed so the thread writes into the caller's Mat, not a copy

void subtract_mean(const Mat& mat_ori, Mat& mat){
    Scalar mean1 = mean(mat_ori);
    subtract(mat_ori, (float)mean1[0], mat);
}

Similarly, the 2nd part creates two threads, one for each dft (dft_thread).

The code involves a lot of computation, so when I run it, CPU usage reaches around 90%.

I expected the inner threads to give a better result, but they did not.

Here are my questions: Why does my code run faster without dft_thread and sub_thread? How can I make crossCorrelate faster? Could I reuse an inner thread over and over, and would that make my code faster? Is there a clever way of adding inner threads to my code?

EDIT: I ran some new tests: with no inner threads, I checked what happens when the number of outer threads is 1, 2, 4, 6, or 8 for tree size = 16. Here are the results:

numThread        1       2       4       6       8

time (sec)      29      35      51      77     104

avg (sec)       29    17.5    12.7    12.8      13

I think this shows I can only make it about 2.2 times faster with threads (29/13). I was expecting it to be 5-6 times faster with 8 threads. Is this what it should be? Am I doing something wrong, or does my understanding of threads fail?

EDIT2: I did one more test:

The first one: running the code with 6 threads.

The second one: copying the Visual Studio project 5 times and running 6 processes at the same time, each with a single thread (multithreading vs. parallel processes).

Multithreading takes 141 mins, whereas the parallel processes take 70 mins.

Note: running one process with one thread takes 53 mins.

What could be the reason for that? Has anybody seen such an abnormal situation? I was thinking both should run at the same speed (maybe multithreading a bit faster) as they use the same amount of resources; am I wrong?

Thanks,

smttsp
  • Without many I/O operations, threading will not make much difference compared to a single thread; in fact you may lose performance due to context switching on the CPU. – Netwave Jan 23 '14 at 08:55
  • It looks CPU-bound to me. – Martin James Jan 23 '14 at 08:56
  • I/O and memory are not the bounding resources. Disk I/O can go up to 500 MB/sec. Outer threads increase speed by 5-6 times when I use 8 threads, but inner threads slow it down. – smttsp Jan 23 '14 at 08:58
  • @MartinJames, I'm thinking of the same, but is there a solution for that? For example, could I use the threads, I'm using for the 1st m-t in the 2nd m-t as well? is it a good idea to use it? By the way, I recently learnt multi-threading – smttsp Jan 23 '14 at 09:04
  • I think it all comes down to merge, dft, split method implementations and amount of data you are using – mindo Jan 23 '14 at 09:11
  • Assuming `dft` means Discrete Fourier Transform, presumably you're actually calculating it as an FFT internally? – Oliver Charlesworth Jan 23 '14 at 09:33
  • @OliCharlesworth, yes it is doing fft and inverse fft after. Those are opencv functions and I'm not going to be able to do any change on those. – smttsp Jan 23 '14 at 09:43
  • 1
    Can't you use a profiler? What happens when you make the function arguments const references? – Torsten Robitzki Jan 23 '14 at 09:47
  • @TorstenRobitzki I didn't know about profilers. Now I will learn one and test my code with it; thanks for recommending that tool. I have changed some of the arguments to `const` (the ones that can be changed) and with two threads it went from 35 sec to 34 sec, which isn't much. – smttsp Jan 23 '14 at 10:06

1 Answer


Well, not really an answer, but as a comment it would not be very readable:

  1. try to avoid any parameters/returns for any function that is called often

    const parameters alone are not enough; if possible, use globals instead (one set per thread, of course), since that is much faster than heap trashing. For example, in crossCorrelate(mat1,mat2); the mat1, mat2 can be global. In the best case, parameters are passed by reference or pointer; then it's not a big deal, but it can still buy some time. In the worst case, each argument is copied into a new object on every call, and when your matrices are big that takes time. Also do not forget that the constructor/destructor is called too ...

  2. avoid dynamic allocation in often executed code

    Allocate only once if possible. Modern C/C++ runtimes have pretty good memory managers, so this will not buy much time, but even 1-5% sometimes counts.

  3. check DFT

    As mentioned before, it should be computed as a fast Fourier transform (FFT). I am not sure whether you have a fast enough implementation, but if your input data always has the same matrix size, then you can pre-compute the weights once and reuse them every time. That speeds up the FFT/IFFT significantly.

    BTW merge, dft and split can be rewritten too (to work in place and without parameters). Or you can use double-buffering techniques (swap pointers on execute).

    As you wrote, you cannot change the OpenCV source, so try a different FFT/IFFT implementation.

    What about an NTT/INTT (number-theoretic transform)? If your algorithm just uses the FFT for its properties, then sometimes the NTT is faster, but if your input data is complex, you have no other choice.

  4. you are reading matrices (I assume from some file)

    Check the performance of that. If it is a binary file, there is little to improve, but if it is in text form, check the reading efficiency. For example, WinAPI ini-file reading is about 1000x slower than an efficiently written ini parser in C++, especially for big files.

  5. you can try to improve performance by better thread management

    • use a thread count that matches the CPU core count. If you have 4 CPUs, then 100 threads will not be faster than 4 threads; it will actually be slower.
    • you can change the thread/process priority/class
    • sometimes a well-placed Sleep() actually speeds things up
    • you can play with affinity masks to control which thread runs on which CPU; sometimes it helps.

PS. btw how big are your matrices ?

[Edit1]

When you go parallel/multi-threaded, you are accessing N times more resources at once ...

  • a single matrix of yours is 1K x 1K x float = 4 MB
  • after the FFT you switch to complex, so it becomes 8 MB
  • you are doing some operations on 2 matrices (A-B, for example), so that is 16 MB
  • if you do not use in-place algorithms (C = A - B instead of A -= B), then you are using 24 MB

So check the CACHE size on your computer; there is your bottleneck (at least in my opinion). Also, when you pass or return a matrix by value (an object, not a reference or pointer), you can add 8 MB for each of them. And consider the number of recursive calls inside a 2D (I)FFT when N = 1024!!! The amount of heap trashing is horrible.

Spektre
  • Thanks for the really detailed answer; I didn't know many of the things in it. My matrix size is 1K*1K, float. I'm using a binary file, so I can't do much about the 4th point. I don't know `DFT` and `FFT` internals, I just used them; DFT and IDFT are the OpenCV implementations. I should get rid of my dynamic allocations, I use them a lot. And I will look into the 5th point to improve performance. Thanks a lot, it gave me a lot of information. I made one more edit in my post; could you please have a look at it? – smttsp Jan 28 '14 at 08:01
  • I do not see any new edit in your question yet (still editing?). Well, I have coded DFFT/IDFFT and NTT/INTT a lot, and there are many possible implementations. Although OpenCV should have a really fast one, I am not so sure of that. For example, when my friend and I compared a few apps with the same goal, one with OpenCV and one raw-coded without any lib, OpenCV was much, much slower (OCR apps with a lot of DCT/DFFT usage). The main reason, I think, is that OpenCV has a universal FFT, so no pre-computations are made; instead I use a pre-allocated DCT/DFFT for a fixed input data size, which is a lot faster. – Spektre Jan 28 '14 at 08:17
  • The idea is to precompute all the weights once (when you have a 'fixed' matrix size) so they do not need to be computed again; this speeds up the FFT more than one would think. Also, in your case I would strongly consider hard-coding the FFT just for your matrix size, so no recursion is needed and very many operations go away ... I am confident this approach would speed up your processing at least 5-20 times. – Spektre Jan 28 '14 at 08:24