I have a 496*O(N^3) loop. I am applying a blocking optimization in which I operate on 2 images at a time instead of 1; in effect, I am unrolling the outer loop. The non-unrolled version of the code is shown below. By the way, I'm using an Intel Xeon X5365 machine with 8 cores, a 3 GHz clock, a 1333 MHz bus, 8 MB of shared L2 (4 MB shared between every 2 cores), a 32 KB L1-I cache and a 32 KB L1-D cache.

for (imageNo = 0; imageNo < 496; imageNo++) {
    for (unsigned int k = 0; k < 256; k++)
    {
        double z = O_L + (double)k * R_L;
        for (unsigned int j = 0; j < 256; j++)
        {
            double y = O_L + (double)j * R_L;

            for (unsigned int i = 0; i < 256; i++)
            {
                double x[1] = {O_L + (double)i * R_L};
                double w_n = (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]);
                double u_n = ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
                double v_n = ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);

                for (int loop = 0; loop < 1; loop++)
                {
                    px_x[loop] = (int) floor(u_n);
                    px_y[loop] = (int) floor(v_n);
                    alpha[loop] = u_n - px_x[loop];
                    beta[loop]  = v_n - px_y[loop];
                }

                if (px_y[0] >= 0 && px_y[0] < (int)threadCopy[0].S_y)
                {
                    if (px_x[0] >= 0 && px_x[0] < (int)threadCopy[0].S_x)
                        /////////////////// (i, j) pixels ///////////////////
                        pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
                    else
                        pixel_1[0] = 0.0;

                    if (px_x[0]+1 >= 0 && px_x[0]+1 < (int)threadCopy[0].S_x)
                        /////////////////// (i+1, j) pixels ///////////////////
                        pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
                    else
                        pixel_1[2] = 0.0;
                }
                else {
                    pixel_1[0] = 0.0;
                    pixel_1[2] = 0.0;
                }

                if (px_y[0]+1 >= 0 && px_y[0]+1 < (int)threadCopy[0].S_y)
                {
                    if (px_x[0] >= 0 && px_x[0] < (int)threadCopy[0].S_x)
                        /////////////////// (i, j+1) pixels ///////////////////
                        pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
                    else
                        pixel_1[1] = 0.0;

                    if (px_x[0]+1 >= 0 && px_x[0]+1 < (int)threadCopy[0].S_x)
                        /////////////////// (i+1, j+1) pixels ///////////////////
                        pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
                    else
                        pixel_1[3] = 0.0;
                }
                else {
                    pixel_1[1] = 0.0;
                    pixel_1[3] = 0.0;
                }

                // Bilinear interpolation of the four neighbouring pixels
                pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0] + (1.0 - alpha[0]) * beta[0] * pixel_1[1]
                      + alpha[0] * (1.0 - beta[0]) * pixel_1[2] + alpha[0] * beta[0] * pixel_1[3];

                f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
            }   // i
        }       // j
    }           // k
}               // imageNo (this closing brace was missing from the listing as posted)

I profiled the results using Intel VTune 2013 (on a binary built with gcc-4.1). I can see a 40% reduction in memory bandwidth usage, which was expected because 2 images are processed in every iteration (the f_L store operation causes 8 bytes of traffic for every voxel). This accounts for an 11.7% reduction in bus cycles. Also, since the block size in the inner loop is larger, resource stalls decrease by 25.5%. Together these two effects account for an 18% reduction in response time.

The mystery is: why did instructions retired increase by 7.9%? (This accounts for a 6.51% increase in response time.) The only possible reason I can think of is:

1. Since the number of branch instructions inside the block increases (and the Core architecture has an 8-bit global history), retired branch instructions increased by 2.5%. Mispredictions, however, stayed the same, which seems fishy to me.

That still leaves the remaining 5.4% unexplained. Could anyone please point me in the right direction? I have run out of ideas. Thanks a lot!
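
To make the bandwidth argument concrete, here is a rough sketch of why processing 2 images per iteration halves the f_L read-modify-write traffic. This is not my actual unrolled code. The function names and the `contrib` array are hypothetical, and `contrib` stands in for the per-image value `1.0 / (w_n * w_n) * pix_1`, which the real loop computes rather than loads.

```c
#include <assert.h>
#include <stddef.h>

/* One image per pass: each voxel of f_L is loaded and stored once per image. */
static size_t accumulate_one_at_a_time(float *f_L, size_t n_voxels,
                                       const float *contrib, size_t n_images)
{
    size_t stores = 0;
    for (size_t img = 0; img < n_images; img++) {
        for (size_t v = 0; v < n_voxels; v++) {
            f_L[v] += contrib[img * n_voxels + v];
            stores++;                      /* counts f_L read-modify-writes */
        }
    }
    return stores;
}

/* Two images per pass: each voxel of f_L is loaded and stored once per PAIR. */
static size_t accumulate_two_at_a_time(float *f_L, size_t n_voxels,
                                       const float *contrib, size_t n_images)
{
    assert(n_images % 2 == 0);             /* sketch assumes an even image count */
    size_t stores = 0;
    for (size_t img = 0; img < n_images; img += 2) {
        for (size_t v = 0; v < n_voxels; v++) {
            float a = contrib[img * n_voxels + v];
            float b = contrib[(img + 1) * n_voxels + v];
            f_L[v] += a + b;               /* single update covers both images */
            stores++;
        }
    }
    return stores;
}
```

The second version performs half as many f_L updates for the same set of images. Summing the two contributions before adding them to f_L can change floating-point rounding slightly compared with two separate updates.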

quantumshiv
  • You say the instruction count has increased - it would be helpful to see the original code for comparison. You may also want to compare static number of instructions per iteration in the produced assembly, maybe the compiler has less luck optimizing the new code. – Leeor May 03 '14 at 09:15
  • I disassembled the code and could see that it's similar instructions that have been replicated twice. I am thinking that the compiler is producing the same instructions but the hardware prefetchers are generating more instructions to load pixel values from 2 different images at very large (1248 x 960 pixels x 4 bytes) base address locations. – quantumshiv May 06 '14 at 00:55
  • HW prefetchers don't count as instructions, loop unrolling shouldn't increase the dynamic instruction count. Isn't there any vectorization done? – Leeor May 06 '14 at 07:07
  • No. There is no vectorization done. And I am not assigning any gcc optimization flags either. Idk what it is, but something smells fishy with data being accessed from 2 different locations for every iteration. But I can't even guess. – quantumshiv May 06 '14 at 13:47
  • Since you haven't posted any asm, presumably the hand-unrolled version actually compiles to more asm instructions. Maybe it ran out of registers and had to spill/reload some locals to the stack? Maybe something totally different. It doesn't sound like anything to worry about, unless you look at the asm for the inner loop and see that you could have done better by hand. Modern CPUs are designed with wide pipelines to handle the crap thrown at them by imperfect compilers, so you don't have to hand-tune asm to get near-optimal performance in most cases. – Peter Cordes Aug 24 '16 at 21:25
