
I've recently been working on this problem, taken directly (and translated) from day 1, task 3 of IOI 2010, "Quality of Living", and I ran into a weird phenomenon.

I was setting up a 0-1 matrix and using that to calculate a prefix sum matrix in 1 loop:

for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
        b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
    }
}

and I got TLE (time limit exceeded) on 4 tests (the time limit is 2.0s). Meanwhile, using 2 separate for loops:

for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
    }
}

for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
    }
}

got me full AC (accepted).

As we can see from the 4 pictures here:

the code with 2 for loops generally ran a bit faster (even on the accepted test cases), contrary to my expectation that the single for loop should be quicker. Why does this happen?

Full code (AC) : https://pastebin.com/c7at11Ha (Please ignore all the nonsense bit and stuff like using namespace std;, as this is a competitive programming contest).

  • Note: The judge server, lqdoj.edu.vn, is built on dmoj.ca, a global competitive programming contest platform.
silverfox

1 Answer


If you look at the assembly, you'll see the source of the difference:

  1. Single loop:
{
    if (a[i][j] < x)
    {
        lower[i][j] = 0;
    }
    else
    {
        lower[i][j] = 1;
    }
    b[i][j] = b[i-1][j] 
            + b[i][j-1]
            - b[i-1][j-1]
            + lower[i][j];
}

In this case, there's a data dependency: the assignment to b reads the value that was just assigned to lower, so within each iteration the operations go sequentially, first the assignment to lower, then the one to b. On top of that, b[i][j] reads b[i][j-1], which was written in the previous iteration of the inner loop. The compiler can't optimize this code significantly because of these dependencies.

  2. Separation of assignments into 2 loops:

The assignment to lower is now independent, so the compiler can use SIMD instructions for it, which gives a performance boost in the first loop. The second loop stays more or less the same as in the original assembly.
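To make the comparison concrete, here is a minimal sketch of both variants side by side. The names `Grid`, `prefix_fused` and `prefix_split` are made up for illustration and are not from the question's code, and the 1-indexed layout with a padding row and column 0 is an assumption that mirrors the loops above. Whether the first loop of the split version is actually vectorized depends on your compiler and flags, so treat this as an illustration rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: 1-indexed grid, row 0 and column 0 are zero padding.
using Grid = std::vector<std::vector<int>>;

// Fused version: threshold and prefix sum in the same loop body.
// b[i][j] needs b[i][j-1] from the previous iteration, so the whole
// body has to run element by element.
Grid prefix_fused(const Grid& a, int x) {
    assert(!a.empty() && !a[0].empty());
    const std::size_t m = a.size() - 1, n = a[0].size() - 1;
    Grid lower(m + 1, std::vector<int>(n + 1, 0));
    Grid b(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 1; i <= m; i++) {
        for (std::size_t j = 1; j <= n; j++) {
            lower[i][j] = (a[i][j] < x) ? 0 : 1;
            b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
        }
    }
    return b;
}

// Split version: the first loop has no dependency between iterations,
// so a compiler may vectorize it; the second loop keeps the dependency.
Grid prefix_split(const Grid& a, int x) {
    assert(!a.empty() && !a[0].empty());
    const std::size_t m = a.size() - 1, n = a[0].size() - 1;
    Grid lower(m + 1, std::vector<int>(n + 1, 0));
    Grid b(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 1; i <= m; i++) {
        for (std::size_t j = 1; j <= n; j++) {
            lower[i][j] = (a[i][j] < x) ? 0 : 1;  // independent per element
        }
    }
    for (std::size_t i = 1; i <= m; i++) {
        for (std::size_t j = 1; j <= n; j++) {
            b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
        }
    }
    return b;
}
```

Both functions return the same prefix sum matrix; only the loop structure differs, which is what lets the compiler treat the threshold step separately.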

Alexander
  • Thanks for the answer! I'm not really familiar with assembly, so could you elaborate on what "SIMD instructions" are in this case and how they affect the running time here? I assume that the part from lines 163-212 is what makes the difference, but I'm not really sure what it does. – silverfox Oct 20 '21 at 18:17
  • @silverfox SIMD stands for Single Instruction, Multiple Data: one instruction processes several input items packed into a special wide register, e.g. it sums several integers at once. In your case, for example, the pcmpgtd instruction compares several array elements at once, and movdqu stores all the results (0/1) into the destination. (These instructions are for x86; ARM has its own SIMD instructions.) – Alexander Oct 20 '21 at 18:30
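For anyone curious what that looks like when written by hand, below is a rough sketch of the threshold step using SSE2 intrinsics. The function name `threshold_row` is hypothetical, and the code assumes `int` is 32 bits wide. `_mm_cmpgt_epi32` is the intrinsic that corresponds to pcmpgtd, and the unaligned load/store intrinsics typically compile to movdqu (or a VEX-encoded variant). On targets without SSE2 the `#if` guard leaves only the plain scalar loop. This is not the compiler's actual output for the question's code, just an illustration of the idea.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#endif

// Writes lower[k] = (a[k] < x) ? 0 : 1 for every k in [0, n).
// Assumes 32-bit int; the name and signature are illustrative only.
void threshold_row(const int* a, int* lower, std::size_t n, int x) {
    assert(n == 0 || (a != nullptr && lower != nullptr));
    std::size_t k = 0;
#if defined(__SSE2__)
    const __m128i xv = _mm_set1_epi32(x);    // x copied into all 4 lanes
    const __m128i ones = _mm_set1_epi32(1);  // 1 in all 4 lanes
    for (; k + 4 <= n; k += 4) {
        // Load 4 ints at once (unaligned load).
        __m128i av = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + k));
        // Lane is all-ones where x > a[k], i.e. where a[k] < x.
        __m128i lt = _mm_cmpgt_epi32(xv, av);
        // (~lt) & 1: gives 0 where a[k] < x and 1 otherwise.
        __m128i res = _mm_andnot_si128(lt, ones);
        // Store 4 results at once (unaligned store).
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lower + k), res);
    }
#endif
    // Scalar tail, and full fallback when SSE2 is unavailable.
    for (; k < n; ++k) {
        lower[k] = (a[k] < x) ? 0 : 1;
    }
}
```

Four comparisons happen per loop iteration instead of one, which is where the speedup in the first loop comes from. The prefix sum loop cannot be handled this simply because each element needs the one before it.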