1

Assume there are four nested loops with different loop counters and conditions. Is there any way to tell the compiler (icc, gcc, or clang) to transform all of the loops into one loop?

N=128; M=128; P=3; Q=3; //All these variables are constant
for (n=0; n<N; n++){
    for(m=0; m<M; m++){
        temp=0;
        for(p=0; p<P; p++){ 
            for(q=0; q<Q; q++){
                temp += kernel[p][q] * input[n+p][m+q];
            }
        }
        output[n][m]=temp;
    }
}

To be transformed to:

for(;;)
    //computations...

In my experience, this is useful when you rely on auto-vectorization. A way to transform them into two nested loops would work as well. This question was solved in a similar way, but with hand-written code. I have a program, which you can see here on godbolt.

ikegami
Amiri
  • How would you transform it to a single loop? What magic are you expecting to happen, somehow to have a compiler reduce N*M*P*Q computations by itself? If you can't, why should it? – kabanus Sep 24 '17 at 19:54
  • If it were possible to transform them into two nested loops, that would work too. – Amiri Sep 24 '17 at 19:55
  • `-funroll-all-loops`? – MarkWeston Sep 24 '17 at 19:55
  • @MarkWeston Doesn't that usually decrease performance? I think OP wants an increase. – kabanus Sep 24 '17 at 19:57
  • 2
    No, I wouldn't expect `-funroll-all-loops` to do what the OP asks. In the first place, the "all" in that option is about which loops are *candidates* for unrolling -- they include loops whose number of iterations cannot be completely determined at compile time. Not all loops that are candidates are unrolled. In the second place, if it truly did unroll all loops, then the OP would be left with zero loops, not one. – John Bollinger Sep 24 '17 at 19:59
  • @MarkWeston -funroll-all-loops does not fully unroll – Amiri Sep 24 '17 at 20:00
  • @Martin I'm sorry I don't know what can unroll even fuller than `-funroll-all-loops`. – MarkWeston Sep 24 '17 at 20:01
  • @Martin, compiler-directed optimizations are for when you don't care about the details of how the compiler generates fast code, and furthermore it is not essential that the compiler produce the fastest code possible. If it is important to you that a specific loop restructuring be performed, then your best bet is to do it yourself, in the source. – John Bollinger Sep 24 '17 at 20:02
  • I'm waiting for godbolt to give me a link so I can see it in each compiler and see what `-funroll-all-loops` does – Amiri Sep 24 '17 at 20:02
  • @JohnBollinger, unfortunately I have to do it myself, but if there were a way to restructure all the loops, I think it would help the auto-vectorizer too. – Amiri Sep 24 '17 at 20:07
  • You at least want to make sure the variables `N, M, P, Q` are defined `const`. – aschepler Sep 24 '17 at 20:24
  • They are defined as constants: `#define N 128`, etc. – Amiri Sep 24 '17 at 20:25
  • Just factor out the multiply from the inner loops, optimization will do (most of) the rest unless you can assure alignments and array dimensions. – technosaurus Sep 24 '17 at 20:38
  • @technosaurus, could you take a look at the linked question and provide another answer showing what you think will work? The arrays are aligned. – Amiri Sep 24 '17 at 20:50

1 Answer

5

I have no idea why you'd want to, but you can do it manually.

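int accumulator = 0;
for (int i=0; i<N*M*P*Q; ++i) {
    // Decompose i = ((n*M + m)*P + p)*Q + q, innermost index first.
    int n = i;
    int q = n % Q;  n /= Q;
    int p = n % P;  n /= P;
    int m = n % M;  n /= M;

    // First (p,q) of a new output element: reset the sum.
    if (!p && !q)
       accumulator = 0;

    accumulator += kernel[p][q] * input[n+p][m+q];

    // Last (p,q) of this output element: store the finished sum.
    if (p == P-1 && q == Q-1)
        output[n][m] = accumulator;
}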

Two loops make a little more sense.

for (int i=0; i<N*M; ++i) {
    int n = i / M;
    int m = i % M;

    int accumulator = 0;
    for (int j=0; j<P*Q; ++j) {
        int p = j / Q;
        int q = j % Q;
        accumulator += kernel[p][q] * input[n+p][m+q];
    }    

    output[n][m] = accumulator;
}
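If you restructure loops by hand like this, it is worth checking the result against the original nested version before looking at what the vectorizer does with it. Below is a minimal sketch of such a check. The element type (`int`), the padded size of `input`, the test data, and all function names are assumptions made for illustration; none of them come from the question. The one-loop variant here stores the result on the last `(p, q)` step of each output element.

```c
/* Sizes from the question. Element type and test data are assumptions. */
enum { N = 128, M = 128, P = 3, Q = 3 };

static int kernel[P][Q];
static int input[N + P - 1][M + Q - 1];  /* padded so input[n+p][m+q] stays in bounds */
static int ref[N][M], one_loop[N][M], two_loops[N][M];

/* Fill kernel and input with arbitrary but deterministic values. */
static void fill_test_data(void) {
    for (int p = 0; p < P; ++p)
        for (int q = 0; q < Q; ++q)
            kernel[p][q] = p * Q + q + 1;
    for (int r = 0; r < N + P - 1; ++r)
        for (int c = 0; c < M + Q - 1; ++c)
            input[r][c] = (r * 31 + c * 17) % 101;
}

/* Reference: the four nested loops from the question. */
static void convolve_nested(void) {
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m) {
            int temp = 0;
            for (int p = 0; p < P; ++p)
                for (int q = 0; q < Q; ++q)
                    temp += kernel[p][q] * input[n + p][m + q];
            ref[n][m] = temp;
        }
}

/* One loop over i = ((n*M + m)*P + p)*Q + q. */
static void convolve_one_loop(void) {
    int accumulator = 0;
    for (int i = 0; i < N * M * P * Q; ++i) {
        int n = i;
        int q = n % Q;  n /= Q;
        int p = n % P;  n /= P;
        int m = n % M;  n /= M;
        if (p == 0 && q == 0)
            accumulator = 0;                 /* start of a new output element */
        accumulator += kernel[p][q] * input[n + p][m + q];
        if (p == P - 1 && q == Q - 1)
            one_loop[n][m] = accumulator;    /* sum for this element is complete */
    }
}

/* Two loops: i covers (n, m), j covers (p, q). */
static void convolve_two_loops(void) {
    for (int i = 0; i < N * M; ++i) {
        int n = i / M, m = i % M;
        int accumulator = 0;
        for (int j = 0; j < P * Q; ++j) {
            int p = j / Q, q = j % Q;
            accumulator += kernel[p][q] * input[n + p][m + q];
        }
        two_loops[n][m] = accumulator;
    }
}

/* Returns how many output elements differ from the nested reference. */
int count_mismatches(void) {
    fill_test_data();
    convolve_nested();
    convolve_one_loop();
    convolve_two_loops();
    int bad = 0;
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m)
            if (ref[n][m] != one_loop[n][m] || ref[n][m] != two_loops[n][m])
                ++bad;
    return bad;
}
```

Calling `count_mismatches()` from a `main` and checking that it returns 0 confirms that the index arithmetic is right; it says nothing about speed, which still has to be measured per compiler.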
Amiri
ikegami
  • 3
    I suggest to OP to examine the compiler's auto-vectorization output for this and his original code and compare. – kabanus Sep 24 '17 at 20:17
  • Thanks, I've compared. With this code, gcc and clang do not vectorize, but icc has vectorized the two nested loops! – Amiri Sep 24 '17 at 20:33
  • 1
    Since the number of iterations is 3, it probably won't vectorize, but you could at least factor out the multiplication from the inner loop. (2 loops with the multiply factored out would probably be faster than a single loop anyhow) – technosaurus Sep 24 '17 at 20:35
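The last comment does not spell out what "factoring out the multiplication" would look like. One possible reading, and this is only an assumption about what was meant, is to load each `kernel[p][q]` once and make `m` the innermost loop, so the inner loop runs M = 128 times over contiguous memory instead of 3 times. The sketch below shows that loop interchange with a check against the original loop order. The `int` element type, the padded `input` size, the test data, and the function names are illustrative assumptions.

```c
/* Sizes from the question. Element type and test data are assumptions. */
enum { N = 128, M = 128, P = 3, Q = 3 };

static int kernel[P][Q];
static int input[N + P - 1][M + Q - 1];  /* padded so input[n+p][m+q] stays in bounds */
static int output[N][M];
static int expected[N][M];

/* Fill kernel and input with arbitrary but deterministic values. */
static void fill_test_data(void) {
    for (int p = 0; p < P; ++p)
        for (int q = 0; q < Q; ++q)
            kernel[p][q] = p * Q + q + 1;
    for (int r = 0; r < N + P - 1; ++r)
        for (int c = 0; c < M + Q - 1; ++c)
            input[r][c] = (r * 31 + c * 17) % 101;
}

/* Reference: the loop order from the question. */
static void convolve_nested(void) {
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m) {
            int temp = 0;
            for (int p = 0; p < P; ++p)
                for (int q = 0; q < Q; ++q)
                    temp += kernel[p][q] * input[n + p][m + q];
            expected[n][m] = temp;
        }
}

/* Interchanged order: one kernel coefficient at a time, m innermost. */
static void convolve_interchanged(void) {
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m)
            output[n][m] = 0;
        for (int p = 0; p < P; ++p)
            for (int q = 0; q < Q; ++q) {
                const int k = kernel[p][q];          /* loaded once per (p, q) */
                const int *row = &input[n + p][q];   /* row[m] is input[n+p][m+q] */
                for (int m = 0; m < M; ++m)          /* long, unit-stride inner loop */
                    output[n][m] += k * row[m];
            }
    }
}

/* Returns how many output elements differ from the nested reference. */
int count_interchange_mismatches(void) {
    fill_test_data();
    convolve_nested();
    convolve_interchanged();
    int bad = 0;
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m)
            if (expected[n][m] != output[n][m])
                ++bad;
    return bad;
}
```

Whether a given compiler vectorizes the inner `m` loop, and whether that ends up faster than the other versions, should be confirmed from the compiler's vectorization report and a benchmark rather than assumed.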