How to make a C compiler transform all nested loops into a single loop

Question

Assume there are four nested loops with different loop counters and conditions. Is there any way to tell the compiler (icc, gcc, or clang) to transform all the loops into a single loop?

N=128; M=128; P=3; Q=3; //All these variables are constant
for (n=0; n<N; n++){
    for(m=0; m<M; m++){
        temp=0;
        for(p=0; p<P; p++){ 
            for(q=0; q<Q; q++){
                temp += kernel[p][q] * input[n+p][m+q];
            }
        }
        output[n][m]=temp;
    }
}

To be transformed to:

for(;;)
    //computations...

In my experience this is useful when you rely on auto-vectorization. If there is a way to transform them into two nested loops, that will work as well. Something like the approach that solved this question, but that was done with hand-written code. I have a program that you can see here on godbolt.

How would you transform it to a single loop? What magic are you expecting to happen, somehow to have a compiler reduce N\*M\*P\*Q computations by itself? If you can't, why should it? — kabanus, Sep 24 '17 at 19:54
If it were possible to transform them into two nested loops, that would work too. — Amiri, Sep 24 '17 at 19:55
@MarkWeston Doesn't that usually decrease performance? I think OP wants an increase. — kabanus, Sep 24 '17 at 19:57
No, I wouldn't expect `-funroll-all-loops` to do what the OP asks. In the first place, the "all" in that option is about which loops are *candidates* for unrolling -- they include loops whose number of iterations cannot be completely determined at compile time. Not all loops that are candidates are unrolled. In the second place, if it truly did unroll all loops, then the OP would be left with zero loops, not one. — John Bollinger, Sep 24 '17 at 19:59
@Martin I'm sorry I don't know what can unroll even fuller than `-funroll-all-loops`. — MarkWeston, Sep 24 '17 at 20:01
@Martin, compiler-directed optimizations are for when you don't care about the details of how the compiler generates fast code, and furthermore it is not essential that the compiler produce the fastest code possible. If it is important to you that a specific loop restructuring be performed, then your best bet is to do it yourself, in the source. — John Bollinger, Sep 24 '17 at 20:02
I'm waiting for godbolt to give me a link so I can look at it in each compiler and see what `-funroll-all-loops` does. — Amiri, Sep 24 '17 at 20:02
@JohnBollinger, unfortunately I have to do it myself, but if there were a way to restructure all the loops, I think it would help the auto-vectorizer too. — Amiri, Sep 24 '17 at 20:07
You at least want to make sure the variables `N, M, P, Q` are defined `const`. — aschepler, Sep 24 '17 at 20:24
Just factor out the multiply from the inner loops, optimization will do (most of) the rest unless you can assure alignments and array dimensions. — technosaurus, Sep 24 '17 at 20:38
@technosaurus, could you take a look at the linked question and provide another answer with what you think will work? The arrays are aligned. — Amiri, Sep 24 '17 at 20:50

score 5 · Answer 1 · edited Sep 24 '17 at 20:36

I have no idea why you'd want to, but you can do it manually.

int accumulator;
for (int i=0; i<N*M*P*Q; ++i) {
    int n = i;
    int q = n % Q;  n /= Q;
    int p = n % P;  n /= P;
    int m = n % M;  n /= M;

    if (!p && !q)
       accumulator = 0;

    accumulator += kernel[p][q] * input[n+p][m+q];

    if (p == P-1 && q == Q-1)   // store only after the last kernel term has been added
        output[n][m] = accumulator;
}
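
To see why the single loop visits the same (n, m, p, q) tuples in the same order as the four nested loops, here is a minimal sketch of the index decoding on its own. The helper names `decode` and `decode_matches_nested_order` are hypothetical and not part of the answer; the sizes are the ones given in the question.

```c
#include <assert.h>
#include <stdbool.h>

/* Sizes from the question; enum constants are compile-time constants in C. */
enum { N = 128, M = 128, P = 3, Q = 3 };

/* Hypothetical helper: decode a flat index i in [0, N*M*P*Q) into
   (n, m, p, q), with q varying fastest, as in the single loop above. */
static void decode(int i, int *n, int *m, int *p, int *q)
{
    assert(i >= 0 && i < N * M * P * Q);
    *q = i % Q;  i /= Q;
    *p = i % P;  i /= P;
    *m = i % M;  i /= M;
    *n = i;
}

/* Hypothetical helper: returns true if decoding 0 .. N*M*P*Q-1 yields
   the tuples in exactly the order the four nested loops produce them. */
static bool decode_matches_nested_order(void)
{
    int i = 0;
    for (int n = 0; n < N; n++)
        for (int m = 0; m < M; m++)
            for (int p = 0; p < P; p++)
                for (int q = 0; q < Q; q++, i++) {
                    int dn, dm, dp, dq;
                    decode(i, &dn, &dm, &dp, &dq);
                    if (dn != n || dm != m || dp != p || dq != q)
                        return false;
                }
    return i == N * M * P * Q;
}
```

Note that each iteration pays for three divisions and three remainders, which is one plausible reason a flattened loop like this may not run faster than the nested form.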

Two loops make a little more sense.

for (int i=0; i<N*M; ++i) {
    int n = i / M;
    int m = i % M;

    int accumulator = 0;
    for (int j=0; j<P*Q; ++j) {
        int p = j / Q;
        int q = j % Q;
        accumulator += kernel[p][q] * input[n+p][m+q];
    }    

    output[n][m] = accumulator;
}
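
One way to sanity-check the two-loop version is to run it against the original nested loops on the same data and compare the results. The sketch below assumes `int` elements and an input of size (N+P-1) x (M+Q-1), neither of which the question states; all array and function names are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

enum { N = 128, M = 128, P = 3, Q = 3 };

/* Assumed element type and shapes (not given in the question). */
static int kernel[P][Q];
static int input[N + P - 1][M + Q - 1];
static int out_nested[N][M];
static int out_flat[N][M];

/* Fill kernel and input with small deterministic values. */
static void fill_inputs(void)
{
    for (int p = 0; p < P; p++)
        for (int q = 0; q < Q; q++)
            kernel[p][q] = p * Q + q + 1;
    for (int r = 0; r < N + P - 1; r++)
        for (int c = 0; c < M + Q - 1; c++)
            input[r][c] = (r * 31 + c * 17) % 7 - 3;
}

/* Reference: the four nested loops from the question. */
static void conv_nested(void)
{
    for (int n = 0; n < N; n++)
        for (int m = 0; m < M; m++) {
            int temp = 0;
            for (int p = 0; p < P; p++)
                for (int q = 0; q < Q; q++)
                    temp += kernel[p][q] * input[n + p][m + q];
            out_nested[n][m] = temp;
        }
}

/* The two-loop form: one loop over outputs, one over kernel taps. */
static void conv_two_loops(void)
{
    for (int i = 0; i < N * M; ++i) {
        int n = i / M;
        int m = i % M;
        int accumulator = 0;
        for (int j = 0; j < P * Q; ++j) {
            int p = j / Q;
            int q = j % Q;
            accumulator += kernel[p][q] * input[n + p][m + q];
        }
        out_flat[n][m] = accumulator;
    }
}

/* Returns 1 if both versions produce identical output arrays. */
static int outputs_match(void)
{
    fill_inputs();
    conv_nested();
    conv_two_loops();
    return memcmp(out_nested, out_flat, sizeof out_nested) == 0;
}
```

A check like this only confirms that the results agree. Whether a given compiler vectorizes either form still has to be read off its optimization report or the generated assembly.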

answered Sep 24 '17 at 20:15 by ikegami · edited Sep 24 '17 at 20:36 by Amiri

I suggest that the OP examine the compiler's auto-vectorization output for this and for the original code, and compare. – kabanus Sep 24 '17 at 20:17
Thanks, I've compared. With this, gcc and clang do not vectorize, but icc has vectorized the two nested loops! – Amiri Sep 24 '17 at 20:33
Since the number of iterations is 3 it probably won't vectorize, but you could at least factor out the multiplication from the inner loop. (2 loops with the multiply factored out would probably be faster than a single loop anyhow.) – technosaurus Sep 24 '17 at 20:35
