
Article here:

http://en.wikipedia.org/wiki/Reconfigurable_computing#Example_of_a_streaming_model_of_computation

Example of a streaming model of computation
Problem: We are given two character arrays of length 256, A[] and B[]. We need to compute an array C[] such that C[i] = B[B[B[B[B[B[B[B[A[i]]]]]]]]]. Though this problem is hypothetical, similar problems exist that have practical applications.

Consider a software solution (C code) for the above problem:

// Note: if char is signed, values above 127 would index B[] with a
// negative subscript; unsigned char would be safer here.
for (int i = 0; i < 256; i++) {
        char a = A[i];
        for (int j = 0; j < 8; j++)  // apply the B[] lookup eight times
                a = B[a];
        C[i] = a;
}

This program will take about 256*10*CPI cycles on the CPU (256 outer iterations at roughly 10 instructions each), where CPI is the average number of cycles per instruction.

Could this problem be optimized by an advanced compiler, like Haskell's GHC?

asked by est, edited by Charles

1 Answer


This wiki page doesn't make much sense (and I think this is also noted on its talk page).

The example machine is quite meaningless, as it ignores the fact that to pipeline the accesses you would need a memory that can not only sustain 8 simultaneous requests (coming from the different pipeline stages), but also complete each of them in a single cycle. Banking or splitting the memory in any way wouldn't really work, since all the stages access the same addresses of B.

You could stretch it a bit and say that you've cloned B into 8 different memory units, but then you'd need some more complicated controller to keep them coherent; otherwise you'd only be able to use them for reading. On the other hand, if you had this kind of memory, then the "CPU" they're competing against should be allowed to use it too. Given this banked memory, a modern CPU with out-of-order execution would be able to issue the following instructions, under the same assumption of 1 cycle per load:

  • 1st cycle: load a[i], calculate i+1
  • 2nd cycle: load a[i+1], load b[a[i]], calculate i+2
  • 3rd cycle: load a[i+2], load b[a[i+1]], load b[b[a[i]]], calculate i+3
  • ...

So it would essentially do just as well as the special pipeline they show, even with a basic compiler. Note that a modern CPU can look far ahead in its execution window to find independent operations, but if the compiler does loop unrolling (a basic feature supported in most languages), it can reorder the operations in a way that makes them even easier for the CPU to issue - as in the sketch below.
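To make that concrete, here is a minimal sketch of manual unrolling for this problem (not from the answer: the function name, the unroll factor of 4, and the use of unsigned char are illustrative assumptions; unsigned char also avoids indexing B[] with a negative value when A holds bytes above 127):

void compute_unrolled(const unsigned char A[256],
                      const unsigned char B[256],
                      unsigned char C[256]) {
    for (int i = 0; i < 256; i += 4) {
        // Four independent B[] chains are kept in flight at once, so an
        // out-of-order CPU can overlap their loads - much like the
        // pipeline stages in the wiki example.
        unsigned char a0 = A[i], a1 = A[i + 1], a2 = A[i + 2], a3 = A[i + 3];
        for (int j = 0; j < 8; j++) {
            a0 = B[a0];
            a1 = B[a1];
            a2 = B[a2];
            a3 = B[a3];
        }
        C[i] = a0;
        C[i + 1] = a1;
        C[i + 2] = a2;
        C[i + 3] = a3;
    }
}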

As for your question about compilers - you didn't specify which feature exactly you think could solve this. Generally speaking, these problems are very hard to optimize through a compiler, since you can't mitigate the latency of the memory dependencies: you first have to access a[i]; only then will the CPU have the address to access b[a[i]]; only then the address for b[b[a[i]]]; and so on. There's not much the compiler can do to guess the contents of memory not yet accessed (and even if it did speculate, it wouldn't be smart to use the result for anything practical, as the memory may change by the time the actual load arrives in program order).

This is similar to the problem of "pointer chasing", where you traverse a linked list - the required addresses are not only unknown at compile time, but are also hard to predict at runtime, and may change.
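For illustration, a minimal pointer-chasing sketch (my own example, not from the answer):

// Each load address (p->next) is only known once the current node has
// been fetched, so the traversal is a serial chain of memory latencies
// that neither the compiler nor an out-of-order CPU can overlap.
struct Node {
    int value;
    Node* next;
};

int sum_list(const Node* head) {
    int sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;  // must load *p before the next address is known
    return sum;
}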

I'm not saying this can't be optimized, but it would usually require either some dedicated HW solution (such as the memory banking above) or some fancy speculative algorithm that would be quite limited in its use. There are papers on the topic (mostly on HW prefetching), e.g. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=765944
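As a software flavor of the same idea, here is a hedged sketch using GCC/Clang's __builtin_prefetch (the function name and the prefetch distance of 16 are illustrative assumptions; with 256-byte arrays everything fits in a few cache lines anyway, so this is purely for illustration). Note what it can and can't do: the sequential A[] stream is prefetchable, but the dependent B[...] chain is not, since each of its addresses is unknown until the previous load completes:

void compute_prefetched(const unsigned char A[256],
                        const unsigned char B[256],
                        unsigned char C[256]) {
    for (int i = 0; i < 256; i++) {
        if (i + 16 < 256)
            __builtin_prefetch(&A[i + 16]);  // prefetch the predictable stream
        unsigned char a = A[i];
        for (int j = 0; j < 8; j++)
            a = B[a];                        // the dependent chain stays serial
        C[i] = a;
    }
}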

answered by Leeor
  • thanks for the long explanation! I stumbled upon that wikipedia article and wondered if the example code was something unique to "reconfigurable computing"; turns out it isn't. – est Sep 23 '13 at 08:47
  • Well, "reconfigurable computing" is a very broad term; I'd use it to describe a machine that can repartition its resources according to need. Don't get me wrong - it **does** make sense to reconfigure things dynamically, like network routing on a multi-core chip (http://www.cs.cmu.edu/~phoenix/reconfigurable.html), or even for building pipelines dynamically; it's just not a good example here because of the memory limitations. If they had some compute-intensive code and reconfigured their ALUs, it would make perfect sense. – Leeor Sep 23 '13 at 08:57
  • Don't really understand the question, so I deleted my answer. As for building the expression statically: this `nested::at(a, b, i)` thing expands `nested` templates and inlines `at` functions at compile time until we get `b[b[...deep-th...[a[i]]...]]`, and that's it (a reconstruction is sketched after these comments). Such a trick can be useful if the compiler doesn't do the loop unrolling when we need it. – JJJ Sep 23 '13 at 08:59
  • @JJJ - ok, thanks. I understood this as a question about runtime optimization, assuming the compiler already generated the code correctly. I also edited my answer to include the loop unrolling part you mentioned (it's not imperative on OOO machines, but it helps a lot). – Leeor Sep 23 '13 at 09:05
  • @Leeor so, basically, SDN? – est Sep 23 '13 at 11:08
  • You mean SW-defined networking? Not only that. I'm talking about actual HW reconfiguration; network-on-chip is just one example (on the border between CPU architecture and network architecture). Like I said, you can build HW that reassigns resources internally. Another example is a shared cache - there's some work on partitioning it dynamically between the cores according to need. – Leeor Sep 23 '13 at 11:18
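Since JJJ's deleted answer isn't shown, here is a hedged reconstruction of the compile-time expansion trick described in the comments above (only the call shape `nested::at(a, b, i)` comes from the comment; the template structure, the explicit depth parameter, and the types are guesses):

// nested<Depth>::at recurses at compile time, so nested<8>::at(A, B, i)
// inlines to B[B[B[B[B[B[B[B[A[i]]]]]]]]] with no runtime loop at all.
template <int Depth>
struct nested {
    static unsigned char at(const unsigned char* a, const unsigned char* b,
                            int i) {
        return b[nested<Depth - 1>::at(a, b, i)];
    }
};

template <>  // base case: depth 0 is just the A[] lookup
struct nested<0> {
    static unsigned char at(const unsigned char* a, const unsigned char*,
                            int i) {
        return a[i];
    }
};

// Usage: C[i] = nested<8>::at(A, B, i);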