This wiki page doesn't make much sense (and I think that's also noted on its talk page).
The example machine is fairly meaningless, since it ignores the fact that to pipeline the accesses you would need a memory that can not only sustain 8 simultaneous requests (coming from the different pipeline stages), but also complete each of them in a single cycle. Banking or splitting the memory in any way wouldn't really help, since all the requests go to the same addresses in b.
You could stretch it a bit and say you've cloned b into 8 different memory units, but then you'd need a more complicated controller to keep them coherent; otherwise you could only use them for reads.
On the other hand, if you had this kind of memory, then the "CPU" they're competing against should be allowed to use it too. With such banked memory, a modern out-of-order CPU would be able to issue, for example, the following schedule, under the same assumption of 1 cycle per load:
1st cycle: load a[i], calculate i+1
2nd cycle: load a[i+1], load b[a[i]], calculate i+2
3rd cycle: load a[i+2], load b[a[i+1]], load b[b[a[i]]], calculate i+3
...
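(For concreteness, the loop I'm assuming throughout is something like the sketch below; the array names and the summation are my reconstruction of the wiki example, not taken from it verbatim.)

    /* Hypothetical reconstruction of the loop being discussed: each
       iteration has to chase through two levels of indirection. */
    long chained_sum(const int *a, const int *b, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += b[b[a[i]]];   /* a[i] -> b[a[i]] -> b[b[a[i]]] */
        return sum;
    }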
So it would essentially do just as well as the special pipeline they show, even with a basic compiler. Note that a modern CPU can look far ahead in its execution window to find independent operations, but if the compiler also does loop unrolling (a basic optimization in practically every compiler), it can reorder the operations in a way that makes it even easier for the CPU to issue them.
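For instance, a manually unrolled version of my assumed loop exposes four independent load chains per outer iteration, which the out-of-order core can overlap freely (this is just an illustrative sketch, not the wiki's code):

    long chained_sum_unrolled(const int *a, const int *b, int n)
    {
        long sum = 0;
        int i;
        /* The four chains below don't depend on each other, only on a[] and
           b[], so their loads can be issued back to back. */
        for (i = 0; i + 3 < n; i += 4) {
            sum += b[b[a[i]]];
            sum += b[b[a[i + 1]]];
            sum += b[b[a[i + 2]]];
            sum += b[b[a[i + 3]]];
        }
        for (; i < n; i++)       /* leftover iterations */
            sum += b[b[a[i]]];
        return sum;
    }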
As for your question about compilers - you didn't specify which feature exactly you think could solve this. Generally speaking, these problems are very hard to optimize in a compiler, since you can't mitigate the latency of the memory dependencies. In other words, you first have to access a[i]; only then will the CPU have the address to access b[a[i]], and only then will it have the address for b[b[a[i]]], and so on. There's not much a compiler can do to guess the contents of memory it hasn't accessed yet (and even if it did speculate, it wouldn't be wise to use the result for anything practical, as it may change by the time the actual load arrives in program order).
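Spelled out with temporaries (the names are mine), one iteration of the assumed loop looks like this, and nothing in it can be reordered or hoisted:

    int chase_once(const int *a, const int *b, int i)
    {
        int t1 = a[i];    /* first load                                      */
        int t2 = b[t1];   /* address only known once the first load is done  */
        int t3 = b[t2];   /* address only known once the second load is done */
        return t3;
    }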
This is similar to the problem of "pointer chasing", where you traverse a linked list - the required addresses are not only unknown at compile time, they're also hard to predict at runtime and may change.
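In code, the linked-list case is the same serialization problem; the address of the next node is stored inside the current one, so every load waits for the previous one (generic example, not from the question):

    struct node {
        int          value;
        struct node *next;
    };

    long sum_list(const struct node *head)
    {
        long sum = 0;
        /* p->next for the following iteration is only known after the
           current node has been fetched from memory. */
        for (const struct node *p = head; p != NULL; p = p->next)
            sum += p->value;
        return sum;
    }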
I'm not saying this can't be optimized, but it would usually require some dedicated HW solution (such as the memory banking above) or some fancy speculative algorithm that would be quite limited in its use. There are papers on the topic (mostly HW prefetching); see for example http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=765944
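On the software side, about the best you can do is insert prefetch hints, and even those are of limited value here because the prefetch address comes out of the same dependency chain. A rough sketch using GCC/Clang's __builtin_prefetch (process_node is a made-up stand-in for per-node work, and it reuses struct node from above):

    void process_node(const struct node *p);   /* hypothetical per-node work */

    void walk_list(const struct node *head)
    {
        for (const struct node *p = head; p != NULL; p = p->next) {
            /* Hint the hardware to start fetching the next node early. The
               address still comes from a dependent load, so this only helps
               if process_node() gives the prefetch time to complete. */
            __builtin_prefetch(p->next, 0, 1);
            process_node(p);
        }
    }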