I'm bandwidth-constrained in a parallel computing application, and I've profiled the program during execution. As expected, the critical data sits in one contiguous block, but the memory dumps show it always ends up entirely or mostly on one stick of RAM. This would be fine if we had 60 GB/s RAM, but we don't.
Someone must have solved the problem of spreading an allocation across multiple memory channels. It's far too common a problem in HPC.