I am trying to hide global memory latency. Consider the following loop:
```cuda
for (int i = 0; i < N; i++) {
    x = global_memory[i];
    // ... do some computation on x ...
    global_memory[i] = x;
}
```
I want to know whether loads from and stores to global memory are blocking, i.e., whether the thread waits for the load or store to finish before executing the next line. For example, consider this version, which loads the next element one iteration ahead:
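```cuda
x_next = global_memory[0];              // assumes N >= 1
for (int i = 0; i < N; i++) {
    x = x_next;
    if (i + 1 < N)                      // avoid reading past the end on the last iteration
        x_next = global_memory[i + 1];  // load the element for the next iteration
    // ... do some computation on x ...
    global_memory[i] = x;
}
```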
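For concreteness, here is a compilable sketch of both loops as CUDA kernels. It rests on assumptions of my own: each thread walks a grid-stride range rather than the whole array, the element type is `unsigned`, and `compute`, `naive` and `prefetched` are hypothetical names, with `compute` standing in for the elided computation. Note that nvcc may reorder independent loads by itself, so the two kernels are not guaranteed to end up with different instruction schedules.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for "... do some computation on x ...".
// Unsigned arithmetic wraps on overflow, so this is well defined.
__host__ __device__ unsigned compute(unsigned x) {
    for (int k = 0; k < 8; ++k) x = x * 3u + 1u;
    return x;
}

// First loop: each loaded value is used immediately after the load.
__global__ void naive(unsigned* data, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        unsigned x = data[i];
        data[i] = compute(x);
    }
}

// Second loop: the load for this thread's next element is issued
// before the computation on the current element.
__global__ void prefetched(unsigned* data, int n) {
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned x_next = data[i];
    for (; i < n; i += stride) {
        unsigned x = x_next;
        if (i + stride < n) x_next = data[i + stride];  // guard: no read past the end
        data[i] = compute(x);
    }
}

int main() {
    const int n = 1 << 20;
    unsigned *a = nullptr, *b = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(unsigned)) != cudaSuccess) return 1;
    if (cudaMallocManaged(&b, n * sizeof(unsigned)) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) a[i] = b[i] = (unsigned)i;

    naive<<<256, 256>>>(a, n);
    prefetched<<<256, 256>>>(b, n);
    cudaDeviceSynchronize();  // wait for both kernels before reading on the host

    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (a[i] != b[i]);
    printf("mismatches: %d\n", mismatches);

    cudaFree(a);
    cudaFree(b);
    return mismatches != 0;
}
```

Both kernels compute the same values, so the program only checks that the prefetching version is still correct; whether it is faster is exactly the question above.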
In this code, x_next is not used until the next iteration, so does the load of x_next overlap with the computation on x? In other words, which of the following figures shows what actually happens?