I have code that reads from a ~4GB region of memory. Each request fetches 1024 bits from a random location anywhere in the 4GB, and the requests are issued one after another. I have a Radeon VII with 16GB of HBM2 on a 4096-bit bus.
Possible optimization 1: keep the 4GB and fetch 4x the data per memory request. (This doesn't work, because the result of the first request determines the address of the second, so the data needed for the second request may be far away in memory.)
Possible optimization 2: use four separate 4GB regions (4+4+4+4GB) with 1x data per memory request in each. (This doesn't improve performance, because requests to one 4GB region delay the others, so I end up with 4 threads running at 0.25x performance each.)
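The access pattern described above can be sketched on the CPU side in C++. This is only an illustrative model, not the original code and not GPU code: the `Block` layout (a next-index stored in the first 8 bytes of each 1024-bit block) and the names `make_table`, `chase`, and `chase_interleaved` are assumptions made up for this example. `chase` models the single dependent chain, where each request's result gives the address of the next; `chase_interleaved` models several independent request streams, as in optimization 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One request = 1024 bits = 128 bytes.
// Hypothetical layout: the first 8 bytes hold the index of the next block.
struct Block {
    std::uint64_t next;          // index of the block the next request must read
    std::uint8_t  payload[120];  // rest of the 1024-bit request
};
static_assert(sizeof(Block) == 128, "one block must be exactly 1024 bits");

// Build a small table where block i points to (i + stride) % n.
// A real workload would use a random permutation; this is a stand-in.
std::vector<Block> make_table(std::size_t n, std::size_t stride) {
    std::vector<Block> table(n);
    for (std::size_t i = 0; i < n; ++i) {
        table[i].next = (i + stride) % n;
    }
    return table;
}

// Single dependent chain: request k+1 cannot be issued until request k
// has returned, because its address comes from the data of request k.
std::uint64_t chase(const std::vector<Block>& table,
                    std::uint64_t start, std::size_t steps) {
    std::uint64_t idx = start;
    for (std::size_t s = 0; s < steps; ++s) {
        idx = table[idx].next;
    }
    return idx;
}

// Several independent chains advanced in lockstep. Within one round the
// loads do not depend on each other, so they could in principle overlap.
std::vector<std::uint64_t> chase_interleaved(const std::vector<Block>& table,
                                             std::vector<std::uint64_t> idx,
                                             std::size_t steps) {
    assert(!idx.empty());
    for (std::size_t s = 0; s < steps; ++s) {
        for (auto& i : idx) {
            i = table[i].next;
        }
    }
    return idx;
}
```

Whether the independent loads in `chase_interleaved` actually overlap depends on the memory controller and on how the regions map to memory channels, which this sketch does not model.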
Questions:
For optimization 1: is it possible to split the 4096-bit bus so that I can fetch different 1024-bit areas of memory in parallel, in a non-blocking way?
For optimization 2: is it possible to address 4GB 'blocks' in parallel, so that each block is independent of the others and does not block them?
PS: I know this depends on the memory controller, so if you know of different hardware that can do this, please let me know too.