I have a multidimensional array stored in device memory. I want to "permute" (or "transpose") it, that is, rearrange its elements according to a new order of dimensions.
For example, if I have a 2D array
A = [0, 1, 2
3, 4, 5]
I want to change the order of the dimensions so that I get
B = [0, 3
1, 4
2, 5]
In practice, this reordering copies the elements, which are stored in memory in the order [0,1,2,3,4,5], and returns them in the new order [0,3,1,4,2,5].
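To make the mapping concrete, here is a minimal CPU reference in plain C++. It is only a sketch, and it assumes row-major storage and `int` elements. The name `permute_dims` and its parameters are hypothetical helpers I made up for illustration: `dims` holds the input extents, and `perm[k]` says which input dimension becomes output dimension `k`. It is not a tuned device implementation, only the index arithmetic written out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): permute the dimensions of a
// row-major array. Output dimension k corresponds to input dimension perm[k].
std::vector<int> permute_dims(const std::vector<int>& in,
                              const std::vector<std::size_t>& dims,
                              const std::vector<std::size_t>& perm) {
    const std::size_t rank = dims.size();
    assert(perm.size() == rank);

    // Row-major strides of the input: the last dimension is contiguous.
    std::vector<std::size_t> in_stride(rank, 1);
    for (std::size_t k = rank; k-- > 1;) {
        in_stride[k - 1] = in_stride[k] * dims[k];
    }

    // Total number of elements is the product of all extents.
    std::size_t total = 1;
    for (std::size_t d : dims) {
        total *= d;
    }
    assert(in.size() == total);

    std::vector<int> out(total);
    // Every output element is computed independently of the others.
    for (std::size_t o = 0; o < total; ++o) {
        std::size_t rem = o;   // remaining part of the output linear index
        std::size_t src = 0;   // linear index into the input
        // Peel off output coordinates, fastest-varying (last) dimension first,
        // and accumulate the matching input offset.
        for (std::size_t k = rank; k-- > 0;) {
            const std::size_t extent = dims[perm[k]];
            const std::size_t coord = rem % extent;
            rem /= extent;
            src += coord * in_stride[perm[k]];
        }
        out[o] = in[src];
    }
    return out;
}
```

For the 2D example above, `permute_dims({0,1,2,3,4,5}, {2,3}, {1,0})` yields the order [0,3,1,4,2,5]. Since each iteration of the outer loop reads one input element and writes one output element without depending on any other iteration, the same arithmetic could be done by one thread per output element; what this sketch does not address is how to make the resulting memory accesses efficient.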
I know how to map the indices from A to B. My question is: how can I execute this mapping efficiently on the device using CUDA?