I am doing motion compensation between two images (3840x2160) with a block size of 16.
The kernel's global size is 3840 x 135 (135 = 2160 / 16), with a work-group size of 64x1 or 128x1 (basically no difference between the two).
Right now my kernel reads char data from global memory, but the source address imagepos = src + mv.xy
is not aligned, so I have to read the chars one by one. I think the latency of these reads is the bottleneck; CodeXL also shows that the kernel is not limited by GPRs. So I need to find a way to speed up the data reads. I would also like to know how to make use of local memory when each value is needed only once.
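
To make the access pattern concrete, here is a rough host-side C sketch of what one work-item does. It is not the real kernel: the one-column-per-work-item mapping, the clamping at the image edges, and the names (compensate_column, source_misalignment, MotionVector) are all assumptions made for illustration. The second function shows why the source address is usually not 4-byte aligned, assuming the src base pointer and the row width are both multiples of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };                       /* block size from the post */

typedef struct { int x, y; } MotionVector; /* hypothetical stand-in for mv.xy */

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Emulates one work-item with global id (gx, gy), where gx is a pixel
   column and gy is a block row (0 .. height/BLOCK - 1). It copies the 16
   pixels of that column one byte at a time, which is the slow pattern. */
static void compensate_column(const uint8_t *src, uint8_t *dst,
                              int width, int height,
                              int gx, int gy, MotionVector mv)
{
    assert(gx >= 0 && gx < width);
    assert(gy >= 0 && gy < height / BLOCK);

    for (int row = 0; row < BLOCK; ++row) {
        int y  = gy * BLOCK + row;
        /* Assumption: displaced coordinates are clamped to the image. */
        int sx = clamp_int(gx + mv.x, 0, width - 1);
        int sy = clamp_int(y + mv.y, 0, height - 1);
        dst[(size_t)y * width + gx] = src[(size_t)sy * width + sx];
    }
}

/* Byte misalignment (0..3) of the first source pixel of a block relative
   to a 4-byte boundary. Any nonzero result rules out a plain aligned
   4-byte load at that address. */
static int source_misalignment(int block_x, MotionVector mv)
{
    int offset = block_x * BLOCK + mv.x;
    return ((offset % 4) + 4) % 4;         /* keep the result non-negative */
}
```

With 16-pixel blocks, the block origin itself is always aligned, so the misalignment comes entirely from mv.x: only motion vectors whose x component is a multiple of 4 give an aligned source address.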
Any suggestions would be appreciated.