With reference to this question, is it possible that the gain in performance obtained by vectorized operations is being offset by the explicit conversions with convert_T()? Note that the default type of the variable is unsigned char. I am using OpenCL 2.0. My GPU is Intel HD Graphics 530 (Gen9).
Would it make a difference whether I use convert_int4() or convert_short4()?
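To make the comparison concrete, here is a minimal host-side sketch in plain C of what the two conversions do to unsigned char data. It is an emulation, not OpenCL kernel code: the struct types and the `emu_` functions are hypothetical stand-ins for `uchar4`, `short4`, `int4`, `convert_short4()` and `convert_int4()`. It shows only that both widenings are lossless for values in 0..255 and that the results differ in size (8 bytes versus 16 bytes per vector). It says nothing about actual timing on Gen9, which would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-ins for OpenCL's uchar4, short4 and int4. */
typedef struct { uint8_t s[4]; } uchar4_host;
typedef struct { int16_t s[4]; } short4_host;
typedef struct { int32_t s[4]; } int4_host;

/* Emulates convert_short4(uchar4): each lane is widened to 16 bits. */
short4_host emu_convert_short4(uchar4_host v) {
    short4_host r;
    for (size_t i = 0; i < 4; ++i) {
        r.s[i] = (int16_t)v.s[i];
    }
    return r;
}

/* Emulates convert_int4(uchar4): each lane is widened to 32 bits. */
int4_host emu_convert_int4(uchar4_host v) {
    int4_host r;
    for (size_t i = 0; i < 4; ++i) {
        r.s[i] = (int32_t)v.s[i];
    }
    return r;
}

/* Returns 1 when both widenings hold the same lane values, else 0. */
int widenings_agree(uchar4_host v) {
    short4_host a = emu_convert_short4(v);
    int4_host b = emu_convert_int4(v);
    for (size_t i = 0; i < 4; ++i) {
        if ((int32_t)a.s[i] != b.s[i]) {
            return 0;
        }
    }
    return 1;
}

/* Checks the extremes of the unsigned char range and the vector sizes. */
void self_check(void) {
    uchar4_host edge = {{0, 1, 128, 255}};
    assert(widenings_agree(edge));
    assert(sizeof(short4_host) == 8);   /* 4 lanes x 2 bytes */
    assert(sizeof(int4_host) == 16);    /* 4 lanes x 4 bytes */
}
```

Since every unsigned char value fits in a 16-bit signed lane, the choice between the two targets is about the width of the intermediate results (and whether later arithmetic could overflow 16 bits), not about the converted values themselves.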