I have a simple kernel:
__kernel void vecadd(__global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
Why does the kernel run more than 30% slower when I change float to float4?
All the tutorials say that using vector types speeds up computation...
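For reference, the float4 variant would look roughly like the sketch below. The kernel name vecadd4 is only illustrative and does not come from the original code; the sketch assumes nothing changes except the element type, so each work-item handles four consecutive floats.

```opencl
// Sketch of the float4 variant; "vecadd4" is an illustrative name.
__kernel void vecadd4(__global const float4 *A,
                      __global const float4 *B,
                      __global float4 *C)
{
    // One work-item loads, adds and stores four packed floats (16 bytes).
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
```

Both versions do one addition per pair of loaded floats, so the kernel is dominated by global memory traffic rather than arithmetic.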
On the host side, the memory allocated for the float4 arguments is 16-byte aligned, and the global_work_size passed to clEnqueueNDRangeKernel is 4 times smaller.
The kernel runs on an AMD HD5770 GPU with AMD-APP-SDK-v2.6.
The device info query for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time is measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For a smaller global_work_size (8196 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.