
I have a simple kernel:

__kernel void vecadd(__global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}

Why, when I change float to float4, does the kernel run more than 30% slower?

All the tutorials say that using vector types speeds up computation...

On the host side, the memory allocated for the float4 arguments is 16-byte aligned, and the global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
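For reference, the host-side launch of the float4 version looks roughly like this (a simplified sketch, not my exact code; names and error handling are illustrative):

#include <CL/cl.h>

/* Sketch: enqueue the float4 vecadd. Context, queue, program and
   buffers are assumed to be created already (buffers 16-byte aligned). */
void enqueue_vecadd4(cl_command_queue queue, cl_kernel kernel,
                     cl_mem bufA, cl_mem bufB, cl_mem bufC,
                     size_t n_floats, cl_event *event)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

    /* Each work-item now handles one float4 (4 floats), so the
       global size is 4 times smaller than for the float kernel. */
    size_t global_work_size = n_floats / 4;
    size_t local_work_size  = 256;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_work_size, &local_work_size,
                           0, NULL, event);
}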

Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.

Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.

EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time is measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END (read-out sketched below).

For a smaller global_work_size (8192 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.
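The timing read-out, for completeness (variable names here are illustrative; the queue is created with CL_QUEUE_PROFILING_ENABLE):

cl_ulong t_start, t_end;
clWaitForEvents(1, &event);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
double elapsed_ms = (t_end - t_start) * 1e-6; /* timestamps are in ns */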

ldanko
    What are the values of global work size and workgroup size? What time are you measuring, and how? – Eric Bainville Jan 20 '12 at 04:44
  • global work size = 1024*1024, local work size = 256. I measure the time of clEnqueueNDRangeKernel using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. For a smaller global_work_size (8192 for float / 2048 for float4), the vectorized version is faster, but I would like to know why. – ldanko Jan 21 '12 at 18:06
  • The difference between the smaller and larger work sizes may be due to your constant cache. So, two questions: 1) if you remove the const, is it still faster for the small size and slower for the large one? 2) if you go somewhere in between, say 65536 for float and 16384 for float4, what happens then? – user1111929 Feb 07 '12 at 16:57

2 Answers


I don't know which tutorials you are referring to, but they must be old. Both ATI and NVIDIA have used scalar GPU architectures for at least half a decade now. Nowadays, using vectors in your code is only a syntactical convenience; it bears no performance benefit over plain scalar code. It turns out that a scalar architecture is better for GPUs than a vectored one: it is better at utilizing the hardware resources.
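To illustrate (a sketch, not your compiler's actual output): on a scalar architecture the compiler simply splits the float4 add into four independent scalar adds per work-item, so the vector form has nothing to win:

// What the float4 kernel amounts to on a scalar GPU; the compiler
// performs this split internally, it is written out here only for
// illustration.
__kernel void vecadd4_scalarized(__global const float4 *A,
                                 __global const float4 *B,
                                 __global float4 *C)
{
    int idx = get_global_id(0);
    C[idx].x = A[idx].x + B[idx].x;  // four scalar adds, the same work
    C[idx].y = A[idx].y + B[idx].y;  // the plain float kernel spreads
    C[idx].z = A[idx].z + B[idx].z;  // over four work-items
    C[idx].w = A[idx].w + B[idx].w;
}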

lucho

I am not sure why the vectors would be that much slower for you without knowing more about your workgroup and global sizes; I would expect at least the same performance.

If it is suitable for your kernel, can you start with C already holding the values of A? This would cut memory access down by 33%. Maybe this applies to your situation?

__kernel void vecadd(__global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    C[idx] += B[idx];
}

Also, have you tried reading the values into a private vector first, then adding? Or maybe both strategies combined; see the sketch after the kernel below.

__kernel void vecadd(__global const float4 *A,
                     __global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    float4 tmp = A[idx] + B[idx];
    C[idx] = tmp;
}
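And a sketch of the two ideas combined (assuming, as above, that the host fills C with A's values first):

__kernel void vecadd(__global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    float4 tmp = C[idx] + B[idx]; /* read both operands into private memory */
    C[idx] = tmp;                 /* single write back */
}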
mfa