OpenCL nested loop misalignment

Question

I'm trying to use the GPU for some image processing. In my kernel function I caught a "misalignment" exception, described as:

The thread tried to read or write data that is misaligned on hardware that does not provide alignment. For example, 16-bit values must be aligned on 2-byte boundaries; 32-bit values on 4-byte boundaries, and so on.
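
To make the alignment rule in the message concrete, here is a minimal host-side C++ sketch. It is not part of the original program, and `is_aligned` is a hypothetical helper name chosen for illustration; it only checks whether an address falls on a given power-of-two boundary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original code): returns true when `ptr`
// lies on an `alignment`-byte boundary.
// Assumption: `alignment` is a power of two (2, 4, 8, ...).
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

For example, an address one byte past a 4-byte boundary would fail the check for a 16-bit or 32-bit access, which is the situation the exception text describes.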

I reduced the kernel code to just the loops, but the problem persists. My reduced kernel function:

__kernel void TestKernel(
    global const uchar* iImage,
    global uchar* oImage,
    uint width,
    uint heigth,
    uchar dif,
    float power)
{
    uint y = get_global_id(0);

    if (y >= heigth)
        return;

    for (uint x = 0; x < width; ++x) {
        for (uint i = 0; i < 5; ++i) {
            uint sum = 0;
            for (uint j = 0; j < 5; ++j) {
                sum += 3;
            }
        }
    }
}

(The program throws the exception in the second loop.)

I'm using the C++ wrapper to call my kernel:

kernel.setArg(iArg++, iImage);
kernel.setArg(iArg++, oImage);
kernel.setArg(iArg++, header.GetVal(header.Width));
kernel.setArg(iArg++, header.GetVal(header.Height));
kernel.setArg(iArg++, (unsigned char)10);
kernel.setArg(iArg++, saturation);

queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(header.GetVal(header.Height)), cl::NDRange(128));
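
One thing that may be worth checking in the launch above, independent of the misalignment exception: in OpenCL 1.x the global work size has to be a multiple of the local work size, otherwise the enqueue fails with CL_INVALID_WORK_GROUP_SIZE. The kernel's `y >= heigth` guard already tolerates extra work-items, so the global size can be rounded up. A minimal sketch, assuming a local size of 128 as in the call above; `round_up` is a hypothetical helper and not part of the OpenCL API:

```cpp
#include <cstddef>

// Hypothetical helper (not an OpenCL function): smallest multiple of
// `multiple` that is greater than or equal to `value`.
// Assumption: multiple > 0.
inline std::size_t round_up(std::size_t value, std::size_t multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}
```

With this, the global range would be built from `round_up(height, 128)`, where `height` stands for the value returned by `header.GetVal(header.Height)`. Whether this has any bearing on the exception is not established here.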

oImage and iImage are cl::Buffer

saturation is float

header.GetVal() returns int

I'm using Visual Studio 2015 with the CodeXL plugin and running the program on an AMD Spectre (Radeon R7).

What can cause this problem?

If you are interpreting the char array as floats, then you are responsible for loading from aligned addresses. Maybe you should move individual bytes from global memory to private or local memory and convert them there. Maybe you could first load the misaligned bytes, then load the rest as aligned, and combine them in private memory? - huseyin tugrul buyukisik, Apr 19 '17 at 21:21
@huseyintugrulbuyukisik Can you please point out where exactly I treat char arrays as floats? - MasterUZ, Apr 19 '17 at 22:54
If you aren't using vload/vstore for floats or ints from oImage or iImage, then the driver must be buggy, since nothing in the second loop loads or stores anything global. - huseyin tugrul buyukisik, Apr 19 '17 at 23:40
How did you determine that the problem is in the second loop? To my mind the loop will be optimized out because it isn't doing anything interesting, just incrementing a temporary that isn't used anywhere. I think the problem may lie in how you create the oImage and iImage buffers, which you don't currently use in the kernel but which are still passed to it. - doqtor, Apr 20 '17 at 07:07
@doqtor 1) In debug mode I stepped through the kernel, and the exception occurred at the initialization of the second loop. 2) The project was built in debug mode with the default optimization level. In the non-reduced kernel (which did some work on the buffers) the problem was at the same line of code. 3) The buffers were built like `cl::Buffer iImage = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, bmpInStream, &err);` where `size` is the correct int size of the char array. `err` also indicates a successful operation. What in this buffer can cause a problem? - MasterUZ, Apr 20 '17 at 09:17
What the debugger is showing you doesn't seem to make sense. I would rather use `printf` and eliminate parts of the code step by step to nail down the problem. - doqtor, Apr 20 '17 at 10:16
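
The suggestion in the first comment (copy the individual bytes, then convert) can be illustrated with a small host-side C++ sketch. This is only an assumption about what such a byte-wise load might look like, not code from the original program; `load_u32_unaligned` is a hypothetical name.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original code): read a 32-bit value from
// a byte buffer at an arbitrary offset. Copying the bytes with memcpy avoids
// dereferencing a possibly misaligned uint32_t pointer.
inline std::uint32_t load_u32_unaligned(const unsigned char* bytes) {
    std::uint32_t value;
    std::memcpy(&value, bytes, sizeof(value));
    return value;
}
```

The posted kernel performs no such reinterpretation, so this only applies if the full, non-reduced kernel reads wider types out of the `uchar` buffers.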

