
I need to allocate as many struct Things as GPGPU memory allows and invoke the kernel for every struct Thing.

OpenCL disallows allocating all of CL_DEVICE_GLOBAL_MEM_SIZE at once - you can allocate at most CL_DEVICE_MAX_MEM_ALLOC_SIZE per single allocation. The latter is typically four times smaller than the total memory, so I decided to create 4 buffers.
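
Roughly, the host-side allocation looks like this (a simplified sketch of what I do; ctx, dev and thingSize are placeholders for my real context, device and sizeof( struct Thing )):

// ctx, dev, thingSize are placeholders for my actual context, device and element size
cl_ulong globalMem = 0, maxAlloc = 0;
clGetDeviceInfo( dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof( globalMem ), & globalMem, NULL );
clGetDeviceInfo( dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof( maxAlloc ), & maxAlloc, NULL );

cl_mem bufs[ 4 ];
cl_ulong counts[ 4 ];
cl_int err;
for ( int i = 0; i < 4; ++i )
{
    counts[ i ] = maxAlloc / thingSize; // how many struct Things fit into one allocation
    bufs[ i ] = clCreateBuffer( ctx, CL_MEM_READ_WRITE, counts[ i ] * thingSize, NULL, & err );
}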

Also, you cannot use pointers to pointers, either inside OpenCL kernels or when passing arguments to kernels from the host, so you can't pass an array of buffers to the kernel (since every buffer is a pointer to the first struct Thing in an array).
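
So the counts and the buffers go in as eight separate kernel arguments on the host, something like this (again a sketch, continuing from the bufs/counts arrays above; kernel is my cl_kernel):

for ( int i = 0; i < 4; ++i )
{
    clSetKernelArg( kernel, 2 * i, sizeof( cl_ulong ), & counts[ i ] ); // tNCount
    clSetKernelArg( kernel, 2 * i + 1, sizeof( cl_mem ), & bufs[ i ] ); // tN
}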

So far my kernel code looks like this:

kernel void workWithThings(
    const ulong t1Count,
        global struct Thing * t1,
    const ulong t2Count,
        global struct Thing * t2,
    const ulong t3Count,
        global struct Thing * t3,
    const ulong t4Count,
        global struct Thing * t4
)
{
    private ulong gid = get_global_id( 0 );
    global struct Thing * t; // will point at this work-item's Thing in one of the four buffers
    // map the flat global id onto the buffer it falls into, turning gid into an index within that buffer
    if ( gid >= t1Count )
    {
        gid -= t1Count;
        if ( gid >= t2Count )
        {
            gid -= t2Count;
            if ( gid >= t3Count )
            {
                gid -= t3Count;
                t = & t4[ gid ];
            }
            else
            {
                t = & t3[ gid ];
            }
        }
        else
        {
            t = & t2[ gid ];
        }
    }
    else
    {
        t = & t1[ gid ];
    }

    //do the actual work:
    //t->...
}
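
On the host I then launch one work-item per struct Thing across all four buffers (just a sketch; queue is my command queue):

size_t total = ( size_t )( counts[ 0 ] + counts[ 1 ] + counts[ 2 ] + counts[ 3 ] );
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, & total, NULL, 0, NULL, NULL );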

Is this really the only way to do it? I feel very stupid writing code like this. Please help.

Slaus
  • Why do you need to allocate all the memory? Can't you perform processing in batches? Which CL driver are you using? If you have CL 2.0 you could use pointers. – kanna Jul 09 '17 at 18:30
  • @kanna The task is like a neural network app where the more neurons there are, the better, so I want to fill all the memory. The host injects several bytes of input data 10 times per second, the GPGPU chews on it with its neurons as many times as it can, and outputs several bytes, also 10 times per second. That global memory is never intended to be loaded to/from the host. Currently it's OpenCL 1.1 :( – Slaus Jul 09 '17 at 18:38
  • A GPU is better when compute and copy are overlapped. This means it will perform better with multiple executions in flight, unless there is no host interaction. – huseyin tugrul buyukisik Jul 09 '17 at 19:55
  • @huseyintugrulbuyukisik I'll try to overlap GPGPU<->host reads/writes with computation if I can. Thank you for pointing that out. – Slaus Jul 09 '17 at 20:03

0 Answers