Use HOST as a DEVICE

Question

I work with OpenCL, and I only have an i3 core Duo CPU, so my only device is the CPU itself. I therefore guess that my HOST (the CPU) will also be the DEVICE. I tried to launch a kernel, but the task assigned to the DEVICE (which is also the HOST) never terminates. After thinking about it, it seems obvious that the HOST waiting for the DEVICE (itself) to finish cannot work. Does anyone know a way to overcome this issue? Maybe using clCreateSubDevices to subdivide my only device into a host and a real device?

Without any code it's difficult to say, but normally you should be able to use the CPU as a device without any special "hoops". — Aderstedt, Mar 02 '16 at 06:09
There is no such thing as an i3 Core Duo; there are Intel Core i3 CPUs with two cores, and there are Intel Core Duo CPUs. Please be more specific about exactly which CPU you are using. I also think that you likely have a different problem with your code. Many Core i3 CPUs also contain a GPU, in which case host and device use separate parts of the chip. And even if you use your CPU as a device, the code will run in a separate thread. – Jan Lucas, Mar 02 '16 at 19:41
Regardless of actual CPU and core count, most OpenCL CPU drivers use threads for the DEVICE work, so the HOST thread can continue doing what it does while the computation and other device activities continue. You are not required to call a blocking API for work to happen. — Dithermaster, Mar 02 '16 at 20:55
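
The threading behaviour described in these comments can be pictured with a plain-Java analogy. This is only a sketch and does not use OpenCL: a single-thread `ExecutorService` stands in for the CPU driver's worker thread, `submit` stands in for a non-blocking enqueue, and `Future.get` stands in for a blocking call. The names `HostDeviceAnalogy`, `enqueueVectorAdd` and `addAndWait` are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HostDeviceAnalogy {

    // Hypothetical stand-in for a non-blocking enqueue: the work is handed to
    // a worker thread (the "device") and the caller (the "host") returns at once.
    public static Future<float[]> enqueueVectorAdd(ExecutorService device, float[] a, float[] b) {
        return device.submit(() -> {
            float[] c = new float[a.length];
            for (int i = 0; i < a.length; i++) {
                c[i] = a[i] + b[i];
            }
            return c;
        });
    }

    // Enqueue the work, then block until it is done (the analogue of a blocking call).
    public static float[] addAndWait(float[] a, float[] b) {
        ExecutorService device = Executors.newSingleThreadExecutor();
        try {
            Future<float[]> pending = enqueueVectorAdd(device, a, b);
            // The host thread is free to do unrelated work at this point.
            return pending.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            device.shutdown();
        }
    }

    public static void main(String[] args) {
        float[] c = addAndWait(new float[] {1f, 2f, 3f}, new float[] {10f, 20f, 30f});
        System.out.println("c[2] = " + c[2]);
    }
}
```

The point of the analogy is that the "host" and the "device" share the same physical CPU yet still run on separate threads, so the host waiting for the device is not a deadlock.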

score 1 · Answer 1 · edited Mar 04 '16 at 13:54

You will find my Java code below, so that you can point out my mistake. When I run it without the clFinish(commandQueue); call (near the bottom of the code), I get the following output:

I use the platform Intel(R) OpenCL
Enqueuing kernels...
Pause for 15000 ms.
Task INCOMPLETE

If I add clFinish(commandQueue), the callback output appears and the task completes:

I use the platform Intel(R) OpenCL
Enqueuing kernels...
Event kernel status: CL_COMPLETE event ID: 10 runtime: 2.631ms
Pause for 15000 ms.
Task COMPLETE

So why does the single clFinish() call allow the task to complete? Thank you in advance for an explanation.

import static org.jocl.CL.*;

import java.util.Arrays;

import org.jocl.*;

public class Test_CPU
{


    private static String programSource0 =
        "__kernel void vectorAdd(" +
        "     __global const float *a,"+
        "     __global const float *b, " +
        "     __global float *c)"+
        "{"+
        "    int gid = get_global_id(0);"+
        "    c[gid] = a[gid]+b[gid];"+
        "}";

    /**
     * The entry point of this sample
     *
     * @param args Not used
     */
    public static void main(String args[])
    {
        /**
        * Callback invoked when the event reaches the status event_status; it prints the time from enqueue to completion of the kernel in milliseconds
        * @param event:        the event
        * @param event_status: status of the event
        * @param user_data:    data given by the user is an integer tag that can be used to match profiling output to the associated kernel
        * @return:             none
        */
        EventCallbackFunction kernelCommandEvent = new EventCallbackFunction()
        {
            @Override
            public void function(cl_event event, int event_status, Object user_data)
            {
                int evID = (int)user_data;
                long[] ev_start_time = new long[1];
                Arrays.fill(ev_start_time, 0);
                long[] ev_end_time = new long[1];
                Arrays.fill(ev_end_time, 0);
                long[] return_bytes = new long[1];
                double run_time = 0.0;

                clGetEventProfilingInfo (event, CL_PROFILING_COMMAND_QUEUED, Sizeof.cl_long, Pointer.to(ev_start_time), return_bytes);
                clGetEventProfilingInfo (event, CL_PROFILING_COMMAND_END   , Sizeof.cl_long, Pointer.to(ev_end_time), return_bytes);

                run_time = (double)(ev_end_time[0] - ev_start_time[0]);
                System.out.println("Event kernel status: " + CL.stringFor_command_execution_status(event_status) + " event ID: " + evID + " runtime: " + String.format("%8.3f", (run_time*1.0e-6)) + " ms.");
            }
        };

        // Initialize the input data
        int n = 1000000;
        float srcArrayA[] = new float[n];
        float srcArrayB[] = new float[n];
        float dstArray0[] = new float[n];

        for (int i=0; i<srcArrayA.length; i++)
        {
            srcArrayA[i] = i;
            srcArrayB[i] = i;
        }
        Pointer srcA = Pointer.to(srcArrayA);
        Pointer srcB = Pointer.to(srcArrayB);
        Pointer dst0 = Pointer.to(dstArray0);

        // The platform, device type and device number that will be used
        final int platformIndex = 1;
        final long deviceType = CL_DEVICE_TYPE_CPU;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        long size[] = new long[1];
        clGetPlatformInfo(platform, CL_PLATFORM_NAME, 0, null, size);
        // Create a buffer of the appropriate size and fill it with the info
        byte buffer[] = new byte[(int)size[0]];
        clGetPlatformInfo(platform, CL_PLATFORM_NAME, buffer.length, Pointer.to(buffer), null);
        // Create a string from the buffer (excluding the trailing \0 byte)
        System.out.println("I use the platform " +  new String(buffer, 0, buffer.length-1));

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];

        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        cl_context context = clCreateContext(contextProperties, 1, new cl_device_id[]{device}, null, null, null);

        // Create a command-queue, with profiling info enabled
        long properties = 0;
        properties |= CL.CL_QUEUE_PROFILING_ENABLE;
        cl_command_queue commandQueue = CL.clCreateCommandQueue(context, devices[0], properties, null);

        // Allocate the buffer memory objects
        cl_mem srcMemA = CL.clCreateBuffer(context, CL.CL_MEM_READ_ONLY | CL.CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, srcA, null);
        cl_mem srcMemB = CL.clCreateBuffer(context, CL.CL_MEM_READ_ONLY | CL.CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, srcB, null);
        cl_mem dstMem0 = CL.clCreateBuffer(context, CL.CL_MEM_READ_WRITE, Sizeof.cl_float * n, null, null);

        // Create the program from source
        cl_program program0 = CL.clCreateProgramWithSource(context, 1, new String[]{ programSource0 }, null, null);

        // Build the programs
        CL.clBuildProgram(program0, 0, null, null, null, null);

        // Create the kernels
        cl_kernel kernel0 = CL.clCreateKernel(program0, "vectorAdd", null);

        // Set the arguments
        CL.clSetKernelArg(kernel0, 0, Sizeof.cl_mem, Pointer.to(srcMemA));
        CL.clSetKernelArg(kernel0, 1, Sizeof.cl_mem, Pointer.to(srcMemB));
        CL.clSetKernelArg(kernel0, 2, Sizeof.cl_mem, Pointer.to(dstMem0));

        // Set work-item dimensions and execute the kernels
        long globalWorkSize[] = new long[]{n};

        System.out.println("Enqueueing kernels...");
        cl_event[] myEventID = new cl_event[1];
        myEventID[0] = new cl_event();
        clEnqueueNDRangeKernel(commandQueue, kernel0, 1, null, globalWorkSize, null, 0, null, myEventID[0]);

        int ID[] = new int[1];
        ID[0] = 10;
        clSetEventCallback(myEventID[0], CL_COMPLETE, kernelCommandEvent, ID[0]);

        clFinish(commandQueue);
        System.out.println("Pause for 15000 ms.");
        try
        {
            Thread.sleep(15000);
        }
        catch(InterruptedException iEx)
        {
            iEx.printStackTrace();
        }

        // See if task completed
        int[] ok = new int[1];
        Arrays.fill(ok, 0);
        clGetEventInfo(myEventID[0], CL_EVENT_COMMAND_EXECUTION_STATUS, Sizeof.cl_int, Pointer.to(ok), null);
        System.out.println(ok[0] == CL_COMPLETE ? "Task COMPLETE" : "Task INCOMPLETE");
    }
}
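
A note on the timing printed by the callback above: it measures from `CL_PROFILING_COMMAND_QUEUED` to `CL_PROFILING_COMMAND_END`, so the figure includes the time the command spent waiting in the queue, not only kernel execution (which would run from `CL_PROFILING_COMMAND_START`). OpenCL profiling counters are expressed in nanoseconds. A minimal sketch of the conversion follows; the names `ProfilingMath` and `elapsedMillis` are hypothetical and the timestamps are made-up sample values, not measurements.

```java
public class ProfilingMath {

    // Converts a pair of nanosecond device timestamps into elapsed milliseconds.
    public static double elapsedMillis(long startNanos, long endNanos) {
        return (endNanos - startNanos) * 1.0e-6;
    }

    public static void main(String[] args) {
        // Made-up sample timestamps in nanoseconds (assumption, for illustration only).
        long queued = 1_000_000L;
        long start = 2_500_000L;
        long end = 3_631_000L;

        System.out.println("queued -> end (ms): " + elapsedMillis(queued, end));
        System.out.println("start  -> end (ms): " + elapsedMillis(start, end));
    }
}
```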

score -1 · Answer 2 · answered Mar 02 '16 at 15:31

I think my reasoning was not so far off: you do need to programmatically force the HOST to switch to DEVICE work when both HOST and DEVICE are the same hardware.

In fact, it is possible to use the HOST as a DEVICE, but to let the DEVICE work, you need to call at least one blocking function (clFinish(), or clEnqueueReadBuffer(..., CL_TRUE, ...)). Otherwise, the HOST keeps running and never switches to DEVICE work. I tried adding a sleep() call, but it did not help; you really need a blocking OpenCL function instead.

Thanks at any rate!

Algernon2


That is not true; the enqueue functions are not blocking. For example, the `clEnqueueNDRangeKernel` function just puts the kernel into a queue. The host thread continues execution even if the kernel has not completed yet. – Martin Zabel Mar 03 '16 at 07:28
Thank you, Martin, but please look at the code in my other answer and tell me where my mistake is. When the clFinish(...) call is commented out, I get Task INCOMPLETE as output, and Task COMPLETE when clFinish(...) is uncommented. Could you tell me why, and where my mistake is? Thank you – Algernon2 Mar 03 '16 at 09:11
Yes, `clFinish` waits for command completion and thus blocks the host thread. But `clEnqueueNDRangeKernel` is still non-blocking. You can do whatever you want on the host between both calls. – Martin Zabel Mar 03 '16 at 09:15
OK, but at the end of my code I sleep for 15 seconds, and without clFinish() the callback function is never called when the kernel completes. Why? – Algernon2 Mar 03 '16 at 09:17
This is a different question. Please ask a new question and provide a [minimal, complete and verifiable example](http://stackoverflow.com/help/mcve). – Martin Zabel Mar 03 '16 at 09:20
And by the way, if the thread sleeps, it may not be possible to process the callback. – Martin Zabel Mar 03 '16 at 09:38
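
The distinction drawn in these comments (a non-blocking enqueue, a completion callback, and a separate blocking wait) can be sketched in plain Java. This is an analogy only and does not call OpenCL: `CompletableFuture.supplyAsync` stands in for the enqueue, `thenAccept` for registering a completion callback, and `join` for the blocking wait. The names `EventCallbackAnalogy` and `runAndWait` are hypothetical, and the sketch says nothing about how a particular OpenCL driver schedules its callbacks.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class EventCallbackAnalogy {

    // Returns true if the completion callback has run once the blocking wait returns.
    public static boolean runAndWait() {
        AtomicBoolean callbackRan = new AtomicBoolean(false);

        // "Enqueue": returns immediately, the computation runs on another thread.
        CompletableFuture<Integer> event = CompletableFuture.supplyAsync(() -> 6 * 7);

        // "Set event callback": runs after the computation has completed.
        CompletableFuture<Void> done = event.thenAccept(result -> callbackRan.set(true));

        // The host could do unrelated work here; nothing above has blocked it.

        // "Blocking wait": returns only after the computation and its callback are finished.
        done.join();
        return callbackRan.get() && event.isDone();
    }

    public static void main(String[] args) {
        System.out.println("Callback ran before the wait returned: " + runAndWait());
    }
}
```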
