I tested PCIe throughput with OpenCL and I am getting strange results.

I am using PCIe 3 x16. DATA_SIZE = 2097152 (int elements), so the data size is 8388608 bytes = 8.388608 MB.

The code is as follows:

struct timeval  tv1, tv2;
gettimeofday(&tv1, NULL);  /* wall-clock start (needs <sys/time.h>) */

/* Blocking write (CL_TRUE): host_buffer can be reused once the call returns. */
err = clEnqueueWriteBuffer(the_queue, dev_buffer, CL_TRUE, 0, sizeof(int) * DATA_SIZE, host_buffer, 0, NULL, &event);
if (err != CL_SUCCESS) {
    printf("Error: clEnqueueWriteBuffer: dbuff_in (code %d)\n\n", err);
    exit(1);  /* non-zero status on failure */
}
clFinish(the_queue);

gettimeofday(&tv2, NULL);  /* wall-clock end */

printf("Total time = %f seconds\n",
       (double) (tv2.tv_usec - tv1.tv_usec) / 1000000 +
       (double) (tv2.tv_sec - tv1.tv_sec));

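/* Event timestamps are in nanoseconds and are only available if the_queue
   was created with CL_QUEUE_PROFILING_ENABLE. */
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
if (err != CL_SUCCESS) {
    printf("Error in profiling! Err code %d\n\n", err);
    exit(1);
}

err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
if (err != CL_SUCCESS) {
    printf("Error in profiling! Err code %d\n\n", err);
    exit(1);
}

/* (end - start) is in ns; * 1.0e-6 converts it to ms. */
transfertime = transfertime + (end - start) * 1.0e-6f;
/* Dividing the ms value by 1000 prints seconds. */
printf("%f  ", transfertime / 1000);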

When running this code I get the following results:

Total time = 11.2 ms

Transfer time = 0.02 ms

Note: I used two different methods to measure the transfer time. One uses the OpenCL profiling information (the event's START and END timestamps); the other simply takes wall-clock timestamps before and after the transfer.

Given that PCIe 3 x16 has a throughput of 15.62884615 GB/s, the theoretical minimum time to transfer my data would be ~0.54 ms.

With the profiling-based measurement I get a result that is roughly 25 times faster than the theoretical maximum of PCIe 3, whereas with the wall-clock measurement (timestamps before and after the transfer) I get a result that is ~20 times slower than the theoretical speed.

Could you please share your experience with this? What is the right way to benchmark the transfer? And why is the behavior so strange?

PS. The device is an FPGA (VU9P).

Art0