
I am trying to compare a simple addition task with both CPU and GPU, but the results that I get are so weird.

First of all, let me explain how I managed to run the GPU task.

Let's dive into the code now. This is my code; it simply adds two arrays element by element, once on the CPU and once on the GPU:

package gpu;
import com.aparapi.Kernel;
import com.aparapi.Range;


public class Try {
    public static void main(String[] args) {

        final int size = 512;
        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }


        //##############CPU-TASK########################
        long start = System.nanoTime();
        final float[] sum = new float[size];
        for(int i=0;i<size;i++){
            sum[i] = a[i] + b[i];
        }
        long finish = System.nanoTime();
        long timeElapsed = finish - start;
        //######################################



        //##############GPU-TASK########################
        final float[] sum2 = new float[size];
        Kernel kernel = new Kernel(){
            @Override public void run() {
                int gid = getGlobalId();
                sum2[gid] = a[gid] + b[gid];
            }
        };

        long start1 = System.nanoTime();
        kernel.execute(Range.create(size));
        long finish2 = System.nanoTime();
        long timeElapsed2 = finish2 - start1;
        //##############GPU-TASK########################


        System.out.println("cpu"+timeElapsed);
        System.out.println("gpu"+timeElapsed2);

        kernel.dispose();
    }
}

My specs are:

Aparapi is running on an untested OpenCL platform version: OpenCL 3.0 CUDA 11.6.13
Intel Core i7 6850K @ 3.60GHz   Broadwell-E/EP 14nm Technology
2047MB NVIDIA GeForce GTX 1060 6GB (ASUStek Computer Inc)

The results that I get are this:

cpu12000
gpu5732829900

My question is: why is the GPU performance so slow? Why does the CPU outperform the GPU? I expected the GPU to be faster than the CPU. Are my measurements wrong, and is there any way to improve them?

Mixalis Navridis

1 Answer


This code measures the host-side execution time of the GPU task. That means the measured time includes the task execution on the GPU, the time to copy the input data to the GPU, the time to read the results back from the GPU, and the overhead introduced by Aparapi. Moreover, according to the documentation for the Kernel class, Aparapi uses lazy initialization:

On the first call to Kernel.execute(int _globalSize), Aparapi will determine the EXECUTION_MODE of the kernel. This decision is made dynamically based on two factors:

  • Whether OpenCL is available (appropriate drivers are installed and the OpenCL and Aparapi dynamic libraries are included on the system path).
  • Whether the bytecode of the run() method (and every method that can be called directly or indirectly from the run() method) can be converted into OpenCL.

Therefore, the host-side execution time of the GPU task cannot be compared with the execution time of the CPU task, because it includes additional work that is performed only once.
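The effect is easy to reproduce in plain Java, without Aparapi (a hypothetical illustration, not part of the original answer): any function that performs one-time setup on its first call looks much slower when only that first call is timed.

```java
// Illustration: the first call pays a one-time initialization cost,
// so timing only the first call is misleading -- the same pitfall as
// timing only the first kernel.execute() in the question.
public class LazyInitTiming {
    static float[] table; // built lazily, analogous to Aparapi's first execute()

    static float lookup(int i) {
        if (table == null) { // one-time setup happens on the first call only
            table = new float[1_000_000];
            for (int j = 0; j < table.length; j++) table[j] = j * 0.5f;
        }
        return table[i];
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        lookup(10); // first call: includes the one-time setup
        long first = System.nanoTime() - t0;

        t0 = System.nanoTime();
        lookup(10); // second call: steady state, just an array read
        long second = System.nanoTime() - t0;

        System.out.println("first call is slower: " + (first > second));
    }
}
```

A common way around this in benchmarks is a warm-up run: execute the kernel once before starting the timer, so the one-time work falls outside the measured region.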

In this case, it is necessary to use the getProfileInfo() call to get the execution-time breakdown for the kernel:

import java.util.List;
import com.aparapi.ProfileInfo;

kernel.execute(Range.create(size));
List<ProfileInfo> profileInfo = kernel.getProfileInfo();
for (final ProfileInfo p : profileInfo) {
    System.out.println(p.getType() + " " + p.getLabel() + " " + (p.getEnd() - p.getStart()) + "ns");
}

Also, please note that the following JVM property must be set: -Dcom.aparapi.enableProfiling=true. For more information, please see the Profiling the Kernel article and the implementation of the ProfileInfo class.

Egor
  • Thanks for your answer, it really helps me, but I cannot understand: if Aparapi adds overhead for initialization, for reading, and for passing data, then what's the point of using it? If we cannot benefit from a simple addition task, can you describe a case where it will be useful? – Mixalis Navridis Jul 29 '22 at 11:14
  • GPU is a highly parallel machine where computational work is done by an array of small cores. For example, GTX 1060 contains 1280 CUDA cores and each of them can execute an instruction in parallel. Therefore, it doesn't make sense to offload small portions of work to GPU, because it won't be fully utilized. And it is desirable that the kernel contains enough compute operations to hide a memory access latency. As a result, GPU should be used when you have enough work to hide the offload overhead (for example, for image processing, training deep neural networks, etc.). – Egor Jul 29 '22 at 20:14
  • @Egor so your recommendation is to not use the GPU for this simple task, which I call multiple times with dynamic data passed to the a[] and b[] arrays? – Panagiotis Drakatos Jul 29 '22 at 21:39
  • Yes, if the workload is pretty small then running it on CPU and even in a single thread can be the best choice. The approach that I described above is applicable not only for GPU offloading but also for multi-threading. For example, if you look at the implementation of [`tbb::parallel_sort`](https://github.com/oneapi-src/oneTBB/blob/master/include/oneapi/tbb/parallel_sort.h#L239) you'll see that the parallel algorithm is used only when the container has more than 500 elements. Otherwise, the overhead of creating additional threads exceeds the speedup from parallel sorting. – Egor Jul 30 '22 at 18:27
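The size-cutoff idea from the last comment can be sketched in plain Java (a hypothetical example; the threshold value is illustrative, not taken from any library): dispatch to a simple serial loop for small inputs, and take the parallel path only above the cutoff, where the speedup outweighs the coordination overhead.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class SumDispatch {
    // Hypothetical cutoff, analogous in spirit to tbb::parallel_sort's
    // 500-element threshold; the right value depends on the hardware.
    static final int PARALLEL_THRESHOLD = 100_000;

    static float[] add(float[] a, float[] b) {
        float[] sum = new float[a.length];
        if (a.length < PARALLEL_THRESHOLD) {
            // Small input: a plain serial loop avoids all parallel overhead.
            for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
        } else {
            // Large input: spread the work across threads.
            IntStream.range(0, a.length).parallel()
                     .forEach(i -> sum[i] = a[i] + b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        System.out.println(Arrays.toString(add(a, b))); // [5.0, 7.0, 9.0]
    }
}
```

The same dispatch structure applies to GPU offloading: fall back to the CPU below the threshold, and offload only when the batch is large enough to hide the transfer and launch overhead.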