
I am exploring a move from OpenCL to CUDA, and ran a few tests to benchmark the speed of CUDA under various implementations. To my surprise, in the examples below, the PyCUDA implementation is about 20% faster than the C CUDA example.

I read many posts about making a "release build" of C CUDA code. I did try adding -Xptxas -O3 to the makefile, but it made no difference. I also tried adjusting the block size with which the kernel is launched; unfortunately, that did not improve the speed either.

My questions here are:

  • What could be the reasons for the speed difference between C CUDA and PyCUDA?
  • If the "advanced" (for lack of a better word) compiling in PyCUDA is one of the reasons, how can I optimize the compiling of my C CUDA code?
  • Are there any other ways to improve the speed of C CUDA in this case?

While I appreciate general comments, I am looking for actionable suggestions that I can validate on my machine. Thanks!

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

from pycuda.compiler import SourceModule
import time


mod = SourceModule(
    """
__global__ void saxpy(int n, const float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        y[i] = a * x[i] + y[i];
    }
}
"""
)

saxpy = mod.get_function("saxpy")

N = 1 << 25
time_elapse = 0.0

for i in range(100):
    # print(i)
    # print(N)

    x = np.ones(N).astype(np.float32)
    y = 2 * np.ones(N).astype(np.float32)
    start = time.time()
    saxpy(
        np.int32(N),
        np.float32(2.0),
        drv.In(x),
        drv.InOut(y),
        block=(512, 1, 1),
        grid=(int(N / 512) + 1, 1),
    )
    time_elapse += (time.time() - start)


print(time_elapse)
print(y[-100:-1])
print(y.sum())
print(N * 4.0)


#include <stdio.h>
#include <time.h>
#define DIM 512



__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(int num_iterations)
{
    double start;
    double cputime;
    int N = 1 << 25;
    float *x, *y, *d_x, *d_y;
    int i, j;
    for (j = 0; j < num_iterations; j++)
    {
        x = (float *)malloc(N * sizeof(float));
        y = (float *)malloc(N * sizeof(float));

        cudaMalloc(&d_x, N * sizeof(float));
        cudaMalloc(&d_y, N * sizeof(float));

        for (i = 0; i < N; i++)
        {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

        // Perform SAXPY on 1M elements
        start = clock();
        saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y);
        cputime += ((double)(clock() - start) / CLOCKS_PER_SEC);
        cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

        // float maxError = 0.0f;
        // for (int i = 0; i < N; i++){
        //     maxError = max(maxError, abs(y[i] - 4.0f));
        //     //printf("y[%d]: %f\n", i,y[i]);
        // }
        // printf("Max error: %f\n", maxError);

        cudaFree(d_x);
        cudaFree(d_y);
        free(x);
        free(y);
    }

 
    printf("cpu time is %f\n", cputime);
    return 0;
}

I saved the above file as cuda_example.cu and compiled it with the following command in a makefile:

nvcc -arch=sm_61 -Xptxas -O3,-v -o main cuda_example.cu

w.tian
    `-Xptxas -O3` is the default used by `nvcc`. Your timing framework times a mix of host activity and device activity, which is generally a bad idea. You would want to compare the execution time of the respective kernels *in isolation*, then compare the execution time of the respective host code. I suspect you will find the timing difference is in the latter. – njuffa Mar 01 '21 at 19:45
  • In the PyCUDA code you never copy data back after you compute the result, you simply loop over the kernel again. In the C version, you allocate, initialise, copy, compute, copy and free all in the same loop body. That is simply inefficient and has nothing to do with the kernel, or the optimisations made by the compiler. – geebert Mar 01 '21 at 19:57
  • Also, are you trying to take num_iterations as a command-line argument? `int main(int num_iterations)` is not the way to do that. – geebert Mar 01 '21 at 20:03
  • @geebert Oh, sorry, I actually renamed the function to `main` without noticing that. I will change it. When I tried to time `saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y)`, it gave me an unrealistically small number. I guess it only times the time to enqueue (not the real execution time)? – w.tian Mar 01 '21 at 20:32
  • Kernel launches in CUDA are asynchronous, therefore if you only measure one kernel execution, you should have either a runtime API call like cudaMemcpy, or cudaDeviceSynchronize() after the kernel call, otherwise you are right, it is probably not the real execution time (see the event-timing sketch after these comments). – geebert Mar 01 '21 at 20:39
  • @geebert right...then, looking at your numbers, we can still find a considerable difference between C CUDA and PyCUDA. What is the reason behind that? – w.tian Mar 01 '21 at 20:45
  • The pycuda.driver.In() and Out() methods still copy data back and forth between device and host. You can reduce some overhead by using prepared invocations, indicating the argument types beforehand. If you want a fair and realistic comparison between the C and PyCUDA code, use cuda.memcpy_htod before the loop, prepare the kernel call using kernelname.prepare(), and then loop over the actual kernel call. I suspect PyCUDA will then be much closer to the performance of the CUDA C code (a prepared-call sketch follows the answer below). – geebert Mar 02 '21 at 09:15
  • @geebert fair enough! Thanks for your time and explanation. – w.tian Mar 02 '21 at 15:17
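
To make the async-launch point from these comments concrete, here is a minimal, self-contained sketch (mine, not the original poster's code) that times one saxpy launch in isolation with CUDA events through PyCUDA. The kernel body matches the question's listing, and the 512-thread block size is carried over as an assumption:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
""")
saxpy = mod.get_function("saxpy")

N = 1 << 25
x = np.ones(N, dtype=np.float32)
y = 2 * np.ones(N, dtype=np.float32)

# Copy the inputs to the device once, outside the timed region.
d_x = drv.mem_alloc(x.nbytes)
d_y = drv.mem_alloc(y.nbytes)
drv.memcpy_htod(d_x, x)
drv.memcpy_htod(d_y, y)

# CUDA events are recorded on the device, so they bracket the kernel
# itself rather than the asynchronous launch call on the host.
start, end = drv.Event(), drv.Event()
start.record()
saxpy(np.int32(N), np.float32(2.0), d_x, d_y,
      block=(512, 1, 1), grid=((N + 511) // 512, 1))
end.record()
end.synchronize()
print("kernel time: %.3f ms" % start.time_till(end))

Timing the launch with time.time() alone, as in the question, mostly measures how quickly the host can enqueue work (plus, with drv.In/drv.InOut, two large host-to-device copies).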

1 Answer


If I execute your CUDA C code as is and set num_iterations to 300 like this:

int num_iterations = 300;

then the execution of your program takes about 60 s on a GeForce GTX 1650. Your code is extremely inefficient, as you copy data back and forth between host and device at every iteration. So, let's restrict the loop to just the kernel execution:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define DIM 512

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    double start = clock();
    int N = 1 << 25;
    float *x, *y, *d_x, *d_y;
    int i, j;

    int num_iterations = 300;
    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));

    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    for (i = 0; i < N; i++)
    {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

    for (j = 0; j < num_iterations; j++) {
        saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y);
        cudaDeviceSynchronize();
    }
    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);

    double cputime = ((double)(clock() - start) / CLOCKS_PER_SEC);
    printf("cpu time is %f\n", cputime);
    return 0;
}

If I do that, the execution time becomes 1.36 seconds. Doing something similar with the PyCUDA code, I got about 19 s of execution time.
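
For reference, here is a sketch of "something similar" on the PyCUDA side (my reconstruction, not the exact code timed above): copy once before the loop, declare the argument types with prepare() so each launch skips PyCUDA's per-call argument handling, and loop over nothing but launches. The "ifPP" type string and the 300-iteration count mirror the C version above and are assumptions of this sketch:

import time
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
""")
saxpy = mod.get_function("saxpy")
saxpy.prepare("ifPP")  # int n, float a, two device pointers

N = 1 << 25
num_iterations = 300
x = np.ones(N, dtype=np.float32)
y = 2 * np.ones(N, dtype=np.float32)

# Allocate and copy once, outside the timed loop.
d_x = drv.mem_alloc(x.nbytes)
d_y = drv.mem_alloc(y.nbytes)
drv.memcpy_htod(d_x, x)
drv.memcpy_htod(d_y, y)

grid, block = ((N + 511) // 512, 1), (512, 1, 1)
start = time.time()
for _ in range(num_iterations):
    saxpy.prepared_call(grid, block, N, 2.0, d_x, d_y)
drv.Context.synchronize()  # wait for all launches before stopping the clock
print("elapsed: %.3f s" % (time.time() - start))

drv.memcpy_dtoh(y, d_y)  # copy the result back once at the end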

geebert
  • Running 100 iterations, C CUDA took 0.001648 seconds and PyCUDA 4.6 seconds to execute the kernel. My GPU is an NVIDIA P2000. – w.tian Mar 01 '21 at 20:39
  • For 100 iterations with the above code I get about 0.6 s, which seems reasonable to me. – geebert Mar 01 '21 at 20:43