
I've tested the following on a GTX 690 GPU with 4 GB RAM, on Windows 7 x64 with Visual C++ 10:

I've written a function that receives two vectors and adds them into a third vector. The task is split across the two GPU devices. I gradually increased the vector size to benchmark GPU performance. The required time increases linearly with vector size up to a certain point, and then it abruptly jumps up. When I disable either of the GPU cores, the required time stays linear up to the end of the available memory. I've enclosed a diagram displaying required time versus allocated memory.

You can see the speed diagram here: [Speed Comparison Diagram]

Can you tell me what is wrong?

Best, Ramin

This is my code:

unsigned    BenchMark( unsigned VectorSize )
{
    unsigned *      D[ 2 ][ 3 ] ;

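    // Allocate two source vectors and one destination vector on each of the two devices.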
    for ( int i = 0 ; i < 2 ; i++ )
    {
        cudaSetDevice( i ) ;

        for ( int j = 0 ; j < 3 ; j++ )
            cudaMalloc( & D[ i ][ j ] , VectorSize * sizeof( unsigned ) ) ;
    }

    unsigned    uStartTime = clock() ;

    // TEST: launch one AddKernel on each device and time both launches together
    for ( int i = 0 ; i < 2 ; i++ )
    {
        cudaSetDevice( i ) ;

        AddKernel<<<VectorSize/256,256>>>(
            D[ i ][ 0 ] ,
            D[ i ][ 1 ] ,
            D[ i ][ 2 ] ,
                VectorSize ) ;
    }

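    // Wait for the kernels on both devices to finish before stopping the timer.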
    cudaDeviceSynchronize() ;
    cudaSetDevice( 0 ) ;
    cudaDeviceSynchronize() ;

    unsigned    uEndTime = clock() ;

    for ( int i = 0 ; i < 2 ; i++ )
    {
        cudaSetDevice( i ) ;

        for ( int j = 0 ; j < 3 ; j++ )
            cudaFree( D[ i ][ j ] ) ;
    }

    return uEndTime - uStartTime ;
}

__global__ void AddKernel(
                    const   Npp32u *    __restrict__    pSource1 ,
                    const   Npp32u *    __restrict__    pSource2 ,
                            Npp32u *    __restrict__    pDestination ,
                            unsigned                    uLength )
{
    unsigned    x = blockIdx.x * blockDim.x + threadIdx.x ;

    if ( x < uLength )
        pDestination[ x ] = pSource1[ x ] + pSource2[ x ] ; 
}
  • The visual profiler may give you some clues. – Robert Crovella Jun 13 '13 at 04:21
  • Have you checked whether an error occurred in one of the CUDA API calls? Maybe a cudaSetDevice call fails. Then the kernel will be invoked on the wrong device, using features like UVA and peer access to read and write the other device's memory, resulting in an extreme performance impact (see the error-checking sketch after these comments). – Michael Haidl Jun 13 '13 at 12:42
  • Dear Kronos, I have checked for errors, and the work really is done on different cores. For big vectors, the required amount of memory cannot be allocated on one core. – Ramin Halavati Jun 13 '13 at 14:10
  • Dear Robert, I used the Visual Profiler. When smaller vectors are used, the two AddKernels run with a very small delay between them, but when the vector sizes are increased, this delay becomes very large: one core starts the task almost 1 second after the other one, and the profiler shows it idle during this time. – Ramin Halavati Jun 15 '13 at 05:34
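
For reference, here is a minimal error-checking sketch along the lines Michael Haidl suggests above. The CHECK_CUDA macro name and the printf-based reporting are illustrative, not part of the original code:

#include <cstdio>
#include <cuda_runtime.h>

// Evaluate a CUDA runtime call and report any error it returns.
#define CHECK_CUDA( call )                                                    \
    do {                                                                      \
        cudaError_t err = ( call ) ;                                          \
        if ( err != cudaSuccess )                                             \
            printf( "%s failed: %s\n" , #call , cudaGetErrorString( err ) ) ; \
    } while ( 0 )

// Example usage inside BenchMark():
//     CHECK_CUDA( cudaSetDevice( i ) ) ;
//     CHECK_CUDA( cudaMalloc( & D[ i ][ j ] , VectorSize * sizeof( unsigned ) ) ) ;
//     AddKernel<<< VectorSize / 256 , 256 >>>( D[ i ][ 0 ] , D[ i ][ 1 ] , D[ i ][ 2 ] , VectorSize ) ;
//     CHECK_CUDA( cudaGetLastError() ) ;        // catches kernel launch errors
//     CHECK_CUDA( cudaDeviceSynchronize() ) ;   // catches asynchronous execution errors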

1 Answer


I found the answer: the problem occurred because SLI was active. I disabled it, and now it works smoothly.
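
As a quick sanity check (not part of the original answer), a minimal sketch like the following can confirm that both GK104 dies of the GTX 690 enumerate as separate CUDA devices once SLI is disabled:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0 ;
    cudaGetDeviceCount( & deviceCount ) ;
    printf( "CUDA devices found: %d\n" , deviceCount ) ;    // expect 2 for a GTX 690

    for ( int i = 0 ; i < deviceCount ; i++ )
    {
        cudaDeviceProp prop ;
        cudaGetDeviceProperties( & prop , i ) ;
        printf( "Device %d: %s, %u MB global memory\n" ,
                i , prop.name , (unsigned)( prop.totalGlobalMem >> 20 ) ) ;
    }

    return 0 ;
}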