
I implemented an RNS Montgomery exponentiation in CUDA.

Everything works and everything is fine, but it runs on just one SM.

BUT, so far I have focused on parallelizing just a single exponentiation. What I want to do now is test with several exponentiations in flight. That is, I want the i-th exponentiation to be assigned to a free SM.

I tried, but the total time always grew linearly, meaning all the exponentiations were being assigned to the same SM.

Then I switched to streams, but nothing changed.

However, I have never used them before, so maybe I am doing something wrong.

This is the code:

void __smeWrapper() {
    cudaEvent_t start, stop;
    cudaStream_t stream0, stream1, stream2;
    float time;
    unsigned int j, i, tmp;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    dim3 threadsPerBlock(SET_SIZE, (SET_SIZE+1)/2);

    setCudaDevice();

    s_transferDataToGPU();

    if(cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) != cudaSuccess)
        printf("cudaDeviceSetCacheConfig ERROR!");

    cudaEventRecord( start, 0 );

    //for(i=0; i<EXPONENTIATION_NUMBER; i++)    {
    i = 0;
    __me<<< 1, threadsPerBlock, 0, stream0 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
                                               __mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
    i = 1;
    __me<<< 1, threadsPerBlock, 0, stream1 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
                                               __mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
    i = 2;
    __me<<< 1, threadsPerBlock, 0, stream2 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
                                               __mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
        //printf("\n%s\n\n", cudaGetErrorString(cudaGetLastError()));
    //}

    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &time, start, stop );
    printf("GPU %f µs : %f ms\n", time*1000, time);

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
}

Ubuntu 11.04 64-bit, CUDA 5 RC, GTX 560 Ti (8 SMs)


1 Answer


All threads of a block always run on the same SM. You need to launch more than one block to use the other SMs.
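A minimal sketch of that idea (the kernel me_batched and its indexing are illustrative, not the asker's actual __me): launch one block per exponentiation and select the job from blockIdx.x, so the hardware scheduler is free to spread the blocks over the SMs:

__global__ void me_batched(int *x, int *z, int n_exp)
{
    int job = blockIdx.x;                 // one block per exponentiation
    if (job >= n_exp) return;
    // ... per-exponentiation work on &x[job * (2*SET_SIZE + 1)] goes here ...
}

// One launch covers all the work, instead of a loop of <<< 1, ... >>> launches:
// me_batched<<< EXPONENTIATION_NUMBER, threadsPerBlock >>>(d_x, d_z, EXPONENTIATION_NUMBER);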

There also seems to be something wrong with your streams - do you call cudaStreamCreate for every stream? On my system, launching into a stream that was never created crashes with a SEGFAULT.
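For the stream path, here is a minimal sketch of the setup the question's code is missing (the __me argument list is elided; the count of three matches the question):

cudaStream_t streams[3];
for (int s = 0; s < 3; ++s)
    cudaStreamCreate(&streams[s]);        // must exist before any launch uses it

for (int s = 0; s < 3; ++s)
    __me<<< 1, threadsPerBlock, 0, streams[s] >>>(/* per-exponentiation arguments */);

cudaDeviceSynchronize();                  // wait for all three streams to finish
for (int s = 0; s < 3; ++s)
    cudaStreamDestroy(streams[s]);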

  • Yeah, I increased the block number and it worked, thanks. Just a question: is there a way to have multiple instances of __me running in parallel (each on a different SM), or is the one you suggested the only and/or right way to do it? @Eugene – elect Aug 28 '12 at 12:10
  • I only just noticed that you were using streams. Where are you creating the streams, and how do you detect that the streams are not concurrent? The following code attempts to run the kernels in parallel (my kernels are too small and complete before the next one starts): for (i = 0; i < 10; i++) { CUDA_CHECK_RETURN(cudaStreamCreate(stream + i)); bitreverse<<<1, WORK_SIZE, WORK_SIZE * sizeof(int), stream[i]>>>(d); CUDA_CHECK_RETURN(cudaStreamDestroy(stream[i])); } – Eugene Aug 28 '12 at 16:00
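For readability, the loop from Eugene's comment laid out as a block (CUDA_CHECK_RETURN, bitreverse, WORK_SIZE, stream, and d are from his example and assumed to be defined elsewhere):

// One stream per launch; created before use, destroyed right after the launch.
for (i = 0; i < 10; i++) {
    CUDA_CHECK_RETURN(cudaStreamCreate(stream + i));
    bitreverse<<<1, WORK_SIZE, WORK_SIZE * sizeof(int), stream[i]>>>(d);
    // cudaStreamDestroy returns immediately; the stream's resources are
    // released once its pending work has completed.
    CUDA_CHECK_RETURN(cudaStreamDestroy(stream[i]));
}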