I have been experiencing strange behaviour when I launch two instances of a kernel that are meant to run at the same time while sharing the GPU.
I have developed a CUDA kernel that is meant to run on a single SM (multiprocessor), with each thread performing an operation many times in a loop.
The kernel launches only one block, so it should occupy only one SM.
simple.cu
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <helper_cuda.h>

using namespace std;

// Each thread repeatedly accumulates into its own output element,
// keeping the SM busy for a long, measurable time.
__global__ void increment(float *in, float *out)
{
    int it = 0, i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.8525852f;

    for (it = 0; it < 99999999; it++)
        out[i] += (in[i] + a) * a - (in[i] + a);
}
int main(int argc, char *argv[])
{
    int i;
    int nBlocks = 1;
    int threadsPerBlock = 1024;
    float *A, *d_A, *d_B, *B;
    size_t size = 1024 * 13;

    A = (float *) malloc(size * sizeof(float));
    B = (float *) malloc(size * sizeof(float));
    for (i = 0; i < size; i++) {
        A[i] = 0.74f;
        B[i] = 0.36f;
    }

    cudaMalloc((void **) &d_A, size * sizeof(float));
    cudaMalloc((void **) &d_B, size * sizeof(float));

    // Copy the full arrays; the byte count must include sizeof(float).
    cudaMemcpy(d_A, A, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size * sizeof(float), cudaMemcpyHostToDevice);

    // One block of 1024 threads, so only one SM should be used.
    increment<<<nBlocks, threadsPerBlock>>>(d_A, d_B);
    cudaDeviceSynchronize();

    cudaMemcpy(B, d_B, size * sizeof(float), cudaMemcpyDeviceToHost);

    free(A);
    free(B);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaDeviceReset();
    return 0;
}
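(Since helper_cuda.h is already included, the runtime calls in main() could also be wrapped with its checkCudaErrors macro so that a failed allocation, copy, or launch is reported instead of passing silently. A minimal sketch of what that would look like, using the variables from the listing above:

// Sketch: wrapping the runtime calls with checkCudaErrors from
// helper_cuda.h (CUDA samples) to surface any failures.
checkCudaErrors(cudaMalloc((void **) &d_A, size * sizeof(float)));
checkCudaErrors(cudaMemcpy(d_A, A, size * sizeof(float), cudaMemcpyHostToDevice));
increment<<<nBlocks, threadsPerBlock>>>(d_A, d_B);
checkCudaErrors(cudaGetLastError());      // catches launch-time errors
checkCudaErrors(cudaDeviceSynchronize()); // catches errors during execution

I have omitted this from the listing to keep it short.)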
So if I execute one instance:
time ./simple
I get
real 0m36.659s
user 0m4.033s
sys 0m1.124s
On the other hand, if I execute two instances:
time ./simple & time ./simple
I get for each process:
real 1m12.417s
user 0m29.494s
sys 0m42.721s
real 1m12.440s
user 0m36.387s
sys 0m8.820s
As far as I know, the executions should run concurrently and take about as long as a single one (about 36 seconds). However, each lasts twice the base time. We know that the GPU has 13 SMs, and each kernel creates only one block, so each block should be able to run on its own SM.
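(The SM count can be confirmed with cudaGetDeviceProperties; a small standalone sketch, which reports 13 on this card:

#include <cstdio>
#include <cuda_runtime.h>

// Prints the number of SMs on device 0.
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("multiProcessorCount = %d\n", prop.multiProcessorCount);
    return 0;
}
)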
Are they being executed in the same SM?
Shouldn't they be running concurrently on different SMs?
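One way I could check this directly is to read the SM id from inside the kernel via the %smid special PTX register. This diagnostic kernel is not part of my program, just a sketch of how the check could be done:

#include <cstdio>

// Returns the id of the SM the calling thread is running on,
// read from the %smid special register via inline PTX.
__device__ unsigned int get_smid(void)
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void report_smid(void)
{
    if (threadIdx.x == 0)
        printf("block %d runs on SM %u\n", blockIdx.x, get_smid());
}

Launching report_smid<<<1, 1024>>>() from each process would show whether the two blocks land on the same SM.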
EDITED
To make myself clearer, I will attach the profiles of the concurrent execution, obtained with nvprof:
Now I would like to show the behavior of the same scenario, but running two instances of the matrixMul sample concurrently:
As you can see, in the first scenario one kernel waits for the other to finish, while in the second scenario (matrixMul) kernels from both contexts run at the same time.
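For what it's worth, within a single process I can get this kind of overlap by launching into separate streams, since both launches then share one context. A minimal variant of main() above along those lines (d_A2/d_B2 are a hypothetical second pair of buffers, allocated like d_A/d_B, so the two launches do not race on the same memory):

// Sketch (assumed, not from my actual program): two launches of the same
// kernel from one process, each in its own stream, sharing one context.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
increment<<<nBlocks, threadsPerBlock, 0, s1>>>(d_A,  d_B);
increment<<<nBlocks, threadsPerBlock, 0, s2>>>(d_A2, d_B2);
cudaDeviceSynchronize();
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);

So the question is really about why the same overlap does not happen across two processes.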
Thank you.