
I want to measure ONLY the inference time on the Jetson TX2. How can I improve my function to do that? Right now I am measuring:

  • the transfer of the image from CPU to GPU

  • transfer of results from GPU to CPU

  • the inference

Or is that not possible because of the way GPUs work? I mean, how many times would I have to call stream.synchronize() if I split the function into 3 parts (see the sketch after the list)?

  1. transfer from CPU to GPU
  2. Inference
  3. transfer from GPU to CPU
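
Roughly, this is what I have in mind for the 3-part split. This is only a sketch using the same variables as in do_inference() below; I am not sure whether every stream.synchronize() call here is actually needed, which is what I am asking:

import time

# 1. Transfer from CPU to GPU
t0 = time.perf_counter()
cuda.memcpy_htod_async(d_input, h_input, stream)
stream.synchronize()   # wait for the host-to-device copy to finish
t1 = time.perf_counter()

# 2. Inference
context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
stream.synchronize()   # is this one needed too?
t2 = time.perf_counter()

# 3. Transfer from GPU to CPU
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()   # wait for the device-to-host copy to finish
t3 = time.perf_counter()

print("H2D copy (ms):  ", (t1 - t0) * 1000)
print("Inference (ms): ", (t2 - t1) * 1000)
print("D2H copy (ms):  ", (t3 - t2) * 1000)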

Thank you

CODE IN INFERENCE.PY

def do_inference(engine, pics_1, h_input, d_input, h_output, d_output, stream, batch_size):

    """
    This is the function to run the inference
    Args:
  engine : The deserialized TensorRT engine (ICudaEngine). 
      pics_1 : Input images to the model.  
      h_input: Input in the host (CPU). 
      d_input: Input in the device (GPU). 
      h_output: Output in the host (CPU). 
      d_output: Output in the device (GPU). 
      stream: CUDA stream.
      batch_size : Batch size for execution time.
    
    Output:
      The list of output images.

    """
      
    # Context for executing inference using ICudaEngine
    with engine.create_execution_context() as context:
        
        # Transfer input data from CPU to GPU.
        cuda.memcpy_htod_async(d_input, h_input, stream)

        # Run inference.
        #context.profiler = trt.Profiler() ##shows execution time(ms) of each layer
        context.execute(batch_size=batch_size, bindings=[int(d_input), int(d_output)])

        # Transfer predictions back from the GPU to the CPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        
        # Synchronize the stream.
        stream.synchronize()
        
        # Return the host output.
        out = h_output       
        return out

CODE IN TIMER.PY

for i in range(count):
    start = time.perf_counter()
    # Classification - calling TX2_classify.py
    out = eng.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1) 
    inference_time = time.perf_counter() - start
    print("TIME")
    print(inference_time * 1000)
    print("\n")
    pred = postprocess_inception(out)
    print(pred)
    print("\n")
    If you want to measure the inference time on the GPU only, you can wrap `context.execute` with timer statements. You won't need `stream.synchronize()`; instead use `cuda.memcpy_htod` / `cuda.memcpy_dtoh`, which are blocking calls. In the current code, are you including the preprocessing time too? – mibrahimy Nov 26 '20 at 14:38
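
A minimal sketch of what the comment suggests, assuming the same pycuda/TensorRT objects as in do_inference() above: the blocking copies cuda.memcpy_htod / cuda.memcpy_dtoh replace the async ones, so no stream.synchronize() is needed, and only context.execute() is timed.

import time

# Blocking host-to-device copy: returns only after the input is on the GPU.
cuda.memcpy_htod(d_input, h_input)

# Time only the inference. context.execute() is the synchronous API,
# so it returns once the GPU has finished executing the engine.
start = time.perf_counter()
context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
inference_time_ms = (time.perf_counter() - start) * 1000

# Blocking device-to-host copy: returns only after h_output is filled.
cuda.memcpy_dtoh(h_output, d_output)

print("Inference only (ms):", inference_time_ms)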

0 Answers