I want to measure ONLY the inference time on the Jetson TX2. How can I improve my function to do that? Right now I am measuring:
- the transfer of the image from CPU to GPU
- the transfer of the results from GPU to CPU
- the inference
Or is that not possible because of the way GPUs work? I mean, how many times would I have to call stream.synchronize() if I split the function into three parts (see the sketch after the list):
- transfer from CPU to GPU
- Inference
- transfer from GPU to CPU
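For reference, here is a rough sketch of the split I have in mind (untested; it assumes the same engine, buffers and stream as in the code below, and uses execute_async so the inference is enqueued on the measured stream). As far as I understand, each host-timed segment needs its own stream.synchronize(), i.e. three in total:

with engine.create_execution_context() as context:
    t0 = time.perf_counter()
    cuda.memcpy_htod_async(d_input, h_input, stream)
    stream.synchronize()  # wait for the H2D copy to finish
    t1 = time.perf_counter()
    context.execute_async(batch_size=1,
                          bindings=[int(d_input), int(d_output)],
                          stream_handle=stream.handle)
    stream.synchronize()  # wait for the inference kernels to finish
    t2 = time.perf_counter()
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()  # wait for the D2H copy to finish
    t3 = time.perf_counter()
print("H2D copy : %.3f ms" % ((t1 - t0) * 1000))
print("inference: %.3f ms" % ((t2 - t1) * 1000))
print("D2H copy : %.3f ms" % ((t3 - t2) * 1000))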
Thank you
CODE IN INFERENCE.PY
import pycuda.autoinit  # initializes the CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt


def do_inference(engine, pics_1, h_input, d_input, h_output, d_output, stream, batch_size):
    """
    Run inference for one batch.
    Args:
        engine : The deserialized TensorRT ICudaEngine.
        pics_1 : Input images (not used directly here; the preprocessed
                 data is expected to already be in h_input).
        h_input: Input buffer on the host (CPU, page-locked).
        d_input: Input buffer on the device (GPU).
        h_output: Output buffer on the host (CPU, page-locked).
        d_output: Output buffer on the device (GPU).
        stream: CUDA stream.
        batch_size: Batch size for execution.
    Returns:
        The host output buffer with the raw network output.
    """
    # Context for executing inference using the ICudaEngine
    with engine.create_execution_context() as context:
        # Transfer input data from CPU to GPU (asynchronous, on `stream`).
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference on the same stream, so the copy, the kernels and the
        # copy-back are correctly ordered (the synchronous execute() takes no
        # stream and could race with the async copy above).
        # context.profiler = trt.Profiler()  # shows execution time (ms) of each layer
        context.execute_async(batch_size=batch_size,
                              bindings=[int(d_input), int(d_output)],
                              stream_handle=stream.handle)
        # Transfer predictions back from the GPU to the CPU (asynchronous).
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Block until all work queued on the stream has finished.
        stream.synchronize()
        # Return the host output.
        return h_output
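An alternative I am considering, to avoid blocking the host after every phase: record pycuda CUDA events (pycuda.driver.Event) into the stream and synchronize once at the end. Events are timestamped on the GPU, so the elapsed time between two events reflects the device timeline. A minimal sketch, assuming the same buffers and stream as above:

ev_start, ev_h2d, ev_infer, ev_d2h = (cuda.Event() for _ in range(4))
with engine.create_execution_context() as context:
    ev_start.record(stream)
    cuda.memcpy_htod_async(d_input, h_input, stream)
    ev_h2d.record(stream)
    context.execute_async(batch_size=1,
                          bindings=[int(d_input), int(d_output)],
                          stream_handle=stream.handle)
    ev_infer.record(stream)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    ev_d2h.record(stream)
    ev_d2h.synchronize()  # single host sync, after all work is queued
print("H2D copy : %.3f ms" % ev_start.time_till(ev_h2d))
print("inference: %.3f ms" % ev_h2d.time_till(ev_infer))
print("D2H copy : %.3f ms" % ev_infer.time_till(ev_d2h))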
CODE IN TIMER.PY
for i in range(count):
    start = time.perf_counter()
    # Classification - calling TX2_classify.py
    out = eng.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1)
    inference_time = time.perf_counter() - start
    print("TIME")
    print(inference_time * 1000)
    print("\n")
    pred = postprocess_inception(out)
    print(pred)
    print("\n")