
This is a continuation of the question:

Facing issue while running Flask app with TensorRt model on jetson nano

The issue above is resolved, but when I run the Flask app it keeps loading and does not show the video.

code:

import threading
import pycuda.driver as cuda

def callback():
    cuda.init()
    device = cuda.Device(0)
    ctx = device.make_context()  # CUDA context is created on (and tied to) this thread
    onnx_model_path = './some.onnx'
    fp16_mode = False
    int8_mode = False
    trt_engine_path = './model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
    max_batch_size = 1
    engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode, int8_mode)
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    ctx.pop()

##callback function ends


worker_thread = threading.Thread(target=callback())
worker_thread.start()

trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    print("start in do_inference")
    # Transfer input data from the CPU (host) to the GPU (device).
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    print("before run inference in do_inference")
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    print("before output in do_inference")
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    print("before stream synchronize in do_inference")
    # Synchronize the stream.
    stream.synchronize()
    # Return only the host outputs.
    print("before return in do_inference")
    return [out.host for out in outputs]
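
allocate_buffers and the HostDeviceMem-style objects it returns (with the .host and .device attributes that do_inference relies on) are not shown in the question. For reference, here is a minimal sketch of the usual pattern from NVIDIA's TensorRT Python samples, assuming an implicit-batch engine as max_batch_size suggests:

import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem:
    """Pairs a pinned host buffer with its device allocation."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def allocate_buffers(engine):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        # Buffer size for one full batch of this binding.
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Page-locked host memory enables the async host<->device copies above.
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream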


1 Answer


Your worker_thread creates the CUDA context required by do_inference, and that context (along with context, bindings, inputs, outputs, and stream) only exists inside callback(). You should call do_inference inside callback(), before ctx.pop():

def callback():
    cuda.init()
    device = cuda.Device(0)
    ctx = device.make_context()
    onnx_model_path = './some.onnx'
    fp16_mode = False
    int8_mode = False
    trt_engine_path = './model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
    max_batch_size = 1
    engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode, int8_mode)
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    # post-process the trt_outputs
    ctx.pop()
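
To actually display the video, the Flask route still needs to receive the processed frames from the worker thread. One common pattern is to hand them over through a queue.Queue and stream them as MJPEG. This is only a minimal sketch, not code from your app: frame_queue, /video_feed, and the JPEG-encoding step are hypothetical, and callback is the function above.

import queue
import threading

from flask import Flask, Response

app = Flask(__name__)
frame_queue = queue.Queue(maxsize=10)  # filled by the worker thread

# Inside callback(), after post-processing each frame, the worker would do
# something like: frame_queue.put(jpeg_bytes)

def generate_frames():
    while True:
        jpeg_bytes = frame_queue.get()  # block until the worker produces a frame
        yield (b'--frame\r\n'
               b'Content-Type: image/jpeg\r\n\r\n' + jpeg_bytes + b'\r\n')

@app.route('/video_feed')
def video_feed():
    # MJPEG stream: the browser replaces the image with each new multipart chunk.
    return Response(generate_frames(),
                    mimetype='multipart/x-mixed-replace; boundary=frame')

if __name__ == '__main__':
    # Pass the function itself (callback), not its return value (callback()).
    threading.Thread(target=callback, daemon=True).start()
    app.run(host='0.0.0.0', port=5000, threaded=True)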
mibrahimy