In YOLOv5, running two models on the same input is done through model ensembling.
The Model Ensembling Tutorial defines it as follows:
Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or using different training data sets. The ensemble model then aggregates the prediction of each base model and results in one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction.
So, model ensembling can improve mAP and recall during testing and inference, but the two models must be trained on the same classes.
The same point is clarified in issue #1188.
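
To illustrate why the classes have to match, here is a minimal sketch of one common way to aggregate predictions: pool the boxes from every model and run a single class-aware NMS pass over the pooled set. This is a simplified stand-in written in plain Python, not YOLOv5's own code. The `Detection` tuple layout, the function names, and the 0.45 threshold are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical layout for this sketch: (x1, y1, x2, y2, confidence, class_id)
Detection = Tuple[float, float, float, float, float, int]


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ensemble_nms(per_model: List[List[Detection]],
                 iou_thres: float = 0.45) -> List[Detection]:
    """Pool detections from all models, then keep the most confident box
    among same-class boxes that overlap by more than iou_thres."""
    pooled = sorted((d for dets in per_model for d in dets),
                    key=lambda d: d[4], reverse=True)
    kept: List[Detection] = []
    for det in pooled:
        # A box survives if every kept box is a different class
        # or overlaps it by no more than the threshold.
        if all(det[5] != k[5] or iou(det, k) <= iou_thres for k in kept):
            kept.append(det)
    return kept


# Made-up detections: both models find the same class-0 object.
model_a = [(10.0, 10.0, 50.0, 50.0, 0.90, 0)]
model_b = [(12.0, 11.0, 51.0, 49.0, 0.80, 0),
           (100.0, 100.0, 140.0, 140.0, 0.70, 1)]
merged = ensemble_nms([model_a, model_b])
print(merged)
```

The duplicate class-0 box is merged only because class id 0 means the same thing in both models. If the two models used different label sets, the comparison `det[5] != k[5]` would be meaningless, which is the reason for the same-classes requirement.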
So, a possible workaround is to use the output video from inference with the first model as the input to inference with the second model.
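
A rough sketch of that chaining, assuming the commands are run from a YOLOv5 checkout and that `detect.py` writes the annotated video to `<project>/<name>/<input stem>.mp4` (check this against your version). The helper `build_chain`, the weight file names, and the `runs/chain` folder are hypothetical. The script only builds and prints the two commands; it does not run them.

```python
import shlex
from pathlib import Path
from typing import List


def build_chain(source: str, first_weights: str, second_weights: str,
                project: str = "runs/chain") -> List[List[str]]:
    """Build two detect.py invocations where stage 2 reads stage 1's output."""
    # Assumed output location of stage 1: <project>/stage1/<stem>.mp4
    stage1_out = (Path(project) / "stage1"
                  / Path(source).with_suffix(".mp4").name).as_posix()
    base = ["python", "detect.py", "--project", project, "--exist-ok"]
    stage1 = base + ["--weights", first_weights,
                     "--source", source, "--name", "stage1"]
    stage2 = base + ["--weights", second_weights,
                     "--source", stage1_out, "--name", "stage2"]
    return [stage1, stage2]


# Placeholder file names for illustration only.
commands = build_chain("input.mp4", "model_a.pt", "model_b.pt")
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed command can then be run in order, for example with `subprocess.run(cmd, check=True)`. Note that the second model sees frames that already have the first model's boxes and labels drawn on them, which may affect its detections.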