I am looking into an application that needs to detect delays in receiving video frames and take action when a delay is detected. A delay in receiving video frames is perceived as a video freeze on the render window. The action is the insertion of an IMU frame into the live video while the freeze lasts. The pipelines are as follows:
The Tx and Rx are connected in ad-hoc mode over WiFi, with no other devices involved. Only video is transmitted; audio is not a concern here.
Tx(iMX6 device):
v4l2src fps-n=30 -> h264encode -> rtph264pay -> rtpbin -> udpsink(port=5000) ->
rtpbin.send_rtcp(port=5001) -> rtpbin.recv_rtcp(port=5002)
Rx(ubuntu PC):
udpsrc(port=5000) -> rtpbin -> rtph264depay -> avdec_h264 -> rtpbin.recv_rtcp(port=5001) ->
rtpbin.send_rtcp(port=5002) -> custom IMU frame insertion plugin -> videosink
Now, as per my application, I intend to detect the delay in receiving frames at the Rx device. The delay can be caused by a number of factors, including:
- congestion
- packet loss
- noise, etc.
Once a delay is detected, I intend to insert an IMU (inertial measurement unit) frame (a custom visualization) into the live video. For example, if every 3rd frame is delayed, the video will look like:
V | V | I | V | V | I | V | V | I | V | .....
where V = a video frame received and I = an IMU frame inserted at the Rx device.
Hence, to achieve this, I need to know the timestamp of the video frame when it was sent from the Tx, and compare that timestamp with the current time at the Rx device to get the transmission delay:
frame delay = current time at Rx - timestamp of frame at Tx
Since I am working at 30 fps, I should ideally receive a video frame at the Rx device every ~33 ms. Given that the link is WiFi, and that there are other delays including encoding/decoding, I understand that 33 ms precision is difficult to achieve, and that is perfectly fine for me.
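In code, the check I have in mind is roughly the following (a minimal sketch; is_delayed and the jitter margin are placeholders of mine, and it presumes the common Tx/Rx clock I ask about below):

    FRAME_PERIOD = 1.0 / 30        # ~33.3 ms at 30 fps
    JITTER_MARGIN = FRAME_PERIOD   # tolerate one extra frame period of WiFi jitter

    def is_delayed(rx_now, tx_timestamp):
        # True if the frame took noticeably longer than one frame period.
        # Assumes rx_now and tx_timestamp come from a common, synchronized clock.
        return (rx_now - tx_timestamp) > FRAME_PERIOD + JITTER_MARGIN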
Since I am using RTP/RTCP, I had a look at WebRTC, but its SR/RR reports carry network statistics for only a fraction of the data sent from Tx -> Rx. I also tried the udpsrc timeout feature, which detects that no packets have arrived at the source for a predefined time and posts a message on the bus notifying the timeout. However, this works only if the Tx device stops completely (pipeline stopped with Ctrl+C). If the packets are merely delayed, the timeout does not fire, since the kernel still buffers some old data.
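For completeness, this is roughly how I used the timeout (a minimal sketch in Python with a stripped-down pipeline; the real one has rtpbin, depayloader, decoder, etc.):

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    Gst.init(None)

    # Stripped-down Rx pipeline; fakesink stands in for the real decode/render chain
    pipeline = Gst.parse_launch(
        'udpsrc port=5000 timeout=100000000 '   # timeout is in nanoseconds (100 ms)
        'caps="application/x-rtp,media=video,clock-rate=90000,encoding-name=H264" '
        '! fakesink'
    )

    def on_message(bus, msg):
        # udpsrc posts an element message named GstUDPSrcTimeout after inactivity
        if msg.type == Gst.MessageType.ELEMENT:
            s = msg.get_structure()
            if s is not None and s.get_name() == "GstUDPSrcTimeout":
                print("udpsrc timeout: no packets received")

    bus = pipeline.get_bus()
    bus.add_signal_watch()
    bus.connect("message", on_message)

    pipeline.set_state(Gst.State.PLAYING)
    GLib.MainLoop().run()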
I have the following questions:
- Does it make sense to use the timestamps of each video frame / RTP buffer to detect the delay in receiving frames at the Rx device? What would be a better design for such a use case? Or is considering the timestamp of every frame/buffer too much overhead, so that I should only consider a subset, e.g. every 5th or every 10th frame/buffer? Note also that RTP packets do not map 1:1 to frames: since one encoded frame can be fragmented across several RTP packets, a 30 fps video can yield more than 30 RTP buffers per second in GStreamer. Considering the worst case possible, where every alternate frame is delayed, the video would have the following sequence:
V | I | V | I | V | I | V | I | V | I | .....
I understand that handling every alternate frame with this precision can be difficult, so I am targeting detection and insertion of the IMU frame within 66 ms at most. The switching between live video frames and inserted frames is also a concern. I use the OpenGL plugins to do the IMU data manipulation.
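One low-overhead design I am considering is to not use Tx timestamps at all and instead watch inter-arrival gaps at the Rx with a pad probe plus a watchdog timer, so the 66 ms deadline can fire even while the late frame is still in flight. A sketch, assuming the pipeline object from the snippet above; the element name "dec" and the thresholds are placeholders of mine:

    import time
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    FRAME_PERIOD_NS = Gst.SECOND // 30    # ~33 ms at 30 fps
    THRESHOLD_NS = 2 * FRAME_PERIOD_NS    # my 66 ms detection target

    last_arrival_ns = [None]  # boxed so both callbacks can update/read it

    def on_buffer(pad, info):
        # Fires for every decoded frame; just record its arrival time
        last_arrival_ns[0] = time.monotonic_ns()
        return Gst.PadProbeReturn.OK

    def watchdog():
        # Runs every 10 ms whether or not frames arrive, so a freeze is
        # detected within the deadline, not only when the late frame shows up
        if last_arrival_ns[0] is not None:
            gap = time.monotonic_ns() - last_arrival_ns[0]
            if gap > THRESHOLD_NS:
                print("frame overdue by %.1f ms -> switch to IMU frame" % (gap / 1e6))
        return True  # keep the GLib timer running

    # "dec" would be the name given to avdec_h264 in the real pipeline
    decoder = pipeline.get_by_name("dec")
    decoder.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, on_buffer)
    GLib.timeout_add(10, watchdog)

This only measures Rx-side inter-arrival time, not the actual Tx-to-Rx delay, but it needs no common clock between the devices.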
- Which timestamps should I be considering at the Rx device? To calculate the delay, I need a common reference between the Tx and Rx devices, which I do not have. I can access the PTS and DTS of the RTP buffers, but since no common reference is available, I could not use them to detect the delay. Is there any other way I could do this?
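From what I have read, RTCP Sender Reports might provide exactly this reference: each SR pairs the sender's NTP wall-clock time with the RTP timestamp at that instant (RFC 3550, section 6.4.1), and rtpbin/rtpjitterbuffer expose this mapping (e.g. via the jitterbuffer's "handle-sync" signal). If the Rx system clock were synchronized to the Tx clock (say via NTP or PTP over the ad-hoc link), the one-way delay of every frame could then be computed. A sketch of the arithmetic, with all names being placeholders of mine:

    CLOCK_RATE = 90000  # from the caps

    def tx_send_time(rtp_ts, sr_ntp_secs, sr_rtp_ts):
        # Map a frame's RTP timestamp to the sender's wall clock using the
        # most recent RTCP Sender Report (NTP time, RTP time) pair
        delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF   # RTP timestamps wrap at 2**32
        if delta >= 2**31:                          # re-center into a signed range
            delta -= 2**32
        return sr_ntp_secs + delta / CLOCK_RATE

    def one_way_delay(rx_wallclock_secs, rtp_ts, sr_ntp_secs, sr_rtp_ts):
        # Only meaningful if the Rx wall clock is synchronized to the Tx wall clock
        return rx_wallclock_secs - tx_send_time(rtp_ts, sr_ntp_secs, sr_rtp_ts)

Is this a reasonable direction, or does GStreamer already provide something that does this for me?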
My caps have the following parameters (only a few shown):
caps = application/x-rtp, clock-rate=90000, timestamp-offset=2392035930, seqnum-offset=23406
Can these be used to calculate a common reference between Tx and Rx? I am not sure I understand these numbers or how to use them at the Rx device to derive a reference. Any pointers on understanding these parameters?
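My current understanding, which I would like to have confirmed: clock-rate=90000 means the RTP timestamp advances 90000 ticks per second (3000 ticks per frame at 30 fps), while timestamp-offset and seqnum-offset are just the random per-session starting values of the RTP timestamp and sequence number. That would let me convert RTP timestamps into time relative to the start of the stream, but gives no Tx wall-clock reference by itself:

    clock_rate = 90000
    timestamp_offset = 2392035930   # random per-session origin, from the caps

    def stream_time_secs(rtp_ts):
        # Seconds since the first RTP packet -- stream-relative only,
        # says nothing about the Tx wall clock
        return ((rtp_ts - timestamp_offset) & 0xFFFFFFFF) / clock_rate

    # consecutive 30 fps frames differ by 3000 ticks: 3000 / 90000 = 33.3 ms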
- Are there any other possible approaches for such an application? My idea above could be too impractical, and I am open to suggestions for tackling this issue.