I am looking into an application that needs to detect delays in receiving video frames and take action when a delay is detected. A delay in receiving video frames is perceived as a video freeze on the render window. The action is the insertion of an IMU frame into the live video while the freeze lasts. The pipelines are as follows:
The Tx and Rx are connected in ad-hoc mode over WiFi, with no other devices involved. Only video is transmitted; audio is not a concern here.
Tx(iMX6 device):
v4l2src fps-n=30 -> h264encode -> rtph264pay -> rtpbin -> udpsink(port=5000) ->
rtpbin.send_rtcp(port=5001) -> rtpbin.recv_rtcp(port=5002)
Rx(ubuntu PC):
udpsrc(port=5000) -> rtpbin -> rtph264depay -> avdec_h264 -> rtpbin.recv_rtcp(port=5001) ->
rtpbin.send_rtcp(port=5002) -> custom IMU frame insertion plugin -> videosink
Now, as per my application, I intend to detect the delay in receiving frames at the Rx device. The delay can be caused by a number of factors, including:
- congestion
- packet loss
- noise, etc.
Once a delay is detected, I intend to insert an IMU (inertial measurement unit) frame (a custom visualization) into the live video. For example, if every 3rd frame is delayed, the video will look like:
V | V | I | V | V | I | V | V | I | V | .....
where V = a video frame received and I = an IMU frame inserted at the Rx device.
Hence, to achieve this, I need to know the timestamp of the video frame when it was sent from the Tx, and compare that timestamp with the current time at the Rx device to get the transmission delay:
frame delay = current time at Rx - timestamp of frame at Tx
Since I am working at 30 fps, I should ideally receive a video frame at the Rx device every ~33 ms. Given that the link is WiFi, and that there are other delays including encoding/decoding, I understand that 33 ms precision is difficult to achieve, and that is perfectly fine for me.
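In code, the check I have in mind is roughly the following (a minimal sketch; is_delayed and the jitter margin are placeholders of mine, and it presumes the common Tx/Rx clock I ask about below):

    FRAME_PERIOD = 1.0 / 30        # ~33.3 ms at 30 fps
    JITTER_MARGIN = FRAME_PERIOD   # tolerate one extra frame period of WiFi jitter

    def is_delayed(rx_now, tx_timestamp):
        # True if the frame took noticeably longer than one frame period.
        # Assumes rx_now and tx_timestamp come from a common, synchronized clock.
        return (rx_now - tx_timestamp) > FRAME_PERIOD + JITTER_MARGIN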
Since I am using RTP/RTCP, I had a look at WebRTC, but its SR/RR reports carry network statistics for only a fraction of the data sent from Tx -> Rx. I also tried the udpsrc timeout feature, which detects that no packets have arrived at the source for a predefined time and posts a message on the bus notifying the timeout. However, this works only if the Tx device stops completely (pipeline stopped with Ctrl+C). If the packets are merely delayed, the timeout does not fire, since the kernel still buffers some old data.
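For completeness, this is roughly how I used the timeout (a minimal sketch in Python with a stripped-down pipeline; the real one has rtpbin, depayloader, decoder, etc.):

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    Gst.init(None)

    # Stripped-down Rx pipeline; fakesink stands in for the real decode/render chain
    pipeline = Gst.parse_launch(
        'udpsrc port=5000 timeout=100000000 '   # timeout is in nanoseconds (100 ms)
        'caps="application/x-rtp,media=video,clock-rate=90000,encoding-name=H264" '
        '! fakesink'
    )

    def on_message(bus, msg):
        # udpsrc posts an element message named GstUDPSrcTimeout after inactivity
        if msg.type == Gst.MessageType.ELEMENT:
            s = msg.get_structure()
            if s is not None and s.get_name() == "GstUDPSrcTimeout":
                print("udpsrc timeout: no packets received")

    bus = pipeline.get_bus()
    bus.add_signal_watch()
    bus.connect("message", on_message)

    pipeline.set_state(Gst.State.PLAYING)
    GLib.MainLoop().run()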
I have the following questions:
- Does it make sense to use the timestamps of each video frame / RTP buffer to detect the delay in receiving frames at the Rx device? What would be a better design for such a use case? Or is considering the timestamp of every frame/buffer too much overhead, so that I should only consider a subset, e.g. every 5th or every 10th frame/buffer? Note also that RTP packets do not map 1:1 to frames: since one encoded frame can be fragmented across several RTP packets, a 30 fps video can yield more than 30 RTP buffers per second in GStreamer. Considering the worst case possible, where every alternate frame is delayed, the video would have the following sequence:
V | I | V | I | V | I | V | I | V | I | .....
I understand that handling every alternate frame with this precision can be difficult, so I am targeting detection and insertion of the IMU frame within 66 ms at most. The switching between live video frames and inserted frames is also a concern. I use the OpenGL plugins to do the IMU data manipulation.
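One low-overhead design I am considering is to not use Tx timestamps at all and instead watch inter-arrival gaps at the Rx with a pad probe plus a watchdog timer, so the 66 ms deadline can fire even while the late frame is still in flight. A sketch, assuming the pipeline object from the snippet above; the element name "dec" and the thresholds are placeholders of mine:

    import time
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    FRAME_PERIOD_NS = Gst.SECOND // 30    # ~33 ms at 30 fps
    THRESHOLD_NS = 2 * FRAME_PERIOD_NS    # my 66 ms detection target

    last_arrival_ns = [None]  # boxed so both callbacks can update/read it

    def on_buffer(pad, info):
        # Fires for every decoded frame; just record its arrival time
        last_arrival_ns[0] = time.monotonic_ns()
        return Gst.PadProbeReturn.OK

    def watchdog():
        # Runs every 10 ms whether or not frames arrive, so a freeze is
        # detected within the deadline, not only when the late frame shows up
        if last_arrival_ns[0] is not None:
            gap = time.monotonic_ns() - last_arrival_ns[0]
            if gap > THRESHOLD_NS:
                print("frame overdue by %.1f ms -> switch to IMU frame" % (gap / 1e6))
        return True  # keep the GLib timer running

    # "dec" would be the name given to avdec_h264 in the real pipeline
    decoder = pipeline.get_by_name("dec")
    decoder.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, on_buffer)
    GLib.timeout_add(10, watchdog)

This only measures Rx-side inter-arrival time, not the actual Tx-to-Rx delay, but it needs no common clock between the devices.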
- Which timestamps should I be considering at the Rx device? To calculate the delay, I need a common reference between the Tx and Rx devices, which I do not have. I can access the PTS and DTS of the RTP buffers, but since no common reference is available, I could not use them to detect the delay. Is there any other way I could do this?
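From what I have read, RTCP Sender Reports might provide exactly this reference: each SR pairs the sender's NTP wall-clock time with the RTP timestamp at that instant (RFC 3550, section 6.4.1), and rtpbin/rtpjitterbuffer expose this mapping (e.g. via the jitterbuffer's "handle-sync" signal). If the Rx system clock were synchronized to the Tx clock (say via NTP or PTP over the ad-hoc link), the one-way delay of every frame could then be computed. A sketch of the arithmetic, with all names being placeholders of mine:

    CLOCK_RATE = 90000  # from the caps

    def tx_send_time(rtp_ts, sr_ntp_secs, sr_rtp_ts):
        # Map a frame's RTP timestamp to the sender's wall clock using the
        # most recent RTCP Sender Report (NTP time, RTP time) pair
        delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF   # RTP timestamps wrap at 2**32
        if delta >= 2**31:                          # re-center into a signed range
            delta -= 2**32
        return sr_ntp_secs + delta / CLOCK_RATE

    def one_way_delay(rx_wallclock_secs, rtp_ts, sr_ntp_secs, sr_rtp_ts):
        # Only meaningful if the Rx wall clock is synchronized to the Tx wall clock
        return rx_wallclock_secs - tx_send_time(rtp_ts, sr_ntp_secs, sr_rtp_ts)

Is this a reasonable direction, or does GStreamer already provide something that does this for me?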
My caps have the following parameters (only a few shown):
caps = application/x-rtp, clock-rate=90000, timestamp-offset=2392035930, seqnum-offset=23406
Can these be used to calculate a common reference between Tx and Rx? I am not sure I understand these numbers or how to use them at the Rx device to derive a reference. Any pointers on understanding these parameters?
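My current understanding, which I would like to have confirmed: clock-rate=90000 means the RTP timestamp advances 90000 ticks per second (3000 ticks per frame at 30 fps), while timestamp-offset and seqnum-offset are just the random per-session starting values of the RTP timestamp and sequence number. That would let me convert RTP timestamps into time relative to the start of the stream, but gives no Tx wall-clock reference by itself:

    clock_rate = 90000
    timestamp_offset = 2392035930   # random per-session origin, from the caps

    def stream_time_secs(rtp_ts):
        # Seconds since the first RTP packet -- stream-relative only,
        # says nothing about the Tx wall clock
        return ((rtp_ts - timestamp_offset) & 0xFFFFFFFF) / clock_rate

    # consecutive 30 fps frames differ by 3000 ticks: 3000 / 90000 = 33.3 ms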
- Are there any other possible approaches for such an application? My idea above could be too impractical, and I am open to suggestions for tackling this issue.