How can I attach a track of accurately aligned, real-time "additional" data to live-streamed audio? I'm primarily interested in the browser here, but ideally the solution would work on any platform.
The idea is: if I have a live recording from my computer being sent to Icecast via something like DarkIce, I want a listener (who could join the stream at any time) to be able to place some kind of annotation over a few of the samples and send only the annotation back (for example, via a regular HTTP request). However, this needs a mechanism to align the annotation with the dumped stream audio on the server side, and in a live stream the listener, AFAIK, can't actually get their timestamp within the "whole" stream, only relative to when they joined. But if there were some kind of simultaneously aligned metadata, then perhaps this would be possible.
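To illustrate what I mean, here's a rough sketch (all names and the metadata field are hypothetical, this is just the shape of the idea): if the server periodically injected a metadata tick carrying the absolute sample offset of the stream at that instant, the client could compute an absolute position for any annotation, even after joining late, and send only that position back over HTTP.

```typescript
// Hypothetical metadata tick as the client might receive it. The
// `absoluteSample` field is an assumption: something the server would
// have to inject alongside the audio.
interface MetadataTick {
  absoluteSample: number; // server-side sample count at the moment of the tick
  clientTimeMs: number;   // client clock (e.g. performance.now()) when received
}

// Compute the absolute sample index for an annotation made at
// `annotationTimeMs` on the client clock, given the most recent tick
// and the stream's sample rate.
function annotationSample(
  tick: MetadataTick,
  annotationTimeMs: number,
  sampleRate: number
): number {
  const elapsedMs = annotationTimeMs - tick.clientTimeMs;
  return Math.round(tick.absoluteSample + (elapsedMs / 1000) * sampleRate);
}
```

The annotation payload sent back would then only need this absolute sample index plus the annotation body itself, and the server could line it up against its own dump of the stream.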
The problem is that most systems seem to assume you "pre-caption" or multiplex your data streams beforehand, which doesn't make sense for something being recorded and live-streamed in real time. Google's examples mostly cover their "live captioning" ability, which is about processing audio in real time and then adding slightly delayed captions via speech recognition; that isn't what I'm after. I've looked into the various ways data can be put into Ogg containers, as well as current captioning formats like WebVTT, and I'm struggling to find examples of this.
I found what may be a hint here: https://github.com/w3c/webvtt/issues/320, and I've been recommended to look for examples by Apple and Google using WebVTT for something along these lines, but I cannot find these demos. There's older tech as well (Kate, CMML, Annodex, etc.), but none of it is in use any more, having been effectively replaced by WebVTT. Perhaps I could achieve something like this with WebRTC, but I'm not sure it gives any guarantees on alignment, and it's a slightly different technology stack from what I'm looking at in this scenario.