How can I attach a track of accurately aligned, real-time "additional" data to live-streamed audio? I'm primarily interested in the browser here, but ideally the solution would work on any platform.
The idea is: if I have a live recording from my computer being sent to Icecast via something like DarkIce, I want a listener (who could join the stream at any time) to be able to place some kind of annotation over a few of the samples and send only the annotation back (for example, via a regular HTTP request). However, this needs a mechanism to align the annotation with the dumped stream audio on the server side, and in a live stream the listener, AFAIK, can't actually get their timestamp within the "whole" stream, only relative to when they joined. But if there were some kind of simultaneously aligned metadata, then perhaps this would be possible.
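To illustrate what I mean, here's a rough sketch (all names and the metadata field are hypothetical, this is just the shape of the idea): if the server periodically injected a metadata tick carrying the absolute sample offset of the stream at that instant, the client could compute an absolute position for any annotation, even after joining late, and send only that position back over HTTP.

```typescript
// Hypothetical metadata tick as the client might receive it. The
// `absoluteSample` field is an assumption: something the server would
// have to inject alongside the audio.
interface MetadataTick {
  absoluteSample: number; // server-side sample count at the moment of the tick
  clientTimeMs: number;   // client clock (e.g. performance.now()) when received
}

// Compute the absolute sample index for an annotation made at
// `annotationTimeMs` on the client clock, given the most recent tick
// and the stream's sample rate.
function annotationSample(
  tick: MetadataTick,
  annotationTimeMs: number,
  sampleRate: number
): number {
  const elapsedMs = annotationTimeMs - tick.clientTimeMs;
  return Math.round(tick.absoluteSample + (elapsedMs / 1000) * sampleRate);
}
```

The annotation payload sent back would then only need this absolute sample index plus the annotation body itself, and the server could line it up against its own dump of the stream.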
The problem is that most systems seem to assume you "pre-caption" or multiplex your data streams beforehand, which doesn't make sense for something being recorded and live-streamed in real time. Google's examples mostly cover their "live captioning" ability, which is about processing audio in real time and then adding slightly delayed captions via speech recognition; that isn't what I'm after. I've looked into the various ways data can be put into Ogg containers, as well as current captioning formats like WebVTT, and I'm struggling to find examples of this.
I found what may be a hint here: https://github.com/w3c/webvtt/issues/320, and I've been recommended to look for examples by Apple and Google using WebVTT for something along these lines, but I cannot find these demos. There's older tech as well (Kate, CMML, Annodex, etc.), but none of it is in use any more, having been effectively replaced by WebVTT. Perhaps I could achieve something like this with WebRTC, but I'm not sure it gives any guarantees on alignment, and it's a slightly different technology stack from what I'm looking at in this scenario.