I am building a WebRTC application in which users can share their camera and their screen. When a client receives a stream/track, it needs to know whether it is a camera stream or a screen recording stream. This distinction is obvious at the sending end, but the distinction is lost by the time the tracks reach the receiving peer.

Here's some sample code from my application:

// Note the distinction between streams is obvious at the sending end.
const localWebcamStream = await navigator.mediaDevices.getUserMedia({ ... });
const screenCaptureStream = await navigator.mediaDevices.getDisplayMedia({ ... });

// This is called by signalling logic
function addLocalTracksToPeerConn(peerConn) {
  // Our approach here loses information because our two distinct streams 
  // are added to the PeerConnection's homogeneous bag of streams

  for (const track of screenCaptureStream.getTracks()) {
    peerConn.addTrack(track, screenCaptureStream);
  }

  for (const track of localWebcamStream.getTracks()) {
    peerConn.addTrack(track, localWebcamStream);
  }
}

// This is called by signalling logic
function handleRemoteTracksFromPeerConn(peerConn) {
  peerConn.ontrack = ev => {
    const stream = ev.streams[0];
    if (stream is a camera stream) {  // FIXME how to distinguish reliably?
      remoteWebcamVideoEl.srcObject = stream;
    }
    else if (stream is a screen capture) {  // FIXME how to distinguish reliably?
      remoteScreenCaptureVideoEl.srcObject = stream;
    }
  };
}

My ideal imaginary API would allow adding a .label to a track or stream, like this:

// On sending end, add arbitrary metadata
track.label = "screenCapture";
peerConn.addTrack(track, screenCaptureStream);

// On receiving end, retrieve arbitrary metadata
peerConn.ontrack = ev => {
  const trackType = ev.track.label;  // get the label when receiving the track
};

But this API does not really exist. There is a MediaStreamTrack.label property, but it is read-only and not preserved in transmission. By experimentation, the .label property at the sending end is informative (e.g. label: "FaceTime HD Camera (Built-in) (05ac:8514)"), but at the receiving end the .label for the same track is not preserved. (It appears to be replaced with the .id of the track, in Chrome at least.)

This article by Kevin Moreland describes the same problem, and recommends a mildly terrifying solution: munge the SDP on the sending end, and then grep the SDP on the receiving end. But this solution feels very fragile and low-level.

I know there is a MediaStreamTrack.id property. There is also a MediaStream.id property. Both of these appear to be preserved in transmission. This means I could send the metadata in a side-channel, such as the signalling channel or a DataChannel. From the sending end, I would send { "myStreams": { "screen": "<some stream id>", "camera": "<another stream id>" } }. The receiving end would wait until it has both the metadata and the stream before displaying anything. However, this approach introduces a side-channel (and inevitable concurrency challenges associated with that), where a side-channel feels unnecessary.

I'm looking for an idiomatic, robust solution. How do I label/identify MediaStreams at the sending end, so that the receiving end knows which stream is which?

jameshfisher

3 Answers


I ended up sending this metadata in the signaling channel. Each signaling message that contains a SessionDescription (SDP) now also contains a metadata object alongside it, which annotates the MediaStreams described in the SDP. This has no concurrency issues, because clients always receive the SDP+metadata for a MediaStream before the track event is fired for that MediaStream.

So previously I had signaling messages like this:

{
  "kind": "sessionDescription",

  // An RTCSessionDescriptionInit
  "sessionDescription": { "type": "offer", "sdp": "..." }
}

Now I have signaling messages like this:

{
  "kind": "sessionDescription",

  // An RTCSessionDescriptionInit
  "sessionDescription": { "type": "offer", "sdp": "..." },

  // A map from MediaStream IDs to arbitrary domain-specific metadata
  "mediaStreamMetadata": {
    "y6w4u6e57654at3s5y43at4y5s46": { "type": "camera" },
    "ki8a3greu6e53a4s46uu7dtdjtyt": { "type": "screen" }
  }
}
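
A minimal sketch of how the two ends could build and consume such a message. The helper names `buildSessionDescriptionMessage` and `streamTypeOf`, and the `streamsByType` map, are hypothetical and not part of any WebRTC API; the sketch only assumes that each stream has a string `id` and that the same `id` shows up on the remote stream.

```javascript
// Sender: annotate the ids of the streams that were passed to addTrack().
// `streamsByType` is a hypothetical map such as
// { camera: localWebcamStream, screen: screenCaptureStream }.
function buildSessionDescriptionMessage(sessionDescription, streamsByType) {
  const mediaStreamMetadata = {};
  for (const [type, stream] of Object.entries(streamsByType)) {
    mediaStreamMetadata[stream.id] = { type };
  }
  return {
    kind: "sessionDescription",
    sessionDescription: {
      type: sessionDescription.type,
      sdp: sessionDescription.sdp,
    },
    mediaStreamMetadata,
  };
}

// Receiver: look up the type of a remote stream in the latest metadata.
// Returns undefined for streams that were never annotated.
function streamTypeOf(stream, mediaStreamMetadata) {
  const entry = mediaStreamMetadata[stream.id];
  return entry ? entry.type : undefined;
}

// In the browser, the receiving side might be wired up roughly like this
// (`latestMetadata` is whatever was stored from the last signaling message):
//   peerConn.ontrack = ev => {
//     const stream = ev.streams[0];
//     const type = streamTypeOf(stream, latestMetadata);
//     if (type === "camera") remoteWebcamVideoEl.srcObject = stream;
//     else if (type === "screen") remoteScreenCaptureVideoEl.srcObject = stream;
//   };
```

Keeping the lookup in a plain function makes it easy to test without a real RTCPeerConnection.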
jameshfisher
  • 2
    Good answer. You might want to mention this works because `screenCaptureStream` and `localWebcamStream` get replicated remotely with matching `id`s because you mentioned them in `addTrack`. – jib Dec 22 '20 at 21:36

A more canonical approach to signalling a custom stream label is to modify the SDP before sending it (but after calling setLocalDescription), rewriting the msid attribute (which stands for media stream id; see the specification). The advantage here is that the remote end parses the msid attribute and exposes it as the id of the stream in the ontrack event. See this fiddle.

Note that you cannot make any assumptions about the track id. In Firefox, the track id in the SDP does not even match the track id on the sender side.
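
A rough sketch of what such an msid rewrite could look like, working line by line rather than with one regex over the whole SDP. The function name `relabelStreamInSdp` is hypothetical. The sketch assumes CRLF line endings (as browsers produce), assumes the stream is referenced as `a=msid:<stream id> <track id>` and possibly in the older `a=ssrc:<n> msid:<stream id> <track id>` form, and restricts labels to a conservative subset of the characters an msid may contain.

```javascript
// Replace one MediaStream id with a custom label in every msid reference.
// Assumes CRLF-separated SDP; other lines are passed through untouched.
function relabelStreamInSdp(sdp, streamId, label) {
  // Conservative check: letters, digits, "_" and "-", at most 64 characters.
  if (!/^[A-Za-z0-9_-]{1,64}$/.test(label)) {
    throw new Error("label must be 1-64 characters from [A-Za-z0-9_-]");
  }
  return sdp
    .split("\r\n")
    .map(line => {
      // Media-level form: a=msid:<stream id> <track id>
      if (line.startsWith("a=msid:")) {
        const [id, ...rest] = line.slice("a=msid:".length).split(" ");
        return id === streamId ? ["a=msid:" + label, ...rest].join(" ") : line;
      }
      // Older source-level form: a=ssrc:<ssrc> msid:<stream id> <track id>
      const m = /^(a=ssrc:\d+ msid:)(\S+)(.*)$/.exec(line);
      if (m && m[2] === streamId) return m[1] + label + m[3];
      return line;
    })
    .join("\r\n");
}

// Possible use on the sending side (`signaling` is a hypothetical channel):
//   await pc.setLocalDescription(await pc.createOffer());
//   const sdp = relabelStreamInSdp(
//     pc.localDescription.sdp, screenCaptureStream.id, "screen");
//   signaling.send({ type: pc.localDescription.type, sdp });
// The receiver should then see ev.streams[0].id === "screen" in ontrack.
```

For anything beyond a sketch, a real SDP parser is probably a safer choice than string handling like this.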

Philipp Hancke
  • My main problem is that experience has taught me to see FLASHING RED ALARMS whenever a regex is used on structured data ... I would be more comfortable if there were a proper SDP parser/serializer, or a sensible API for structured editing of the SDP – jameshfisher Dec 30 '20 at 20:06
  • 1
    see https://github.com/otalk/sdp which I wrote or https://github.com/clux/sdp-transform And yeah, treating SDP as stringsoup *cough*... – Philipp Hancke Dec 30 '20 at 21:49

A third way is to rely on the deterministic order of transceivers:

const pc1 = new RTCPeerConnection(), pc2 = new RTCPeerConnection();

go.onclick = () => ["Y","M","C","A"].forEach(l => pc1.addTrack(getTrack(l)));

pc2.ontrack = ({track, transceiver}) => {
  const video = [v1, v2, v3, v4][pc2.getTransceivers().indexOf(transceiver)];
  video.srcObject = new MediaStream([track]);
};

pc1.onicecandidate = e => e.candidate && pc2.addIceCandidate(e.candidate);
pc2.onicecandidate = e => e.candidate && pc1.addIceCandidate(e.candidate);
pc1.onnegotiationneeded = async () => {
  await pc1.setLocalDescription(await pc1.createOffer());
  await pc2.setRemoteDescription(pc1.localDescription);
  await pc2.setLocalDescription(await pc2.createAnswer());
  await pc1.setRemoteDescription(pc2.localDescription);
};

function getTrack(txt, width = 100, height = 100, font = "100px Arial") {
  const can = Object.assign(document.createElement("canvas"), {width, height});
  const ctx = Object.assign(can.getContext('2d'), {font});
  // Redraw on every animation frame so the captured stream keeps producing frames.
  requestAnimationFrame(function draw() {
    ctx.fillStyle = '#eeeeee';
    ctx.fillRect(0, 0, width, height);
    ctx.fillStyle = "#000000";
    ctx.fillText(txt, width/2 - 14*width/32, height/2 + 10*height/32);
    requestAnimationFrame(draw);
  });
  return can.captureStream().getTracks()[0];
}
<button id="go">Go!</button><br>
<video id="v1" autoplay></video>
<video id="v2" autoplay></video>
<video id="v3" autoplay></video>
<video id="v4" autoplay></video>
<div id="div"></div>

This works well when you're in control of negotiation, like when initial negotiation happens from one side only.

It works less well when both sides can initiate negotiation, because when both sides create transceivers, their order isn't necessarily deterministic anymore.

In those cases you're better off signaling ids like transceiver.mid or stream.id out of band like the other answers show. I cover this in detail in my blog.
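
As a rough sketch of the `transceiver.mid` variant: the sender builds a map from mid to an application-level type once mids have been assigned (after setLocalDescription), sends it out of band, and the receiver resolves each track event through its transceiver's mid. The helper names `describeTransceivers` and `typeForTrackEvent`, and the `typeByTrackId` map, are hypothetical.

```javascript
// Sender: call after setLocalDescription(), once each transceiver has a mid.
// `typeByTrackId` is a hypothetical map from local track ids to app-level
// types, e.g. { [webcamTrack.id]: "camera", [screenTrack.id]: "screen" }.
function describeTransceivers(transceivers, typeByTrackId) {
  const typeByMid = {};
  for (const { mid, sender } of transceivers) {
    const track = sender.track;
    // Skip transceivers that are not negotiated yet or are not sending a known track.
    if (mid !== null && track && typeByTrackId[track.id]) {
      typeByMid[mid] = typeByTrackId[track.id];
    }
  }
  return typeByMid;
}

// Receiver: resolve the type for a track event via its transceiver's mid.
// Returns undefined when the mid was not described by the sender.
function typeForTrackEvent({ transceiver }, typeByMid) {
  return typeByMid[transceiver.mid];
}

// Possible wiring in the browser (`signaling` and `videoByType` are hypothetical):
//   signaling.send({ typeByMid: describeTransceivers(pc1.getTransceivers(), types) });
//   pc2.ontrack = ev => {
//     const video = videoByType[typeForTrackEvent(ev, receivedTypeByMid)];
//     if (video) video.srcObject = new MediaStream([ev.track]);
//   };
```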

jib
  • 2
    Nice. I was not aware these are ordered. I think I prefer my solution because it makes fewer assumptions, but it's good to know :-) – jameshfisher Dec 22 '20 at 23:34