Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RTCRtpEncodedSource and explainer #198

Closed
wants to merge 0 commits into from
Closed

Add RTCRtpEncodedSource and explainer #198

wants to merge 0 commits into from

Conversation

guidou
Copy link

@guidou guidou commented Feb 8, 2024

Add an explainer for the upcoming RTCRtpSenderEncodedSource extension proposal


Preview | Diff

@henbos henbos self-requested a review February 8, 2024 15:39
@guidou guidou force-pushed the main branch 2 times, most recently from 0e50773 to d059b2b Compare March 5, 2024 14:12
@guidou guidou changed the title Add relay-explainer.md Add RTCRtpEncodedSource Mar 5, 2024
@guidou guidou changed the title Add RTCRtpEncodedSource Add RTCRtpEncodedSource and explainer Mar 5, 2024
@guidou
Copy link
Author

guidou commented Mar 5, 2024

@aboba
Copy link
Contributor

aboba commented Mar 7, 2024

It looks like this PR inadvertently removes Section 14 Event Summary. I fixed this.

@guidou
Copy link
Author

guidou commented Mar 14, 2024

It looks like this PR inadvertently removes Section 14 Event Summary. I fixed this.

Thanks!

@youennf
Copy link
Contributor

youennf commented Mar 14, 2024

My main thought is that this API requires to wait for all packets of a frame, which creates some latency. This solution does not allow to do what an SFU is doing with the same level of performance. It might be a good enough compromise. If we already consider supporting packet forwarding, maybe that is what we should do instead.

API wise, transferring seems harder than using what we have done for RTCRtpScriptTransform. We should consider the pros and cons of both approaches.
For instance, the current proposal is allowing to transfer to cross agent cluster dedicated workers, which is not allowed with RTCRtpScriptTransform. I see that more as an issue than a feature.
I also think that an API where you create the source where you use it (instead of creating in worker and then transferring) is slightly easier to use.
From an implementation point of view, it would also be simpler (at least in Safari) compared to supporting transferability.
From a spec point of view, it would also be simpler since we would not have to deal with when you can transfer and when you cannot.

It is not clear why RTCRtpSenderEncodedSource is not providing a WritableStream like done for RTCRtpScriptTransform.

@guidou
Copy link
Author

guidou commented Mar 18, 2024

My main thought is that this API requires to wait for all packets of a frame, which creates some latency. This solution does not allow to do what an SFU is doing with the same level of performance. It might be a good enough compromise. If we already consider supporting packet forwarding, maybe that is what we should do instead.

This is a good compromise for the use case of glitch-free forwarding with multiple input peer-connections with failover. In this case, frames provide a convenient abstraction to do failover quickly (no need for timeouts, just forward a frame from the first peer connection that provides it). It is also ergonomic for some other SFU-like operations where the outcome is frame-based (e.g., drop frames that don't satisfy a certain property).

There is also an effort going on for a packet-level API, although we have not heard yet about developers interested in that API for the use case of glitch-free forwarding using multiple input peer connections. There are other use cases driving the design of that API.

API wise, transferring seems harder than using what we have done for RTCRtpScriptTransform. We should consider the pros and cons of both approaches. For instance, the current proposal is allowing to transfer to cross agent cluster dedicated workers, which is not allowed with RTCRtpScriptTransform. I see that more as an issue than a feature. I also think that an API where you create the source where you use it (instead of creating in worker and then transferring) is slightly easier to use.

We actually only need to transfer to workers within the same agent cluster, so maybe a clarification is needed in the text of the proposed spec. However, I do agree with you that it is better to be able to create the source where you use it.

The main reasons the proposal exposes RTCRtpSenderEncodedSource only on DedicatedWorker are:

  • Your proposal on which this is text is based exposes it only on workers
  • In other similar APIs we have always had agreement about exposing on DedicatedWorker, and disagreement about exposing on Window. So, I thought exposing only on DedicatedWorker would lead to faster consensus, while the larger discussion about exposing things on Window could continue separately.

From an implementation point of view, it would also be simpler (at least in Safari) compared to supporting transferability From a spec point of view, it would also be simpler since we would not have to deal with when you can transfer and when you cannot.

I agree with you on this point. The idea was to deviate as little as possible from w3c/webrtc-encoded-transform#211 (comment) in order to make it easier to achieve consensus, but if you agree about exposing RTCRtpSenderEncodedSource on Window, then we can simplify this by removing transferrability.

It is not clear why RTCRtpSenderEncodedSource is not providing a WritableStream like done for RTCRtpScriptTransform.

I agree that a WritableStream would provide more flexibility here. Again, the reason was to minimize deviations from w3c/webrtc-encoded-transform#211 (comment)

If I understand correctly, your concerns can be summarized as follows:

  • The extra latency a frame-based API has compared with a potential packet-based API
  • RTCRtpSenderEncodedSource is exposed only on DedicatedWorker.
  • RTCRtpSenderEncodedSourceHandle is transferrable.
  • RTCRtpSenderEncodedSource does not expose a WritableStream to write frames.

We have already presented arguments in favor of the frame-based API and in previous discussions we have concluded that while latency may be a small disadvantage in some cases, frame-based has some clear advantages, in particular for the scenario of forwarding with glitch-free failover over multiple input peer connections.

The concerns about the spec text I think can be addressed as follows:

  • Expose RTCRtpSenderEncodedSource on Window.
  • Make RTCRtpSenderEncodedSourceHandle not transferrable, or maybe just remove it and use the source directly.
  • Expose a WritableStream instead of an enqueue() method.

WDYT?

@guidou guidou removed the request for review from henbos March 18, 2024 13:55
@youennf
Copy link
Contributor

youennf commented Mar 18, 2024

I would summarise my concern as:

  • The potential redundancy and the extra latency a frame-based API has compared with a potential packet-based API
    My other non blocking comment is API shape (keeping the feature set as it is, meaning let's keep it to dedicated worker in this version):
  • Evaluate transfer-based vs. script-transform-like API

As I commented on w3c/webrtc-encoded-transform#211 (comment), there are two different API proposals. The first one is well described in the explainer.
The second is a script transform like API, something like:

[Exposed=Window]
interface RTCRtpSenderEncodedSource {
    constructor(Worker worker, optional any options, optional sequence<object> transfer)
};
[Exposed=DedicatedWorker]
interface RTCRtpSenderEncodedSourceController {
    WritableStream writable; // Or enqueue method.
    // Need congestion API, error API and enqueue API
};
partial interface DedicatedWorkerGlobalScope {
    attribute EventHandler onrtcencodedsource;
}

From a user perspective, there is probably no difference.
There are probably some differences in terms of future-proofness, ease of use, ease of spec/dev...

Comment on lines 41 to 91
```
// code in main.js file
const worker = new Worker('worker.js');

// Let recvPc1, recvPc2 be the receiving PCs.
recvPc{1|2}.ontrack = evt => {
evt.receiver.transform = new RTCRtpScriptTransform(worker, { name: "receiverTransform" });
};


// Let relayPc be the PC used to relay frames to the next peer.
worker.onmessage = evt => {
relayPc.replaceTrack(evt.data);
};
```

```
// code in worker.js file
async function relayFrames(reader, writer, encodedSource) {
if(!reader || !writer || !encodedSource){
return;
}
while (true) {
const {frame, done} = await reader.read();
if (done) return;

let newFrame = new RTCRtpEncodedVideoFrame(frame, getUnifiedMetadata(frame));
if(!isDuplicate(newFrame)) {
encodedSource.enqueue(newFrame);
}
// Put the original frame back in the receiver PC
writer.write(frame);
}
}

// Code to instantiate reader and writer from the RTPReceiver and RTPSender.
onrtctransform = (event) => {
if (event.transformer.options.name == "receiverTransform") {
reader = event.transformer.readable;
writer = event.transformer.writable;
if (!encodedSource) {
encodedSource = new RTCRtpSenderEncodedSource();
postMessage(encodedSource.handle);
}
} else {
return;
}

relayFrames(reader, writer, encodedSource);
};
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guidou I think I found a couple of bugs and I'd prefer if we used .pipeTo. LMK if I got this right:

// main.js
const worker = new Worker('worker.js');
// Let recvPc1, recvPc2 be the receiving PCs. 
recvPc1.ontrack = ({receiver}) =>
  receiver.transform = new RTCRtpScriptTransform(worker, {name: "receiverTransform"});

recvPc2.ontrack = ({receiver}) =>
  receiver.transform = new RTCRtpScriptTransform(worker, {name: "receiverTransform"});

// Let relayPc be the PC used to relay frames to the next peer.
const [sender] = relayPc.getSenders();
worker.onmessage = async ({data}) => await sender.replaceTrack(data.handle);


// worker.js
let encodedSource;

onrtctransform = async ({transformer: {readable, writable, options}}) => {
  if (options.name != "receiverTransform") return;
  if (!encodedSource) {
    encodedSource = new RTCRtpSenderEncodedSource();
    postMessage({handle: encodedSource.handle});
  }
  await readable.pipeThrough(new TransformStream({transform})).pipeTo(writable);

  function transform(frame, controller) {
    const newFrame = new RTCRtpEncodedVideoFrame(frame, getUnifiedMetadata(frame));
    if (!isDuplicate(newFrame)) {
      encodedSource.enqueue(newFrame);
    }
    controller.enqueue(frame);
  }
}

But how well will this will work? recvPc1 and recvPc2 are racing here to provide frames to a single outgoing relayPC? You mention redundancy as the reason, but it's odd seeing recvPc1, recvPc2 being merged into a single output. It looks like fan-in, not a fan-out. Is this important to emphasize in the example?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a couple of bugs and I'd prefer if we used .pipeTo. LMK if I got this right:
Yes, you got it right. I'll update the explainer to use this version.

But how well will this will work? recvPc1 and recvPc2 are racing here to provide frames to a single outgoing relayPC? You mention redundancy as the reason, but it's odd seeing recvPc1, recvPc2 being merged into a single output. It looks like fan-in, not a fan-out. Is this important to emphasize in the example?

Yes, I would call this fan-in and it is key for glitch-free failover, so it important to emphasize it.
I will also add fan-out to the example for completeness.

Copy link
Member

@jan-ivar jan-ivar Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what about the racing? onrtctransform is going to fire twice here and free-run pumping frames from recvPc1 and recvPc2, two independent peer connections, through the transform function in parallel. What syncs up the output from these two independent sources? What assures these encoded frames have any relation? isDuplicate?

For fail-over, what's wrong with replaceTrack?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redo based on @youennf's proposed API:

// main.js
const worker = new Worker('worker.js');
// Let recvPc1, recvPc2 be the receiving PCs. 
recvPc1.ontrack = ({receiver}) =>
  receiver.transform = new RTCRtpScriptTransform(worker, {name: "receiverTransform"});

recvPc2.ontrack = ({receiver}) =>
  receiver.transform = new RTCRtpScriptTransform(worker, {name: "receiverTransform"});

// Let relayPc be the PC used to relay frames to the next peer.
const [sender] = relayPc.getSenders();
const encodedSource = new RTCRtpSenderEncodedSource(worker);
await sender.replaceTrack(encodedSource);

// worker.js
let encodedWritable;
onrtcencodedsource = ({controller: {writable}}) => encodedWritable = writable;

onrtctransform = async ({transformer: {readable, writable, options}}) => {
  if (options.name != "receiverTransform") return;
  const [readable1, readable2] = readable.tee();
  await Promise.all([
    readable1.pipeTo(writable),
    readable2.pipeThrough(new TransformStream({transform})).pipeTo(encodedWritable)
  ]);

  function transform(frame, controller) {
    const newFrame = new RTCRtpEncodedVideoFrame(frame, getUnifiedMetadata(frame));
    if (!isDuplicate(newFrame)) {
      controller.enqueue(newFrame);
    }
  }
}

This doesn't solve my racing concerns, but I thought it might be helpful to show what the application code might look like with the different API shapes.

E.g. note this uses the tee() function which might run into whatwg/streams#1156 since unlike WebCodecs our encoded chunks are not immutable (should be fine in this case since there's no transform on readable1, but it's brittle).

@jan-ivar
Copy link
Member

jan-ivar commented Mar 18, 2024

How does this stack up against just optimizing the obvious/inoptimal API? E.g.

// Let relayPc be the PC used to relay frames to the next peer.
const [sender] = relayPc.getSenders();
// Let recvPc be the receiving PC
recvPc.ontrack = ({receiver}) => await sender.replaceTrack(receiver.track);
sender.transform = new RTCRtpScriptTransform(worker, {});

Browsers could detect when the input is a receiver.track and forward received RTCEncodedVideoFrames directly on the sender transform's readable (skipping decode and re-encode) where JS can modify metadata.

@guidou
Copy link
Author

guidou commented Mar 20, 2024

I would summarise my concern as:

  • The potential redundancy and the extra latency a frame-based API has compared with a potential packet-based API

My position about this is that a frame-based API and a packet-based API offer different tradeoffs and therefore support different use cases.
For use cases where the decision is based on frame properties, the extra latency of the frame-based API is not a problem since you have to wait for the whole frame to make the decision. For the glitch-free failover case (a.k.a. fan-in) it is also good because it provides an ergonomic mechanism that increases reliability without introducing timeouts (other than the latency of waiting for a frame).
Also, an important practical consideration is that we have production-tested building blocks that make it easy to deploy a frame-based API, whereas a packet-based API is still in early stages of discussion with challenges that may still be unknown.

My other non blocking comment is API shape (keeping the feature set as it is, meaning let's keep it to dedicated worker in this version):

  • Evaluate transfer-based vs. script-transform-like API

As I commented on w3c/webrtc-encoded-transform#211 (comment), there are two different API proposals. The first one is well described in the explainer. The second is a script transform like API, something like:

[Exposed=Window]
interface RTCRtpSenderEncodedSource {
    constructor(Worker worker, optional any options, optional sequence<object> transfer)
};
[Exposed=DedicatedWorker]
interface RTCRtpSenderEncodedSourceController {
    WritableStream writable; // Or enqueue method.
    // Need congestion API, error API and enqueue API
};
partial interface DedicatedWorkerGlobalScope {
    attribute EventHandler onrtcencodedsource;
}

From a user perspective, there is probably no difference. There are probably some differences in terms of future-proofness, ease of use, ease of spec/dev...

Something I like about this alternative shape is the consistency with webrc-encoded-transform, which will give us the opportunity to evolve both APIs in a similar manner. For example, we can define the congestion API in the same manner for both. It also addresses your concern about transferability.
I'm fine with proceeding with this shape if you and @jan-ivar are also OK with it.

@jan-ivar
Copy link
Member

jan-ivar commented Mar 20, 2024

I understand we got here incrementally, but I don't see why JS is needed when we can add first-class support for fanout. Is the source of an RTCRtpEncodedSource always RTCRtpReceiver? If so, why not skip the JS for this WebIDL:

Promise<undefined> replaceTrack((MediaStreamTrack or RTCRtpReceiver)? withTrackOrReceiver);

Apps could then declare encoded-level forwarding explicitly:

recvPc.ontrack = async ({receiver}) => await relayPc.getSenders()[0].replaceTrack(receiver);

The app can fail-over using replaceTrack to a different RTCRtpReceiver.

@youennf
Copy link
Contributor

youennf commented Mar 21, 2024

I understand we got here incrementally, but I don't see why JS is needed when we can add first-class support for fanout.

AIUI, the desire is for the web page to act a little bit like an SFU. I do not think the UA will be able to implement everything that a web page could do, for instance:

  • Handle the case where there is bandwidth limitation (the receiver can try to do some adaptation, say if the receiver is L1T2, it might only send L1T1, switch from source...).
  • Handle the case where the receiver gets redundant frames from multiple paths and will forward from one of this path on a per frame basis. I think this use case was presented but @guidou probably knows more.
  • Handle recovery cases, like packet loss or PLI/FIR (which of the multiple potential receivers to ask for the key frame, what to do when waiting...).

@youennf
Copy link
Contributor

youennf commented Mar 21, 2024

I'm fine with proceeding with this shape if you and @jan-ivar are also OK with it.

This API shape seems indeed to have some benefits over the transfer based API, I would tend to proceed with this one.

@jan-ivar
Copy link
Member

AIUI, the desire is for the web page to act a little bit like an SFU.

Is the goal still to repurpose RTCRtpScriptTransform to run an SFU in JS?

For example, previous choices there like RTCEncodedVideoFrame being mutable and its serialization steps not copying the ArrayBuffer, was tailored to the simple frame modification use cases of encrypt/decrypt and add metadata.

How do we reconcile that SFU use cases seem better served by immutable RTCEncodedVideoFrames (per the tee problem? Does not a perfect fit mean not the right surface?

w3c/webrtc-encoded-transform#134 discusses creating one from another, but it's not clear how one would go from immutable to mutable without a copy.

@dontcallmedom-bot
Copy link

@alvestrand
Copy link
Collaborator

We can argue that mutable RTCEncodedVideoFrame was a design mistake at the time.
However, the prospect of dealing with two variants of RTCEncodedVideoFrame, one mutable and one not, does not exactly fill me with joy.
We should ensure that the implementation of RTCEncodedVideoFrame is performant for the non-mutating case, and users will get the point when they need the performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants