
Use referenceScaling instead of scalabilityMode for decodingInfo queries #182

Closed
drkron opened this issue Aug 17, 2021 · 51 comments · Fixed by #189

Comments

drkron (Contributor) commented Aug 17, 2021

During the implementation in Chrome I've come across an issue related to scalabilityMode.

Scalability mode specifies how the video stream should be encoded and puts a few requirements on the encoder. In theory, all decoders should support decoding of all valid streams regardless of which scalability mode was used (https://www.w3.org/TR/webrtc-svc/). In reality this is not always the case, and it may happen that a decoder is not able to decode a stream due to certain properties of the stream. All cases I know of where this happens are tied to reference scaling, which means that a frame of resolution A is used as a reference when decoding a frame of resolution B != A.

My proposal is therefore to add a boolean called referenceScaling to the dictionary VideoConfiguration.

referenceScaling would be an optional member that could be set when querying decodingInfo(). The member scalabilityMode remains in the dictionary but will only be allowed when querying encodingInfo.
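As a sketch (not spec text), the proposed query shape might look like this; the referenceScaling member is the proposal under discussion, and the contentType string and numbers are illustrative:

```javascript
// Hypothetical shape of the proposed decodingInfo() query. referenceScaling
// is the member proposed in this issue, not a shipped part of
// VideoConfiguration; contentType and the numeric values are illustrative.
function buildDecodingQuery(referenceScaling) {
  return {
    type: 'webrtc',
    video: {
      contentType: 'video/VP9',
      width: 1280,
      height: 720,
      bitrate: 2_000_000,
      framerate: 30,
      referenceScaling, // proposed optional boolean
    },
  };
}

// In a browser one would then call:
// navigator.mediaCapabilities.decodingInfo(buildDecodingQuery(true))
//   .then(({ supported, smooth, powerEfficient }) => { /* ... */ });
```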

drkron (Contributor, Author) commented Aug 17, 2021

Pinging @chcunningham, since he's an editor, and a few RTC experts representing different browsers to get their opinions on this proposal:
@jan-ivar @youennf @aboba @alvestrand

chcunningham (Contributor) commented

My first thought would be to instead add scalabilityMode to decodingInfo. This is hopefully easier to use and more future proof against a future where some new mode is introduced with features that break decoding support. WDYT?

aboba commented Aug 19, 2021

@chcunningham That's the approach taken in WebRTC-SVC where a decoder that can only decode a subset of scalabilityMode values can provide the list of the values it supports in scalabilityModes. Each peer figures out what they can send by computing the intersection of the scalabilityModes supported by their encoder and the peer decoder.

The only tricky part is interpreting the meaning when the decoder returns no scalabilityModes attribute. This means "all modes supported" when the encoder indicates that it supports at least one scalabilityMode value and "no modes supported" otherwise. This works as long as there are no implementations that support scalable encoding but not decoding. If that assumption seems like a bridge too far, we could add a "*" value to explicitly indicate support for all scalabilityMode values on the decoder.
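The intersection computation described here could be sketched as follows (function and parameter names are illustrative, not from the spec; the absent-attribute rule is the one described above):

```javascript
// Sketch: which modes can a sender use, given its encoder's scalabilityModes
// and the peer decoder's advertised scalabilityModes? An absent (undefined)
// decoder list means "all modes supported" when the peer's encoder reports
// at least one mode, and "no modes supported" otherwise.
function sendableModes(encoderModes, decoderModes, peerEncoderModes = []) {
  if (decoderModes === undefined) {
    return peerEncoderModes.length > 0 ? encoderModes.slice() : [];
  }
  return encoderModes.filter((mode) => decoderModes.includes(mode));
}
```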

DanilChapovalov commented

I do not understand what it means for a decoder to "support a certain scalability mode".

When there is scalability, there is filtering: the decoder sees only part of the stream. The ability to decode that part doesn't depend on what the full stream looked like before filtering.

For example, even if the encoder produces an L3T3 stream, a smart middlebox may detect that a client doesn't support reference scaling and forward just the lowest spatial layer to that client; that substream would be decodable. If the middlebox forwarded a different subpart of the same stream (e.g. two spatial layers) to the same decoder, it might not be decodable.

On the other hand, even without scalability it is possible to encode a stream with increasing resolution; such a stream wouldn't be decodable by a decoder that can't change resolution on the fly.

The inability to decode streams encoded with a certain scalability mode is a symptom; the root cause is a lack of reference scaling support.

drkron (Contributor, Author) commented Aug 23, 2021

It seems like there are two issues to resolve here:

  • Should the query for decoder support be based on reference scaling or scalability mode? I'm flexible, but I think Danil has a good point in that scalability mode best describes encoder properties and that something else, such as reference scaling, might be more suitable.
  • Should this be part of the video configuration that is sent to encodingInfo/decodingInfo, or part of the return value? I think it should be part of the video configuration, for the same reason that profile level is part of the video configuration. For example, although both VP9 profile 0 and profile 2 are supported by a client, HW decoding may be used for profile 0 but not for profile 2, so the return values for smooth and powerEfficient may not be the same. Similarly for reference scaling/scalability mode, HW decoding might be used if reference scaling is not required and SW decoding used otherwise.
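The profile comparison could be sketched like this (the decodingInfo parameter is a stand-in for navigator.mediaCapabilities.decodingInfo, and the codec strings are illustrative):

```javascript
// Sketch: two configurations that are both "supported" can still differ in
// smooth/powerEfficient, e.g. HW decode for VP9 profile 0 but SW decode for
// profile 2. The contentType codec strings below are illustrative.
async function compareProfiles(decodingInfo) {
  const base = { width: 1920, height: 1080, bitrate: 2_000_000, framerate: 30 };
  const profile0 = await decodingInfo({
    type: 'media-source',
    video: { ...base, contentType: 'video/webm; codecs="vp09.00.10.08"' },
  });
  const profile2 = await decodingInfo({
    type: 'media-source',
    video: { ...base, contentType: 'video/webm; codecs="vp09.02.10.10"' },
  });
  // Both may report supported=true while disagreeing on powerEfficient.
  return { profile0, profile2 };
}
```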

alvestrand commented

An encoder producing an L3T3 stream might encounter a middle box that reduces this to L3T1 (invented term - "no temporal scaling"), which would work for decoders that support reference scaling - or might choose to reduce it to L1T3 for decoders that don't support reference scaling but do support temporal scalability, or L1T1 for decoders that don't support either.

What this shows is that in the case of the smart middle box, it's the middle box that needs to know the receiver's capabilities; the sender doesn't need to know, and probably shouldn't know when the stream has several recipients. It's the job of the middle box to figure out what to tell the sender.

aboba commented Aug 23, 2021

In SFU scenarios, the SFU is indeed a "peer" for the SVC capability exchange, just as it participates in signaling.

drkron (Contributor, Author) commented Aug 23, 2021

Yes, I agree, but I'm not sure that changes anything. I'm thinking that the MediaCapabilities API could then be used to get the information that the middle box needs to decide what to forward, etc. So the question remains: should reference scaling be used, or should it be hidden behind which scalability modes are or aren't supported?

alvestrand commented Aug 24, 2021 via email

DanilChapovalov commented

The video producer might still need to take reference scaling into consideration.
If the video producer generates a non-scalable video stream where the next frame has a different resolution than the previous one, the SFM can't slice out a substream that doesn't use reference scaling.

alvestrand commented

Reference scaling support on the decoder may constrain the encoder's choice of modes, alternatively the SFU's choice of what to strip out.

If I look at the scalability mode dependency diagrams in https://www.w3.org/TR/webrtc-svc/#dependencydiagrams*, it seems that L2T1_KEY and L2T3_KEY_SHIFT are special in that they only require reference scaling at keyframes, but basically all LnT* modes with n > 1 seem to require reference scaling.

chcunningham (Contributor) commented

I think I agree that referenceScaling works to answer all the known questions, but I dislike how it introduces new vocabulary for describing the SVC capabilities. If we instead put scalabilityMode in VideoDecoderConfig, we get to re-use the existing definition, which seems simpler. Using scalabilityMode, callers can still determine what filtering, if any, is needed by initially calling with decoderConfig.scalabilityMode = encoderConfig.scalabilityMode and making a second call with decoderConfig.scalabilityMode = filteredScalabilityMode if needed. This is no worse than the two calls needed for referenceScaling = true followed by referenceScaling = false.
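The multi-call pattern just described might look like this (mcDecodingInfo is a stand-in for navigator.mediaCapabilities.decodingInfo, and the assumption that scalabilityMode is accepted in a decoding query is exactly what is being debated here; all names are illustrative):

```javascript
// Sketch: first ask about the full encoder mode; if unsupported, ask about
// the mode the SFU would filter down to. Returns the best supported mode,
// or null if none of the candidates is supported.
async function bestDecodableMode(mcDecodingInfo, baseVideo, candidateModes) {
  for (const scalabilityMode of candidateModes) {
    const { supported } = await mcDecodingInfo({
      type: 'webrtc',
      video: { ...baseVideo, scalabilityMode },
    });
    if (supported) return scalabilityMode;
  }
  return null;
}

// e.g. bestDecodableMode(info, video, ['L3T3', 'L1T3'])
```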

I'm also worried about whether referenceScaling is sufficient to address all future questions of decode support. Even now, are we certain that any device that supports decode with reference scaling in L2T1 can also support it in L2T3_KEY_SHIFT? Even if the answer is yes, I worry about a future where some new mode is introduced such that referenceScaling is no longer adequate to describe what makes/breaks decode support. With scalabilityMode, the decode and encode configs advance in sync.

alvestrand commented

We do have the future-looking problem for the decoders returning "" for their reference scaling modes (for codecs where a compliant decoder is currently able to decode any mode).

A list of modes has a bit of a safety edge over "referenceScaling"; if the drawbacks are known and acceptable, it is probably a viable way forward too. But: if we want to stick with a list of modes for decoders, we should just ban "", and insist that people enumerate all the modes they know about when they support "everything". This means that introducing new modes will not be beneficial until the decoders are upgraded, even if the new mode is decodable by any compliant decoder.

Costs on all sides.

chcunningham (Contributor) commented

We do have the future-looking problem for the decoders returning "" for their reference scaling modes (for codecs where a compliant decoder is currently able to decode any mode).

IIUC you're referring to the behavior @aboba mentioned above. I'd like to understand it better. Is this defined in https://w3c.github.io/webrtc-pc ? Does it have any impact on MediaCapabilities, or are we just talking about RTC's getCapabilities()?

A list of modes has a bit of a safety edge over "referenceScaling";

Note: for MC, I'm not proposing we list the modes (I agree w/ @drkron above; scalabilityMode should be part of the query). The model with MC is to ask about one thing at a time. Callers can walk the list with multiple calls.

DanilChapovalov commented

I still strongly think that the question "does the decoder support that scalability mode?" is the wrong one.
A decoder should support all compliant bitstreams, regardless of the encoder implementation or how exactly the encoder decided to arrange references.
The problem 'referenceScaling' proposes to solve is that there are too many VP9 decoders that do not support some compliant bitstreams.

Let me try to explain it with a different semi-theoretical example. Suppose we have a decoder that does not support decoding odd resolutions. If your application uses the resolution 320x180, then you may notice that the decoder fails with any scalability mode with 3 spatial layers, but succeeds with other scalability modes. However, it is wrong to conclude that the decoder doesn't support the L3T3 mode in this example. This decoder can decode a stream encoded in L3T3 mode at resolution 640x360, and would fail to decode a stream encoded in L1T3 mode at resolution 80x45.
The proper way to work around this bug would be to have some kind of 'supportsOddResolution' flag.

Same goes for 'referenceScaling'. While there is a correlation between this missing feature and supported scalability modes, it is incorrect to say the decoder doesn't support certain scalability modes.

chcunningham (Contributor) commented

The proper way to work around this bug would be to have some kind of 'supportsOddResolution' flag.

I agree that resolution is not a property of scalabilityMode. This particular example could be handled today by returning support=false whenever config.width or config.height is odd.

Same goes for 'referenceScaling'. While there is a correlation between this missing feature and supported scalability modes, it is incorrect to say the decoder doesn't support certain scalability modes.

My understanding is that reference scaling (unlike resolution) is a property of certain scalability modes. Concretely, if referenceScaling = false => scalabilityMode = LN* for N > 1 is not supported. Can you give an example of where this is wrong?
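The implication stated here can be sketched as a helper (illustrative only; it parses the spatial-layer count from webrtc-svc mode names, and simulcast 'S' modes, which don't use inter-layer prediction, fall out as false):

```javascript
// Sketch: per the claim above, a named mode LnT* with n > 1 needs reference
// scaling (the _KEY variants only at key frames, but still more than none).
// Later comments in the thread point out cases this heuristic can't cover.
function requiresReferenceScaling(scalabilityMode) {
  const match = /^L(\d+)T\d+/.exec(scalabilityMode);
  return match !== null && Number(match[1]) > 1;
}
```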

DanilChapovalov commented

The proper way to work around this bug would be to have some kind of 'supportsOddResolution' flag.

I agree that resolution is not a property of scalabilityMode. This particular example could be handled today by returning support=false whenever config.width or config.height is odd.

I doubt it can be done that easily: the requested resolution would be 320x180, which is even, but with 3 spatial layers each halving the resolution, the lowest resolution would be 80x45, which is a bit odd.
The same resolution would be decodable if the encoder chose to encode it differently (e.g. not reducing resolution strictly by half, but using 80x46 for the lowest spatial layer; it can do that if it knows that one of the decoders doesn't support odd resolutions).

Same goes for 'referenceScaling'. While there is a correlation between this missing feature and supported scalability modes, it is incorrect to say the decoder doesn't support certain scalability modes.

My understanding is that reference scaling (unlike resolution) is a property of certain scalability modes. Concretely, if referenceScaling = false => scalabilityMode = LN* for N > 1 is not supported. Can you give an example of where this is wrong?

When there is no scalability, there is no requirement that the resolution stay the same, i.e. you can't conclude anything about referenceScaling when there is no scalability mode.

L3T3_KEY can be decoded with referenceScaling = false if the SFM chooses to forward the bottom layer only.
(If the SFM chooses to relay a different spatial layer, the resulting set of frames can't be described with a scalability mode, but it is still a valid bitstream.)

The structures webrtc uses for VP9 screenshare (not described in the webrtc-svc spec) use spatial layers as quality layers, i.e. the structure is similar to L3T1, but the resolution for all spatial layers is the same; referenceScaling = false.

Afaik, the decoders that have problems with scalability do support referenceScaling on key frames, but don't support increasing resolution on delta frames. Such a decoder can support any existing scalability mode, as long as the SFM or encoder avoids delta frames without temporal dependencies.
I.e. in that case it is not about picking a scalability mode, but about how to do switching between spatial layers.

chcunningham (Contributor) commented

@DanilChapovalov @drkron and I had a call this morning.

Afaik, the decoders that have problems with scalability do support referenceScaling on key frames, but don't support increasing resolution on delta frames. Such a decoder can support any existing scalability mode, as long as the SFM or encoder avoids delta frames without temporal dependencies.
I.e. in that case it is not about picking a scalability mode, but about how to do switching between spatial layers.

Bug details for ^this are here: http://crbug.com/1022359. The issue occurs when switching from an upper to a lower spatial layer, with the lower-layer frame being a delta frame that references a key frame from the higher layer. I finally grok the point @DanilChapovalov and @alvestrand were making. The SFU may filter down to produce a stream that is technically valid, but no longer matches any named scalabilityMode. My understanding is that codec specs give enough flexibility that we probably don't want to mint new scalabilityModes for every valid stream. I'm now leaning against using scalabilityMode for decoder configs. I don't want to give the impression that SFUs are restricted to just the named modes.

Having said all that, referenceScaling also doesn't work to describe that particular bug. The decoder does support reference scaling generally, but does not support this particular scenario. Saying referenceScaling = false would harm this receiver, excluding many common forms of SVC that it actually supports. What you really need for that bug is something like referenceScalingOnDeltaFrames. Yuck!

As a general guideline, MC should avoid describing bugs. The API will likely outlive most bugs, leaving warts that confuse developers down the road. Meet is already working around the above bug w/ server filtering logic that only switches resolution at key frames, and the long-term work to fix the ChromeOS decoder is still being tracked. Hence, I don't think that particular bug (http://crbug.com/1022359) motivates a change to MC.

We may still have a problem to solve, and referenceScaling may be the solution. @drkron noticed in local testing that his Mac would fall back to software decoding when SVC was received. The theory is that AVFoundation may not support reference scaling generally (even at key frames), and that this may be a common gap for other platform decoders (MediaFoundation, MediaCodec, ...). But they may still support SVC without reference scaling (i.e. just temporal scaling). He's looking into it further.

@youennf, @aboba: do you know the details of svc support on mac/windows?

drkron (Contributor, Author) commented Oct 4, 2021

Thanks for the summary @chcunningham!

I've done some tests now on both Windows and macOS with VP9 hardware decoding. The only type of stream with spatial scaling that I could produce was a k-SVC stream, or more specifically scalability mode L2T3_KEY. I was not able to decode this stream on either platform.

chcunningham (Contributor) commented

Thanks, sounds promising.

@youennf, @aboba: do you know the details of svc support on mac/windows?

Friendly ping. Can you say whether referenceScaling = true/false is well suited to describe the capabilities of Mac / Windows?

aboba commented Oct 6, 2021

@chcunningham Is this a problem in the Media Foundation VP9 decoder? If so, I would file a Chromium bug, with CC: to [email protected] (Steve Becker). He can pull in the Sigma team. Also, are you seeing a similar problem with AV1 decoders?

chcunningham (Contributor) commented

Thanks @aboba, I'll email him now.

@drkron can you confirm whether there are any lingering gaps in reference scaling support on ChromeOS (i.e. beyond the bug mentioned earlier)?

youennf commented Oct 12, 2021

Can you say whether referenceScaling = true/false is well suited to describe the capabilities of Mac / Windows?

I would think that on MacOS/iOS, referenceScaling=false is probably the most accurate choice as of now.

drkron (Contributor, Author) commented Oct 14, 2021

@drkron can you confirm whether there are any lingering gaps in reference scaling support on ChromeOS (i.e. beyond the bug mentioned earlier)?

I've asked around and the ChromeOS support for spatial scalability seems to be split into the following groups:

  • Not supported.
  • k-SVC supported, and probably general SVC, though this is not tested. However, resolution change at non-keyframes is not supported at the moment.

chcunningham (Contributor) commented

I would think that on MacOS/iOS, referenceScaling=false is probably the most accurate choice as of now.

Thanks @youennf. Just to note: offline you mentioned this is just for VP9. Do you know if MacOS/iOS has any plans to add this support? Is it even feasible?

I've asked around and the ChromeOS support for spatial scalability seems to be split into the following groups:

  • Not supported.
  • k-SVC supported, and probably general SVC, though this is not tested. However, resolution change at non-keyframes is not supported at the moment.

Thanks @drkron. Same question as above: can we fix this in software? I know we're tracking a fix for group 2 (albeit, without much urgency).

youennf commented Oct 19, 2021

Do you know if MacOS/iOS has any plans to add this support? Is it even feasible?

We did a WebRTC specific fix (https://bugs.webkit.org/show_bug.cgi?id=231071).
Getting more requests from web developers about adding HW support is always useful to prioritise this work.

drkron (Contributor, Author) commented Oct 20, 2021

To summarize the discussion, there seem to be three different levels of support for scalability mode on the decoder side:

  • No support.
  • k-SVC support.
  • General SVC support.

Although a few HW decoders might move from "No support" to "k-SVC support", I don't think we can expect all HW bugs to be solved or even classified; that's probably not even feasible. Falling back to SW decoding for k-SVC streams, which is what Chrome and now also Safari do, emphasizes the need for a field in VideoConfiguration to query on scalability mode for decoders as well, to get correct predictions.

Based on the three levels of support listed above, I propose that this issue is closed and the specification kept as is. That is, with an optional scalabilityMode field that can be used for querying both encoding and decoding info. The motivation for scalabilityMode is that it's better suited to distinguish between k-SVC support and general SVC support, and that it is an existing concept, whereas referenceScaling would introduce a new term.

chcunningham (Contributor) commented Oct 27, 2021

That is, with an optional scalabilityMode field that can be used for querying both encoding and decoding info.

I can support that. Saying referenceScaling = false in cases where kSVC is actually supported seems pretty harmful.

From earlier discussion, it's regrettable that scalabilityMode only allows for N predefined arrangements of layers, but allowing kSVC where supported is a higher priority. In practice I think callers can use heuristics like choosing the closest matching scalabilityMode, or we can mint new scalabilityMode values if needed.

chcunningham (Contributor) commented

For folks reading along (@aboba @DanilChapovalov @alvestrand @youennf), please yell if you have any objections / concerns / better ideas concerning my previous comment. I'd like to settle this discussion and unblock @drkron.

DanilChapovalov commented

I still think it is a bad idea to use scalability mode to describe partial decoding support, but I don't have new arguments.

Scalability mode is an existing concept for the encoder, but it is not defined for the decoding process, so to use it, one still has to define what it means to "support decoding a certain scalability mode". In particular, it would be nice to describe how it can be tested.

Even with such a definition, I still do not see how scalability mode can describe all the scenarios:
  • A video stream without scalability but with the resolution changed on a delta frame might still fail to decode.
  • A video stream with quality layers (spatial layers are used, but all have the same resolution), i.e. the screenshare use case, is not covered by the provided 3 scenarios, nor by any scalability mode from the webrtc-svc spec.
  • (nit: the provided 3 levels do not say whether temporal layering can be used. I do not see what kind of feature a realistic decoder might lack that temporal layering relies on. Is it safe to assume temporal layering can always be used?)

chcunningham (Contributor) commented

Re: definitions, my first thought is along the lines of: the decoder is capable of decoding a sequence of frames described by the given mode, ordered by time and dependencies.

WDYT? I think this definition is usable for encoding as well. Just s/decoder/encoder.

Re: scenarios, I agree that scalabilityMode does not cover those.

  • For resolution-change-on-delta-frame, my read is that this is not common or painful enough to warrant a permanent listing in a web capability API. It's tracked as a bug, which describes a possible fix and an active workaround.
  • For quality layers: are there existing methods of receiver capability detection? Or are receivers assumed to broadly support them? If not, perhaps we should mint new scalabilityModes?

I definitely hear your points; scalabilityMode isn't perfect by any stretch. Would you agree that it's better than referenceScaling for answering the critical questions (e.g. is k-svc supported)? Do you see a better way?

DanilChapovalov commented

I do not think the encoding and decoding definitions can be symmetric.
E.g. for K-SVC scalability modes, the decoder is not supposed to be able to decode all frames; it should be able to decode any valid subset of the frames. (In RTC there is no guarantee that all frames will be available during decoding: some frames are filtered by an SFM, some are dropped by the network and the receiver can detect that they are not required to continue decoding.)
Then there is a gray area when the encoder produces a stream while some layers are temporarily disabled, then re-enables those layers.
E.g. would these frames be considered a valid subset for the L2T1 scalability mode? I think they should be.

      O <- O
      V    V
KF <- O <- O

So rather than trying to define what it means for a decoder to "support a scalability mode" and then describe support levels with scalability modes, I think it is better to have new words for the different features that a decoder doesn't support and work directly on their definitions.
The "referenceScaling" idea suggested two levels of support, but it seems there is a desire for 3 levels of support.
Tbh I'm not entirely sure what those levels are ("supports decoding k-SVC" doesn't tell me what exactly is supported, which is probably my main problem; and I still have no idea how widely quality layers are unsupported).
One guess is that the features could be:
  • no spatial scalability (the decoder doesn't support several frames per picture/temporal unit)
  • no reference scaling (the decoder doesn't support predicting from a frame with a different resolution)
  • no midstream display resolution change (the decoder doesn't support changing the render resolution, aka changing the resolution of a displayed frame)
The decoder would then report the set of features it doesn't support.
To get a better understanding, it might be a good idea to test various decoders not with scalability modes, but with video streams that use the uncommon features listed above.

I do not think there are existing methods for detecting what receivers don't support. Afaik software decoders support all features, so as soon as a hardware decoder fails to decode, software fallback is used. Most applications do not use spatial scalability and so do not encounter these problems in the first place.

DanilChapovalov commented

To clarify the features above, consider the following examples (q = a frame encoded in QVGA resolution, O = a frame encoded in VGA resolution):

q <- q <- O <- ...

This structure (without scalability) requires prediction from a different resolution (2nd feature), but doesn't require several frames per temporal unit (1st feature).

O <- O
|    |
O <- O

This structure requires several frames per temporal unit (1st feature), but doesn't require prediction from a frame with a different resolution (2nd feature).

Support for the 3rd feature (change of display resolution) likely implies support for prediction from a different resolution (2nd feature). The k-SVC example below demonstrates that they are not the same.

O <- O
|
q <- q

This does require several frames per temporal unit (1st feature) and prediction from a different resolution (2nd feature), but doesn't require a change of the display resolution (3rd feature).

Full SVC with delayed spatial upswitch:

     O <- O
     |    |
q <- q <- q

Requires all 3 features above.

Temporal scalability:

   q    q
  /    /
q <- q <- q

Doesn't require any of the 3 features above.

chcunningham (Contributor) commented Nov 11, 2021

@DanilChapovalov @drkron spoke offline again.

We concluded that the earlier description of decoders that support just k-SVC is not quite right. While k-SVC is the most tested configuration, it is expected that such decoders actually support all manner of SVC. There are known bugs (ex) in that support, but so far none that deserve permanent documentation via MediaCapabilities.

So this leaves just two states for SVC support: true and false. That suggests that the decoder signal could be reduced to something like bool svc. This remedies @DanilChapovalov's concerns with scalabilityMode and gives UAs the info they need to set powerEfficient = true/false. @DanilChapovalov and @drkron indicated they could support something like this, but we didn't flesh out the details. Let's do that now...

bool svc would encompass both spatial (including resolution and quality) and temporal scalability. But noting @DanilChapovalov's earlier point: maybe it's unnecessary to have a signal for temporal scalability (it should just work?). So we're left with bool spatialScalability. Can we similarly assume that same-resolution spatial scalability (e.g. screen share quality layers) just works everywhere? If yes, we've come full circle back to bool referenceScaling.

chcunningham (Contributor) commented

@DanilChapovalov and @drkron indicated they could support something like this, but we didn't flesh out the details. Let's do that now...

Friendly ping for thoughts on proposals in my final paragraph.

drkron (Contributor, Author) commented Nov 23, 2021

If yes, we've come full circle back to bool referenceScaling.

I think you're right about this. I can prepare a PR unless there are objections from someone else?

chcunningham (Contributor) commented

So we're left with bool spatialScalability. Can we similarly assume that same-resolution spatial scalability (e.g. screen share quality layers) just works everywhere? If yes, we've come full circle back to bool referenceScaling.

This bit really needs confirmation from @DanilChapovalov.

DanilChapovalov commented

I would like to think that every hardware decoder supports quality layers, but I worry that it might not be true. I'm not aware of such decoders, though.
(For temporal layers I'm quite sure they are supported by all decoders.)
I'm OK with the current definition of referenceScaling. Lack of quality-layer support in decoders would then be treated as a bug.

chcunningham (Contributor) commented

Thanks. In summary, we now agree to drop scalabilityMode from the decoder configuration and instead use referenceScaling. The encoder configuration would still use scalabilityMode. @aboba @alvestrand - are y'all on board? If so, we'll send a PR.

(Aside: we should probably reorganize the various dictionaries to not inherit from each other, as the encode/decode divergence makes it clumsy.)

aboba commented Dec 2, 2021

@chcunningham What would we do in the case of H.264/AVC with temporal scalability? This is supported in WebCodecs today (both for encoding and decoding).

This isn't H.264/SVC, so spatial scalability can't be supported on the decoder side (e.g. the only modes would be 'L1T2' and 'L1T3'). So even if reference scaling is supported, that wouldn't imply support for spatial scalability. The same is true of VP8.

aboba commented Dec 5, 2021

@chcunningham I have submitted a PR to remove RTCRtpReceiver.getCapabilities():
w3c/webrtc-svc#54

In the PR, it is recommended to utilize MC, and to interpret referenceScaling as follows:

  • If referenceScaling has the value 'false', then the decoder cannot decode spatial scalability, but can be assumed to support all other scalability mode values that an encoder can encode.
  • Otherwise (referenceScaling has the value true or is absent), then the decoder can decode any scalability mode that the encoder can encode.

In the case of H.264/AVC with temporal scalability, we would presumably have referenceScaling set to false (since spatial scalability can't be supported).

Does this make sense?
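The interpretation proposed in the PR could be sketched as a helper (illustrative only; the mode parsing assumes webrtc-svc 'LnTm' naming):

```javascript
// Sketch of the two rules above: referenceScaling === false rules out
// spatial scalability (Ln with n > 1) but nothing else; true or absent
// rules out nothing the encoder can produce.
function decoderCanHandle(scalabilityMode, referenceScaling) {
  if (referenceScaling === false) {
    const match = /^L(\d+)/.exec(scalabilityMode);
    return !(match !== null && Number(match[1]) > 1);
  }
  return true; // true or absent: any mode the encoder can encode
}
```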

drkron (Contributor, Author) commented Dec 7, 2021

Thank you! That makes sense to me.

@chcunningham Regarding the need to reorganize the various dictionaries: what do you think of instead keeping them as is, but adding something like

  • "Scalability mode is only applicable to the MediaEncodingType webrtc."
  • "Reference scaling is only applicable to the MediaDecodingType webrtc."

in the description of each field. This way a lot of duplication is avoided. Or do you still think it's clumsy?

aboba commented Dec 8, 2021

Why would scalabilityMode and referenceScaling only apply to WebRTC? I could see this info being of use with WebCodecs and some transport (e.g. RTCDataChannel or WebTransport).

@drkron

drkron commented Dec 8, 2021

Why would scalabilityMode and referenceScaling only apply to WebRTC? I could see this info being of use with WebCodecs and some transport (e.g. RTCDataChannel or WebTransport).

This may be a misunderstanding from my side, but I thought that the scalability modes that we use here are the ones defined in https://www.w3.org/TR/webrtc-svc/, which seems to be targeted towards WebRTC?

If it makes sense for the types "file", "media-source", and "record", we could of course have something like:

  • "Scalability mode is only applicable to MediaEncodingConfiguration".
  • "Reference scaling is only applicable to MediaDecodingConfiguration."

@chcunningham

chcunningham commented Dec 9, 2021

@aboba and I met today. Recording our conclusions and some new questions.

@chcunningham What would we do in the case of H.264/AVC with temporal scalability? This is supported in WebCodecs today (both for encoding and decoding).

The current plan is to assume temporal scalability is always supported, irrespective of reference scaling.

@chcunningham I have submitted a PR to remove RTCRtpReceiver.getCapabilities():
w3c/webrtc-svc#54

Thanks. Generally looks good, but then your example made me wonder: how does the app learn of SFM recv capabilities now? I hadn't considered this before.

  • If referenceScaling has the value 'false', then the decoder cannot decode spatial scalability, but can be assumed to support all other scalability mode values that an encoder can encode.
  • Otherwise (referenceScaling has the value true or is absent), then the decoder can decode any scalability mode that the encoder can encode.

Generally the meanings above sound correct, but we should swap the default value of referenceScaling to false for backward compatibility. Defaulting to true could change resulting MediaCapabilitiesInfo values when compared to today's results for a given combination of inputs.

One other thing is that the language above should be modified to make it clear that we're talking about the track (content) rather than the decoder. This improves consistency w/ existing MC semantics and avoids constraining decoder selection for referenceScaling = false (i.e. fine to choose a decoder that does support reference scaling even if it isn't used by the content).

@chcunningham Regarding the need to reorganize the various dictionaries. What do you think of instead keeping it as is but having something like

  • "Scalability mode is only applicable to the MediaEncodingType webrtc."
  • "Reference scaling is only applicable to the MediaDecodingType webrtc."

Thinking on it more, keeping the inheritance is ok. I did a quick pass on member:context validity just now (#187) and it's complicated enough that we shouldn't try to further separate the dictionaries. We should have some validity checks though. Please weigh in on that issue.

To answer this specific question, I agree scalabilityMode is only desired for webrtc for now. It does make sense for WebCodecs too, but that's not currently part of MC. For referenceScaling, my first thought is to allow this for all of file/media-source/webrtc, as it is technically possible outside of webRTC (even if it's not used much in practice).
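Combining the conclusions above (temporal scalability assumed always supported, referenceScaling defaulting to false for backward compatibility), an application might interpret a pair of query results roughly as follows. This is a hedged sketch: the selectModes helper and the mode lists are illustrative, not part of any spec:

```javascript
// Hypothetical helper: given the results of two decodingInfo() queries
// (one with referenceScaling: false, one with referenceScaling: true),
// pick which scalability modes to offer the sender.
function selectModes(supportsWithoutRefScaling, supportsWithRefScaling) {
  // Temporal-only modes: assumed decodable whenever the codec itself is.
  const temporalOnly = ['L1T1', 'L1T2', 'L1T3'];
  // Spatial modes: require reference scaling support on the decoder side.
  const spatial = ['L2T1', 'L2T2', 'L2T3', 'L3T1', 'L3T2', 'L3T3'];
  if (!supportsWithoutRefScaling) return []; // codec not decodable at all
  return supportsWithRefScaling ? temporalOnly.concat(spatial) : temporalOnly;
}
```

This mirrors the quoted interpretation: referenceScaling: false support rules out spatial modes only, not temporal ones.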

@chcunningham

  • If referenceScaling has the value 'false', then the decoder cannot decode spatial scalability, but can be assumed to support all other scalability mode values that an encoder can encode.

Forgot to note one other thing: here @aboba is using the term spatial scalability to mean scalability modes that change the resolution. For me this is the intuitive meaning, but just highlighting it because earlier in this thread there was some discussion of "spatial" scalability for quality layers where resolution remains fixed. For the purposes of that PR, such scalability is not "spatial". Just FYI for folks following along.

@aboba

aboba commented Dec 9, 2021

Submitted PR w3c/webrtc-svc#56 to reflect Chris's guidance on default behavior.

@chcunningham

Thanks, will take a look tomorrow. @aboba did you see this question in my wall of text above?

but then your example made me wonder: how does the app learn of SFM recv capabilities now? I hadn't considered this before.

@aboba

aboba commented Dec 9, 2021

The SFM can send the info on codecs/modes it can receive in the format that would have been used by RTCRtpReceiver.getCapabilities(). The intersection of the browser's RTCRtpSender.getCapabilities(kind) and the SFM's simulated RTCRtpReceiver.getCapabilities(kind) is what the browser can send to the SFM. Since the SFM typically isn't a browser, removing support for the RTCRtpReceiver.getCapabilities() method in the browser doesn't impact the SFM.
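As a sketch of that intersection step (the object shapes and the intersectCodecs helper are illustrative; real matching would also compare sdpFmtpLine, clockRate, and so on, not just mimeType):

```javascript
// Intersect the browser's send capabilities with the receive capabilities
// the SFM advertises over signaling (same shape the removed
// RTCRtpReceiver.getCapabilities() would have returned).
function intersectCodecs(senderCaps, sfmRecvCaps) {
  const recvMimes = new Set(
    sfmRecvCaps.codecs.map((c) => c.mimeType.toLowerCase())
  );
  return senderCaps.codecs.filter((c) =>
    recvMimes.has(c.mimeType.toLowerCase())
  );
}

// Usage with illustrative capability objects:
const senderCaps = {
  codecs: [
    { mimeType: 'video/VP8' },
    { mimeType: 'video/VP9' },
    { mimeType: 'video/H264' },
  ],
};
const sfmRecvCaps = {
  codecs: [{ mimeType: 'video/VP9' }, { mimeType: 'video/H264' }],
};
const sendable = intersectCodecs(senderCaps, sfmRecvCaps);
// sendable holds the VP9 and H264 entries only
```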

@chcunningham

chcunningham commented Dec 20, 2021

Thanks @aboba .

Re-reading my comments above triggers one last bikeshed:
Should we call this spatialScalability instead of referenceScaling to reduce confusion?

I think this is less ambiguous. "Scaling" in SVC refers generally to any of "spatial", "temporal" and/or "quality" scaling... and we only want to describe spatial.

On the other hand, folks may point out that reference scaling can be used outside of SVC. In practice I think that's rare enough that using SVC vernacular to name this member is still a fine call.

@drkron @DanilChapovalov

@DanilChapovalov

I like 'spatialScalability' better than 'referenceScaling'.
Even outside SVC, referencing a frame of a different resolution can be seen as 'rescaling' the referred frame.
However, simulcast can also be seen as 'spatial scalability'; it's the job of the definition/detailed description to capture all the nuances.
