From 05fd5af890919fb864e983a044d2584a1425350b Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Mon, 20 Sep 2021 00:58:07 +0100
Subject: [PATCH 01/24] MSC3401: Native Group VoIP Signalling

---
 proposals/3401-group-voip.md | 263 +++++++++++++++++++++++++++++++++++
 1 file changed, 263 insertions(+)
 create mode 100644 proposals/3401-group-voip.md
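
Note on the "Call setup" rules added by this patch: the focus-selection
bullets can be read as a small decision procedure. The following is a minimal
sketch of one possible reading, not a definitive implementation. The function
name `choose_foci`, its parameters, and the "empty list means full mesh"
convention are assumptions made for illustration only. Racing the returned
candidates by latency and retrying when a focus rejects us are left to the
caller.

```python
from collections import Counter


def choose_foci(own_sfu, call_foci, member_foci, same_server_in_call):
    """Return foci to try, in order; an empty list means connect full mesh.

    own_sfu: our configured SFU user ID, or None (hypothetical parameter).
    call_foci: `m.foci` listed on the call's state event.
    member_foci: mapping of participant user ID to the `m.foci` they publish.
    same_server_in_call: True if another user from our homeserver is in the call.
    """
    # Count how many participants already use each focus.
    in_use = Counter(f for foci in member_foci.values() for f in foci)

    if own_sfu is not None:
        no_other_foci = not call_foci and not (set(in_use) - {own_sfu})
        if own_sfu in in_use or same_server_in_call or no_other_foci:
            return [own_sfu]
        # Otherwise save bandwidth on our SFU and behave as if we had none.

    if call_foci:
        # The caller is expected to race these and keep the first to answer.
        return list(call_foci)

    # Most populous first, so an overloaded focus can be skipped for the next.
    return [focus for focus, _count in in_use.most_common()]
```

For example, with no SFU configured and no `m.foci` on the call event, a room
where two participants use one focus and one participant uses another yields
the busier focus first.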

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
new file mode 100644
index 00000000000..31ad1e8230b
--- /dev/null
+++ b/proposals/3401-group-voip.md
@@ -0,0 +1,263 @@
+# MSC3401: Native Group VoIP signalling
+
+## Problem
+
+VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room.
+This has some limitations, especially if you try to broaden the approach to multiparty VoIP calls:
+
+ * VoIP signalling can generate a lot of events as candidates are incrementally discovered, and for rapid call setup these need to be relayed as rapidly as possible.
+   * Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
+   * Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
+ * VoIP signalling leak IP addresses.  There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
+ * Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.
+
+Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.
+
+## Proposal
+
+This proposal provides a signalling framework using to-device messages which can be applied to native Matrix 1:1 calls, full-mesh calls, SFU calls, cascaded SFU calls and in future MCU calls, and hybrid SFU/MCU approaches. It replaces the early flawed sketch at [MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).
+
+This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.
+
+Diagrammatically, this looks like:
+
+1:1:
+```
+          A -------- B
+```
+
+Full mesh between clients
+```
+          A -------- B
+           \       /
+            \     /
+             \   /
+              \ /
+               C
+```
+
+SFU (aka Focus):
+```
+          A __    __ B
+              \  /   
+               F 
+               | 
+               |
+               C
+```
+
+Cascaded decentralised SFU:
+```
+     A1 --.           .-- B1
+     A2 ---Fa ----- Fb--- B2
+           \       /
+            \     /
+             \   /
+              \ /
+               Fc
+              |  |
+             C1  C2
+```
+
+### m.conf state event
+
+The user who wants to initiate a call sends a `m.conf` state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using `m.terminated`). Its body has the following fields:
+
+ * `m.intent` to describe the intended UX for handling the call.  One of:
+     * `m.ring` if the call is meant to cause the room participants devices to ring (e.g. 1:1 call or group call)
+     * `m.conference` is the call should be presented as a conference call which users in the room may connect to
+     * `m.immersive` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
+ * `m.type` to say whether the type of call is voice only (`m.voice`) or video (`m.video`)
+ * `m.conf_id` as a unique identifier for the current ongoing call.  (We can't use the event ID, given `m.type` is mutable.  However, does this risk users causing problems with maliciously colliding IDs?).
+ * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (do we need a duration, or can we figure that out from the previous state event?)
+ * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
+ * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they spot they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
+ * State key must be blank.
+
+For instance:
+
+```json
+{
+    "type": "m.conf",
+    "state_key": "",
+    "content": {
+        "m.intent": "m.immersive",
+        "m.type": "m.voice",
+        "m.conf_id": "cvsiu2893",
+        "m.name": "Voice room",
+        "m.foci": [
+            "@sfu-lon:matrix.org",
+            "@sfu-nyc:matrix.org",
+        ],
+    }
+}
+```
+
+We mandate at most one call per room at any given point to avoid UX nightmares - if you want the user to participate in multiple parallel calls, you should simply create multiple rooms, each with one call.
+
+### Call participation
+
+Users who want to participate in the call declare this by adding an `m.conf` field to their `m.room.member` state event.  Ideally, we'd use a dedicated state event type for this, making it easier to rapidly spot who is in a conference.  But given we don't want other people editing our state event and Matrix doesn't yet provide that level of access control, instead we (ab)use the `m.room.member` event to declare our participation in the conference in the context of the room.  Therefore any profile updates need to be careful to preserve the `m.conf` field.
+
+The fields within the `m.conf` field are:
+
+ * `m.conf_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match the current `m.conf` event, it should be ignored.
+ * `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
+ * `m.sources` - Optionally, the user can list the various combinations of media streams they are able to send.  This is important if connecting to an SFU, as it lets the SFU know what simulcast resolutions the sender can send.  In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.
+
+For instance:
+
+```json
+{
+    "type": "m.room.member",
+    "state_key": "@matthew:matrix.org",
+    "content": {
+        "avatar_url": "mxc://matrix.org/oUxxDyzQOHdVDMxgwFzyCWEe",
+        "displayname": "Matthew",
+        "membership": "join"
+        "m.conf": {
+            "m.conf_id": "cvsiu2893",
+            "m.foci": [
+                "@sfu-lon:matrix.org",
+                "@sfu-nyc:matrix.org",
+            ],
+            "m.sources": [
+                {
+                    "id": "qegwy64121wqw",
+                    "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
+                    "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
+                    "voice": [
+                        { "id": "zbhsbdhwe", "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
+                    ],
+                    "video": [
+                        { "id": "zbhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
+                        { "id": "zbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
+                    ],
+                    "mosaic": {}, // for composited video streams?
+                },
+                {
+                    "id": "suigv372y8378",
+                    "name": "Screenshare", // optional
+                    "device_id": "ASDUHDGFYUW",
+                    "video": [
+                        { "id": "xhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
+                        { "id": "xbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
+                    ]
+                },
+            ]
+        }
+    }
+}
+```
+
+XXX: properly specify the formats here (webrtc constraints perhaps)?  
+
+It's acceptable to advertise rigid formats here rather than dynamically negotiating resolution, bitrate etc, as in a group call we should just pick plausible desirable formats rather than try to please everyone.
+
+If a device loses connectivity, it is not particularly problematic that the membership data will be stale: all that will happen is that calls to the disconnected device will fail due to media or data-channel keepalive timeouts, and then subsequent attempts to call that device will fail.  Therefore (unlike the earlier demos) we don't need to spot timeouts by constantly re-posting the state event.
+
+### Call setup
+
+Call setup then uses the normal `m.call.*` events, except they are sent over to-device messages to the relevant devices (encrypted via Olm).  This means:
+
+ * When initiating a 1:1 call, the `m.call.invite` is sent to `*` devices of the intended target user.
+     * Once the user answers the call from the device, the sender should rescind the other pending to-device messages, ensuring that other devices don't get spammed about long-obsolete 1:1 calls.  XXX: We will need a way to rescind pending to-device msgs.
+     * Subsequent candidates and other events are sent only to the device who answered.
+     * XXX: do we still need MSC2746's `party_id` and `m.call.select_answer`?
+ * We will need to include the `m.conf_id` so that peers can map the call to the right room.
+ * However, especially for 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.conf` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
+ * When initiating a group call, we need to decide which devices to actually talk to.
+     * If the client has no SFU configured, we try to use the `m.foci` in the `m.conf` event.
+         * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
+         * If there are no `m.foci` in the `m.conf` event, then we look at which foci in `m.room.member` that are already in use by existing participants, and select the most common one.  (If the foci is overloaded it can reject us and we should then try the next most populous one, etc).
+         * If there are no `m.foci` in the `m.room.member`, then we connect full mesh.
+         * If subsequently `m.foci` are introduced into the conference, then we should transfer the call to them (effectively doing a 1:1->group call upgrade).
+     * If the client does have an SFU configured, then we decide whether to use it. 
+         * If other conf participants are already using it, then we use it.
+         * If there are other users from our homeserver in the conference, then we use it (as presumably they should be using it too)
+         * If there are no other `m.foci` (either in the `m.conf` or in the participant state) then we use it.
+         * Otherwise, we save bandwidth on our SFU by not cascading and instead behaving as if we had no SFU configured.
+
+TODO: spec how clients discover their homeserver's preferred SFU foci
+
+Originally this proposal suggested that foci should be identified by their `(user_id, device_id)` rather than just their user_id, in order to ensure convergence on the same device.  In practice, this is unnecessary complication if we make it the SFU implementor's problem to ensure that either only one device is logged in per SFU user - or instead you cluster the SFU devices together for the same user.  It's important to note that when calling an SFU you should call `*` devices.
+
+### SFU control
+
+SFUs are Selective Forwarding Units - servers which forward WebRTC streams between peers (which could be clients or SFUs or both).  To make use of them effectively, peers need to be able to tell the SFU which streams they want to receive, and the SFU must tell the peers which streams it wants to be sent.  We also need a way of telling SFUs which other SFUs to connect ("cascade") to.
+
+The client does this by establishing an optional datachannel connection to the SFU, in order to perform low-latency signalling to rapidly select streams.
+
+To select a stream over this channel, the peer sends:
+
+```json
+{
+    "op": "select",
+    "conf_id": "cvsiu2893",
+    "start": [
+        "zbhsbdhwe",
+        "zbhsbdhzs",
+    ],
+    "stop": [
+        "zbhsbdhxz",
+    ]    
+}
+```
+
+Rather than sending arrays one can send `"all"` to either `start` or `stop` to start or stop all streams.
+
+All streams are sent within a single media session (rather than us having multiple sessions or calls), and there is no difference between a peer sending simulcast streams from a webcam versus two SFUs trunking together.
+
+If no DC is established, then 1:1 calls should send all streams without prompting, and SFUs should send no streams by default.
+
+If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the fullmesh of SFU traffic (if any).  One option here is for SFUs to act as an AS and sniff the `m.room.member` traffic of their associated server, and automatically call any other `m.foci` which appear.  (They don't need to make outbound calls to clients, as clients always dial in).  Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs.  Much better to trust the server.
+
+Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.conf` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.conf`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
+
+Finally, the DC transport is also used to detect connectivity timeouts more rapidly than webrtc's media timeout would allow, while avoiding clogging up the homeserver with keepalive traffic. This is done by each side sending a `"op": "ping"` packet every few seconds, and timing out the call if an `"op": "pong"` doesn't arrive within 5 seconds.
+
+XXX: define how these DC messages mux with other traffic, and consider what message encoding to actually use.
+
+TODO: spell out how this works with active speaker detection & associated signalling
+
+## Encryption
+
+We get E2EE for 1:1 and full mesh calls automatically in this model.
+
+However, when SFUs are on the media path, the SFU will necessarily terminate the SRTP traffic from the peer, breaking E2EE.  To address this, we apply an additional end-to-end layer of encryption to the media using [WebRTC Encoded Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md) (formerly Insertable Streams) via [SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).
+
+In order to provide PFS, The symmetric key used for these stream from a given participating device is a megolm key. Unlike a normal megolm key, this is shared via `m.room_key` over Olm to the devices participating in the conference including an `m.conf_id` field on the key to correlate it to the conference traffic, rather than using the `session_id` event field to correlate (given the encrypted traffic is SRTP rather than events, and we don't want to have to send fake events from all senders every time the megolm session is replaced).
+
+The megolm key is ratcheted forward for every SFrame, and shared with new participants at the current index via `m.room_key` over Olm as per above.  When participants leave, a new megolm session is created and shared with all participants over Olm.  The new session is only used once all participants have received it.
+
+## Potential issues
+
+To-device messages are point-to-point between servers, whereas today's `m.call.*` messages can transitively traverse servers via the room DAG, thus working around federation problems.  In practice if you are relying on that behaviour, you're already in a bad place.
+
+The SFUs participating in a conference end up in a full mesh.  Rather than inventing our own spanning-tree system for SFUs however, we should fix it for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or similar to decide what better-than-full-mesh topology to use.  In practice, full mesh cascade between SFUs is probably not that bad (especially if SFUs only request the streams over the trunk their clients care about) - and on aggregate will be less obnoxious than all the clients hitting a single SFU.
+
+SFrame mandates its own ratchet currently which is almost the same as megolm but not quite.  Switching it out for megolm seems reasonable right now (at least until MLS comes along)
+
+## Alternatives
+
+There are many many different ways to do this.  The main other alternative considered was not to use state events to track membership, but instead gossip it via either to-device or DC messages between participants.  This fell apart however due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC.  So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.
+
+Another option is to treat 1:1 (and full mesh) entirely differently to SFU based calling rather than trying to unify them.  Also, it's debatable whether supporting full mesh is useful at all.  In the end, it feels like unifying 1:1 and SFU calling is for the best though, as it then gives you the ability to trivially upgrade 1:1 calls to group calls and vice versa, and avoids maintaining two separate hunks of spec.  It also forces 1:1 calls to take multi-stream calls seriously, which is useful for more exotic capture devices (stereo cameras; 3D cameras; surround sound; audio fields etc).
+
+An alternative to to-device messages is to use DMs.  You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls.  It does make debugging easier though, rather than having to track encrypted ephemeral to-device msgs.
+
+## Security considerations
+
+Malicious users could try to DoS SFUs by specifying them as their foci.
+
+SFrame E2EE may go horribly wrong if we can't send the new megolm session fast enough to all the participants when a participant leaves (and meanwhile if we keep using the old session, we're technically leaking call media to the parted participant until we manage to rotate).
+
+Need to ensure there's no scope for media forwarding loops through SFUs.
+
+Malicious users in a room could try to sabotage a conference by overwriting the `m.conf` state event.
+
+Too many foci will chew bandwidth due to fullmesh between them.  In the worst case, if every user is on their own HS and picks a different focus, it degenerates to a fullmesh call (just serverside rather than clientside).  Hopefully this shouldn't happen as you will converge on using a single SFU with the most clients, but need to check how this works in practice.
+
+## Unstable prefix
+
+...
\ No newline at end of file

From 7f5ee49fd5424b91d521307c98dd67df380a7ca6 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Mon, 20 Sep 2021 01:03:07 +0100
Subject: [PATCH 02/24] comments & cosmetics

---
 proposals/3401-group-voip.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
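
Note on the `select` example whose fence this patch switches to `jsonc`: the
message shape can be produced with a small helper. This is a sketch under
stated assumptions. The helper name `build_select` is hypothetical, JSON is
used only because the proposal's example is JSON (the message encoding is
still an open XXX), and sending empty arrays rather than omitting the fields
is an assumption.

```python
import json


def build_select(conf_id, start=(), stop=()):
    """Encode a `select` op for the SFU datachannel.

    `start` and `stop` are each either a list of stream IDs or the
    string "all", as described in the proposal text.
    """
    def normalise(value):
        # Pass the literal "all" through; copy any other iterable to a list.
        return "all" if value == "all" else list(value)

    message = {
        "op": "select",
        "conf_id": conf_id,
        "start": normalise(start),
        "stop": normalise(stop),
    }
    return json.dumps(message)
```

Calling `build_select("cvsiu2893", ["zbhsbdhwe", "zbhsbdhzs"], ["zbhsbdhxz"])`
produces the same fields as the example in the proposal, without the trailing
commas that strict JSON parsers reject.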

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 31ad1e8230b..952d6258485 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -44,6 +44,8 @@ SFU (aka Focus):
                | 
                |
                C
+
+Where F is an SFU focus
 ```
 
 Cascaded decentralised SFU:
@@ -57,6 +59,8 @@ Cascaded decentralised SFU:
                Fc
               |  |
              C1  C2
+
+Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.
 ```
 
 ### m.conf state event
@@ -76,7 +80,7 @@ The user who wants to initiate a call sends a `m.conf` state event into the room
 
 For instance:
 
-```json
+```jsonc
 {
     "type": "m.conf",
     "state_key": "",
@@ -107,14 +111,14 @@ The fields within the `m.conf` field are:
 
 For instance:
 
-```json
+```jsonc
 {
     "type": "m.room.member",
     "state_key": "@matthew:matrix.org",
     "content": {
         "avatar_url": "mxc://matrix.org/oUxxDyzQOHdVDMxgwFzyCWEe",
         "displayname": "Matthew",
-        "membership": "join"
+        "membership": "join",
         "m.conf": {
             "m.conf_id": "cvsiu2893",
             "m.foci": [
@@ -190,7 +194,7 @@ The client does this by establishing an optional datachannel connection to the S
 
 To select a stream over this channel, the peer sends:
 
-```json
+```jsonc
 {
     "op": "select",
     "conf_id": "cvsiu2893",

From 083fd9a998e37485d8801509861190202b90a53a Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Mon, 20 Sep 2021 01:08:51 +0100
Subject: [PATCH 03/24] grammar

---
 proposals/3401-group-voip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 952d6258485..17a44233f26 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -8,7 +8,7 @@ This has some limitations, especially if you try to broaden the approach to mult
  * VoIP signalling can generate a lot of events as candidates are incrementally discovered, and for rapid call setup these need to be relayed as rapidly as possible.
    * Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
    * Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
- * VoIP signalling leak IP addresses.  There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
+ * VoIP signalling leaks IP addresses.  There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
  * Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.
 
 Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.

From 5ee96fb47c79e12c0a08d094a12634356b7d545f Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Thu, 23 Sep 2021 00:43:57 +0100
Subject: [PATCH 04/24] incorporate review

---
 proposals/3401-group-voip.md | 115 +++++++++++++++++------------------
 1 file changed, 56 insertions(+), 59 deletions(-)
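
Note on the state-key change in this patch: with the call ID moved into the
state key, a client has to pick which unterminated `m.call` event to display
and which `m.calls` entries to honour. The following is a minimal sketch, not
a definitive implementation. The function names are hypothetical, events are
assumed to be plain dicts in client-server API shape, and using
`origin_server_ts` as a stand-in for "most recently edited" is an assumption.

```python
def is_live(call_event):
    """A call is live while its content carries no `m.terminated` field."""
    return "m.terminated" not in call_event["content"]


def call_to_display(call_events):
    """Return the most recently edited live `m.call` event, or None."""
    live = [event for event in call_events if is_live(event)]
    # Assumption: origin_server_ts approximates "most recently edited".
    return max(live, key=lambda event: event["origin_server_ts"], default=None)


def honoured_memberships(member_event, call_events):
    """Keep only `m.calls` entries whose `m.call_id` names a live call."""
    live_ids = {event["state_key"] for event in call_events if is_live(event)}
    entries = member_event["content"].get("m.calls", [])
    return [entry for entry in entries if entry.get("m.call_id") in live_ids]
```

Entries that name a terminated or unknown call are dropped rather than
treated as errors, matching the "it should be ignored" wording.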

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 17a44233f26..bcb8ca9d333 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -63,31 +63,29 @@ Cascaded decentralised SFU:
 Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.
 ```
 
-### m.conf state event
+### m.call state event
 
-The user who wants to initiate a call sends a `m.conf` state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using `m.terminated`). Its body has the following fields:
+The user who wants to initiate a call sends a `m.call` state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using `m.terminated`). Its body has the following fields:
 
  * `m.intent` to describe the intended UX for handling the call.  One of:
      * `m.ring` if the call is meant to cause the room participants devices to ring (e.g. 1:1 call or group call)
      * `m.conference` is the call should be presented as a conference call which users in the room may connect to
-     * `m.immersive` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
- * `m.type` to say whether the type of call is voice only (`m.voice`) or video (`m.video`)
- * `m.conf_id` as a unique identifier for the current ongoing call.  (We can't use the event ID, given `m.type` is mutable.  However, does this risk users causing problems with maliciously colliding IDs?).
- * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (do we need a duration, or can we figure that out from the previous state event?)
+     * `m.room` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
+ * `m.type` to say whether the initial type of call is voice only (`m.voice`) or video (`m.video`).  This signals the intent of the user when placing the call to the participants (i.e. "I want to have a voice call with you" or "I want to have a video call with you"), warns the receiver whether they may be expected to view video, and lets clients provide suitable initial UX for displaying that type of call, even if it later gets upgraded to a video call.
+ * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (Do we need a duration, or can we figure that out from the previous state event?)
  * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
  * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they spot they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
- * State key must be blank.
+ * The state key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` are mutable.)  If there are multiple non-terminated call state events in the room, the client should display the most recently edited event.
 
 For instance:
 
 ```jsonc
 {
-    "type": "m.conf",
-    "state_key": "",
+    "type": "m.call",
+    "state_key": "cvsiu2893",
     "content": {
-        "m.intent": "m.immersive",
+        "m.intent": "m.room",
         "m.type": "m.voice",
-        "m.conf_id": "cvsiu2893",
         "m.name": "Voice room",
         "m.foci": [
             "@sfu-lon:matrix.org",
@@ -101,55 +99,54 @@ We mandate at most one call per room at any given point to avoid UX nightmares -
 
 ### Call participation
 
-Users who want to participate in the call declare this by adding an `m.conf` field to their `m.room.member` state event.  Ideally, we'd use a dedicated state event type for this, making it easier to rapidly spot who is in a conference.  But given we don't want other people editing our state event and Matrix doesn't yet provide that level of access control, instead we (ab)use the `m.room.member` event to declare our participation in the conference in the context of the room.  Therefore any profile updates need to be careful to preserve the `m.conf` field.
+Users who want to participate in the call declare this by publishing an `m.call.member` state event using their Matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains an `m.calls` array of objects describing which calls the user is participating in within that room.  This array must contain one item (for now).
 
-The fields within the `m.conf` field are:
+The fields within each item of the `m.calls` array are:
 
- * `m.conf_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match the current `m.conf` event, it should be ignored.
+ * `m.call_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match an unterminated `m.call` event, it should be ignored.
  * `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
- * `m.sources` - Optionally, the user can list the various combinations of media streams they are able to send.  This is important if connecting to an SFU, as it lets the SFU know what simulcast resolutions the sender can send.  In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.
+ * `m.sources` - Optionally, the user can list the various combinations of media streams they are able to send.  This is important if connecting to an SFU, as it lets the SFU know what simulcast resolutions the sender can send.  In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.  If the conference has no SFU, this list defines the devices which other devices should connect to full-mesh in order to participate.
 
 For instance:
 
 ```jsonc
 {
-    "type": "m.room.member",
+    "type": "m.call.member",
     "state_key": "@matthew:matrix.org",
     "content": {
-        "avatar_url": "mxc://matrix.org/oUxxDyzQOHdVDMxgwFzyCWEe",
-        "displayname": "Matthew",
-        "membership": "join",
-        "m.conf": {
-            "m.conf_id": "cvsiu2893",
-            "m.foci": [
-                "@sfu-lon:matrix.org",
-                "@sfu-nyc:matrix.org",
-            ],
-            "m.sources": [
-                {
-                    "id": "qegwy64121wqw",
-                    "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
-                    "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
-                    "voice": [
-                        { "id": "zbhsbdhwe", "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
-                    ],
-                    "video": [
-                        { "id": "zbhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
-                        { "id": "zbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
-                    ],
-                    "mosaic": {}, // for composited video streams?
-                },
-                {
-                    "id": "suigv372y8378",
-                    "name": "Screenshare", // optional
-                    "device_id": "ASDUHDGFYUW",
-                    "video": [
-                        { "id": "xhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
-                        { "id": "xbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
-                    ]
-                },
-            ]
-        }
+        "m.calls": [
+            {
+                "m.call_id": "cvsiu2893",
+                "m.foci": [
+                    "@sfu-lon:matrix.org",
+                    "@sfu-nyc:matrix.org",
+                ],
+                "m.sources": [
+                    {
+                        "id": "qegwy64121wqw",
+                        "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
+                        "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
+                        "voice": [
+                            { "id": "zbhsbdhwe", "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
+                        ],
+                        "video": [
+                            { "id": "zbhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
+                            { "id": "zbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
+                        ],
+                        "mosaic": {}, // for composited video streams?
+                    },
+                    {
+                        "id": "suigv372y8378",
+                        "name": "Screenshare", // optional
+                        "device_id": "ASDUHDGFYUW",
+                        "video": [
+                            { "id": "xhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
+                            { "id": "xbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
+                        ]
+                    },
+                ]
+            }
+        ]
     }
 }
 ```
@@ -168,18 +165,18 @@ Call setup then uses the normal `m.call.*` events, except they are sent over to-
      * Once the user answers the call from the device, the sender should rescind the other pending to-device messages, ensuring that other devices don't get spammed about long-obsolete 1:1 calls.  XXX: We will need a way to rescind pending to-device msgs.
      * Subsequent candidates and other events are sent only to the device who answered.
      * XXX: do we still need MSC2746's `party_id` and `m.call.select_answer`?
- * We will need to include the `m.conf_id` so that peers can map the call to the right room.
- * However, especially for 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.conf` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
+ * We will need to include the `m.call_id` and room_id so that peers can map the call to the right room.
+ * However, especially for 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
  * When initiating a group call, we need to decide which devices to actually talk to.
-     * If the client has no SFU configured, we try to use the `m.foci` in the `m.conf` event.
+     * If the client has no SFU configured, we try to use the `m.foci` in the `m.call` event.
          * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
-         * If there are no `m.foci` in the `m.conf` event, then we look at which foci in `m.room.member` that are already in use by existing participants, and select the most common one.  (If the foci is overloaded it can reject us and we should then try the next most populous one, etc).
+         * If there are no `m.foci` in the `m.call` event, then we look at which foci in `m.room.member` that are already in use by existing participants, and select the most common one.  (If the foci is overloaded it can reject us and we should then try the next most populous one, etc).
          * If there are no `m.foci` in the `m.room.member`, then we connect full mesh.
          * If subsequently `m.foci` are introduced into the conference, then we should transfer the call to them (effectively doing a 1:1->group call upgrade).
      * If the client does have an SFU configured, then we decide whether to use it. 
          * If other conf participants are already using it, then we use it.
          * If there are other users from our homeserver in the conference, then we use it (as presumably they should be using it too)
-         * If there are no other `m.foci` (either in the `m.conf` or in the participant state) then we use it.
+         * If there are no other `m.foci` (either in the `m.call` or in the participant state) then we use it.
          * Otherwise, we save bandwidth on our SFU by not cascading and instead behaving as if we had no SFU configured.
 
 TODO: spec how clients discover their homeserver's preferred SFU foci
@@ -190,7 +187,7 @@ Originally this proposal suggested that foci should be identified by their `(use
 
 SFUs are Selective Forwarding Units - a server which forwarding WebRTC streams between peers (which could be clients or SFUs or both).  To make use of them effectively, peers need to be able to tell the SFU which streams they want to receive, and the SFU must tell the peers which streams it wants to be sent.  We also need a way of telling SFUs which other SFUs to connect ("cascade") to.
 
-The client does this by establishing an optional datachannel connection to the SFU, in order to perform low-latency signalling to rapidly select streams.
+The client does this by establishing an optional datachannel connection to the SFU using normal `m.call.invite`, in order to perform low-latency signalling to rapidly select streams.
 
 To select a stream over this channel, the peer sends:
 
@@ -216,7 +213,7 @@ If no DC is established, then 1:1 calls should send all streams without promptin
 
 If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the fullmesh of SFU traffic (if any).  One option here is for SFUs to act as an AS and sniff the `m.room.member` traffic of their associated server, and automatically call any other `m.foci` which appear.  (They don't need to make outbound calls to clients, as clients always dial in).  Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs.  Much better to trust the server.
 
-Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.conf` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.conf`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
+Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.call` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.call`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
 
 Finally, the DC transport is also used to detect connectivity timeouts more rapidly than webrtc's media timeout would allow, while avoiding clogging up the homeserver with keepalive traffic. This is done by each side sending a `"op": "ping"` packet every few seconds, and timing out the call if an `"op": "pong"` doesn't arrive within 5 seconds.
 
@@ -230,7 +227,7 @@ We get E2EE for 1:1 and full mesh calls automatically in this model.
 
 However, when SFUs are on the media path, the SFU will necessarily terminate the SRTP traffic from the peer, breaking E2EE.  To address this, we apply an additional end-to-end layer of encryption to the media using [WebRTC Encoded Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md) (formerly Insertable Streams) via [SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).
 
-In order to provide PFS, The symmetric key used for these stream from a given participating device is a megolm key. Unlike a normal megolm key, this is shared via `m.room_key` over Olm to the devices participating in the conference including an `m.conf_id` field on the key to correlate it to the conference traffic, rather than using the `session_id` event field to correlate (given the encrypted traffic is SRTP rather than events, and we don't want to have to send fake events from all senders every time the megolm session is replaced).
+In order to provide PFS, the symmetric key used for these streams from a given participating device is a megolm key. Unlike a normal megolm key, this is shared via `m.room_key` over Olm to the devices participating in the conference including `m.call_id` and `m.room_id` fields on the key to correlate it to the conference traffic, rather than using the `session_id` event field to correlate (given the encrypted traffic is SRTP rather than events, and we don't want to have to send fake events from all senders every time the megolm session is replaced).
 
 The megolm key is ratcheted forward for every SFrame, and shared with new participants at the current index via `m.room_key` over Olm as per above.  When participants leave, a new megolm session is created and shared with all participants over Olm.  The new session is only used once all participants have received it.
 
@@ -258,7 +255,7 @@ SFrame E2EE may go horribly wrong if we can't send the new megolm session fast e
 
 Need to ensure there's no scope for media forwarding loops through SFUs.
 
-Malicious users in a room could try to sabotage a conference by overwriting the `m.conf` state event.
+Malicious users in a room could try to sabotage a conference by overwriting the `m.call` state event of the current ongoing call.
 
 Too many foci will chew bandwidth due to fullmesh between them.  In the worst case, if every use is on their own HS and picks a different foci, it degenerates to a fullmesh call (just serverside rather than clientside).  Hopefully this shouldn't happen as you will converge on using a single SFU with the most clients, but need to check how this works in practice.
 

From b90b85eb0bfdd38bf0a17adb048deba2e8cfc557 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Thu, 23 Sep 2021 01:04:50 +0100
Subject: [PATCH 05/24] more feedback

---
 proposals/3401-group-voip.md | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)
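
Reviewer note (not part of the patch): the focus-selection rules for a client with no SFU of its own, as reworded in the hunk at -170 below, can be sketched roughly as follows. This is only an illustration under stated assumptions: `FocusPlan` and `planFociWithoutOwnSfu` are hypothetical names, the inputs are assumed to be the `m.foci` list from the `m.call` event and one `m.foci` list per participant from `m.call.member`, and the latency race itself is left to the caller.

```typescript
// Hypothetical types and names for illustration; the MSC does not define them.
type FocusPlan =
    | { kind: "race"; foci: string[] }     // dial all of these, keep the first to answer
    | { kind: "ordered"; foci: string[] }  // try in order, most populous focus first
    | { kind: "full-mesh" };               // no foci anywhere: connect full mesh

function planFociWithoutOwnSfu(callFoci: string[], memberFoci: string[][]): FocusPlan {
    // 1. Foci recommended in the `m.call` event win; the caller races them on latency.
    if (callFoci.length > 0) {
        return { kind: "race", foci: callFoci.slice() };
    }
    // 2. Otherwise count how many participants advertise each focus in `m.call.member`.
    const counts = new Map<string, number>();
    for (const foci of memberFoci) {
        // De-duplicate per participant so one member cannot inflate a focus's count.
        for (const focus of Array.from(new Set(foci))) {
            counts.set(focus, (counts.get(focus) || 0) + 1);
        }
    }
    // 3. Nobody is using a focus: fall back to full mesh.
    if (counts.size === 0) {
        return { kind: "full-mesh" };
    }
    // Most common focus first, so a rejection from an overloaded focus
    // just moves the client on to the next entry.
    const ordered = Array.from(counts.entries())
        .sort((a, b) => b[1] - a[1])
        .map((entry) => entry[0]);
    return { kind: "ordered", foci: ordered };
}
```

A client that does have its own SFU configured would apply the separate checks listed in the same hunk before falling back to this function.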

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index bcb8ca9d333..40b645ca9ba 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -69,12 +69,12 @@ The user who wants to initiate a call sends a `m.call` state event into the room
 
  * `m.intent` to describe the intended UX for handling the call.  One of:
      * `m.ring` if the call is meant to cause the room participants devices to ring (e.g. 1:1 call or group call)
-     * `m.conference` is the call should be presented as a conference call which users in the room may connect to
+     * `m.prompt` if the call should be presented as a conference call which users in the room are prompted to connect to
      * `m.room` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
  * `m.type` to say whether the initial type of call is voice only (`m.voice`) or video (`m.video`).  This signals the intent of the user when placing the call to the participants (i.e. "i want to have a voice call with you" or "i want to have a video call with you") and warns the receiver whether they may be expected to view video or not, and provide suitable initial UX for displaying that type of call... even if it later gets upgraded to a video call.
  * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).  
  * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
- * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they spot they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
+ * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
  * The State key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` is mutable).  If there are multiple non-termianted conf ID state events in the room, the client should display the most recently edited event.
 
 For instance:
@@ -126,7 +126,7 @@ For instance:
                         "id": "qegwy64121wqw",
                         "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
                         "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
-                        "voice": [
+                        "audio": [
                             { "id": "zbhsbdhwe", "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
                         ],
                         "video": [
@@ -170,14 +170,15 @@ Call setup then uses the normal `m.call.*` events, except they are sent over to-
  * When initiating a group call, we need to decide which devices to actually talk to.
      * If the client has no SFU configured, we try to use the `m.foci` in the `m.call` event.
          * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
-         * If there are no `m.foci` in the `m.call` event, then we look at which foci in `m.room.member` that are already in use by existing participants, and select the most common one.  (If the foci is overloaded it can reject us and we should then try the next most populous one, etc).
-         * If there are no `m.foci` in the `m.room.member`, then we connect full mesh.
+         * If there are no `m.foci` in the `m.call` event, then we look at which foci in `m.call.member` are already in use by existing participants, and select the most common one.  (If the focus is overloaded it can reject us and we should then try the next most populous one, etc).
+         * If there are no `m.foci` in the `m.call.member`, then we connect full mesh.
          * If subsequently `m.foci` are introduced into the conference, then we should transfer the call to them (effectively doing a 1:1->group call upgrade).
      * If the client does have an SFU configured, then we decide whether to use it. 
          * If other conf participants are already using it, then we use it.
          * If there are other users from our homeserver in the conference, then we use it (as presumably they should be using it too)
          * If there are no other `m.foci` (either in the `m.call` or in the participant state) then we use it.
          * Otherwise, we save bandwidth on our SFU by not cascading and instead behaving as if we had no SFU configured.
+ * We do not recommend that users utilise an SFU to hide behind for privacy, but instead use a TURN server, only providing relay candidates, rather than consuming SFU resources and unnecessarily mandating the presence of an SFU.
 
 TODO: spec how clients discover their homeserver's preferred SFU foci
 
@@ -205,13 +206,11 @@ To select a stream over this channel, the peer sends:
 }
 ```
 
-Rather than sending arrays one can send `"all"` to either `start` or `stop` to start or stop all streams.
-
 All streams are sent within a single media session (rather than us having multiple sessions or calls), and there is no difference between a peer sending simulcast streams from a webcam versus two SFUs trunking together.
 
 If no DC is established, then 1:1 calls should send all streams without prompting, and SFUs should send no streams by default.
 
-If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the fullmesh of SFU traffic (if any).  One option here is for SFUs to act as an AS and sniff the `m.room.member` traffic of their associated server, and automatically call any other `m.foci` which appear.  (They don't need to make outbound calls to clients, as clients always dial in).  Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs.  Much better to trust the server.
+If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the fullmesh of SFU traffic (if any).  One option here is for SFUs to act as an AS and sniff the `m.call.member` traffic of their associated server, and automatically call any other `m.foci` which appear.  (They don't need to make outbound calls to clients, as clients always dial in).  Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs.  Much better to trust the server.
 
 Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.call` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.call`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
 
@@ -249,6 +248,8 @@ An alternative to to-device messages is to use DMs.  You still risk gappy sync p
 
 ## Security considerations
 
+State events are not encrypted currently, and so this leaks that a call is happening, and who is participating in it, and from which devices.
+
 Malicious users could try to DoS SFUs by specifying them as their foci.
 
 SFrame E2EE may go horribly wrong if we can't send the new megolm session fast enough to all the participants when a participant leave (and meanwhile if we keep using the old session, we're technically leaking call media to the parted participant until we manage to rotate).

From ed37a0dc7428f4f4510cc9124fc58bd5e760cab1 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Thu, 23 Sep 2021 01:08:48 +0100
Subject: [PATCH 06/24] add `purpose` from #3077

---
 proposals/3401-group-voip.md | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 40b645ca9ba..bc8c16617f3 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -127,11 +127,22 @@ For instance:
                         "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
                         "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
                         "audio": [
-                            { "id": "zbhsbdhwe", "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
+                            {
+                                "id": "zbhsbdhwe",
+                                "purpose": "m.usermedia",
+                                "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
                         ],
                         "video": [
-                            { "id": "zbhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
-                            { "id": "zbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
+                            {
+                                "id": "zbhsbdhzs", 
+                                "purpose": "m.usermedia", 
+                                "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } 
+                            },
+                            { 
+                                "id": "zbhsbdhzx", 
+                                "purpose": "m.usermedia", 
+                                "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 }
+                            },
                         ],
                         "mosaic": {}, // for composited video streams?
                     },
@@ -140,8 +151,16 @@ For instance:
                         "name": "Screenshare", // optional
                         "device_id": "ASDUHDGFYUW",
                         "video": [
-                            { "id": "xhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
-                            { "id": "xbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
+                            { 
+                                "id": "xhsbdhzs",
+                                "purpose": "m.screenshare", 
+                                "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 }
+                            },
+                            { 
+                                "id": "xbhsbdhzx",
+                                "purpose": "m.screenshare", 
+                                "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 }
+                            },
                         ]
                     },
                 ]

From 33a64f2921ae82fb093412d9585af7f0d877a1f9 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Thu, 23 Sep 2021 22:33:12 +0100
Subject: [PATCH 07/24] Update proposals/3401-group-voip.md

typo

Co-authored-by: Jonathan de Jong <jonathandejong02@gmail.com>
---
 proposals/3401-group-voip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index bc8c16617f3..0d5533e0361 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -75,7 +75,7 @@ The user who wants to initiate a call sends a `m.call` state event into the room
  * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).  
  * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
  * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
- * The State key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` is mutable).  If there are multiple non-termianted conf ID state events in the room, the client should display the most recently edited event.
+ * The State key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` are mutable).  If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.
 
 For instance:
 

From 7fd1ba6daeff53dbaba5564232654cb66c7488bd Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Sat, 25 Sep 2021 11:47:57 +0100
Subject: [PATCH 08/24] converge  better with #3077 and WebRTC norms

---
 proposals/3401-group-voip.md | 55 +++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 0d5533e0361..a420ffde2dd 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -105,7 +105,7 @@ The fields within the item in the `m.calls` contents are:
 
  * `m.call_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match an unterminated `m.call` event, it should be ignored.
  * `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
- * `m.sources` - Optionally, the user can list the various combinations of media streams they are able to send.  This is important if connecting to an SFU, as it lets the SFU know what simulcast resolutions the sender can send.  In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.  If the conference has no SFU, this list defines the devices which other devices should connect to full-mesh in order to participate.
+ * `m.sources` - Optionally, the user can list the various media streams (and tracks within the streams) they are able to send.  This is important if connecting to an SFU, as it lets the SFU know what simulcast tracks the sender can send.  In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.  If the conference has no SFU, this list defines the devices which other devices should connect to full-mesh in order to participate.
 
 For instance:
 
@@ -123,25 +123,40 @@ For instance:
                 ],
                 "m.sources": [
                     {
-                        "id": "qegwy64121wqw",
+                        "id": "qegwy64121wqw", // WebRTC MediaStream id
+                        "purpose": "m.usermedia",
                         "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
                         "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
                         "audio": [
                             {
-                                "id": "zbhsbdhwe",
-                                "purpose": "m.usermedia",
-                                "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
+                                "id": "zbhsbdhwe", // WebRTC MediaStreamTrack id
+                                "settings": { // WebRTC MediaTrackSettings object
+                                    "channelCount": 2,
+                                    "sampleRate": 48000,
+                                    "m.maxbr": 32000, // Matrix-specific extension to advertise the max bitrate of this track
+                                }
+                            },
                         ],
                         "video": [
                             {
                                 "id": "zbhsbdhzs", 
-                                "purpose": "m.usermedia", 
-                                "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } 
+                                "settings": {
+                                    "width": 1280,
+                                    "height": 720,
+                                    "facingMode": "user",
+                                    "frameRate": 30.0,
+                                    "m.maxbr": 512000,
+                                } 
                             },
                             { 
                                 "id": "zbhsbdhzx", 
-                                "purpose": "m.usermedia", 
-                                "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 }
+                                "settings": {
+                                    "width": 320,
+                                    "height": 240,
+                                    "facingMode": "user",
+                                    "frameRate": 15.0,
+                                    "m.maxbr": 64000,
+                                } 
                             },
                         ],
                         "mosaic": {}, // for composited video streams?
@@ -149,17 +164,19 @@ For instance:
                     {
                         "id": "suigv372y8378",
                         "name": "Screenshare", // optional
+                        "purpose": "m.screenshare", 
                         "device_id": "ASDUHDGFYUW",
                         "video": [
-                            { 
-                                "id": "xhsbdhzs",
-                                "purpose": "m.screenshare", 
-                                "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 }
-                            },
-                            { 
-                                "id": "xbhsbdhzx",
-                                "purpose": "m.screenshare", 
-                                "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 }
+                            {
+                                "id": "xbhsbdhzs", 
+                                "settings": {
+                                    "width": 3072,
+                                    "height": 1920,
+                                    "cursor": "moving",
+                                    "displaySurface": "monitor",
+                                    "frameRate": 30.0,
+                                    "m.maxbr": 768000,
+                                } 
                             },
                         ]
                     },
@@ -170,7 +187,7 @@ For instance:
 }
 ```
 
-XXX: properly specify the formats here (webrtc constraints perhaps)?  
+This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but provides the full set of information needed for all devices in the room to know what streams are available in the group call without having to independently discover them from the SFU.
 
 It's acceptable to advertise rigid formats here rather than dynamically negotiating resolution, bitrate etc, as in a group call we should just pick plausible desirable formats rather than try to please everyone.
 

From 669d471d965946292da808134c60f29a781db13b Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Sat, 25 Sep 2021 11:54:27 +0100
Subject: [PATCH 09/24] tracks have to be identified by stream + track tuple

---
 proposals/3401-group-voip.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
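
Reviewer note (illustrative only, not part of the patch): a minimal TypeScript
sketch of building the `select` message with the stream + track tuples this
commit introduces. Only `op`, `conf_id`, `start`, `stop`, `stream_id` and
`track_id` come from the proposal; the type names and the `buildSelect` and
`trackKey` helpers are hypothetical.

```typescript
// Hypothetical shapes mirroring the JSON example in the proposal.
interface TrackRef {
  stream_id: string; // WebRTC MediaStream id
  track_id: string;  // WebRTC MediaStreamTrack id
}

interface SelectMessage {
  op: "select";
  conf_id: string;
  start: TrackRef[];
  stop: TrackRef[];
}

// Assemble a select message; nothing is sent here.
function buildSelect(confId: string, start: TrackRef[], stop: TrackRef[]): SelectMessage {
  return { op: "select", conf_id: confId, start, stop };
}

// Assumption: a client that indexes tracks locally keys them on both ids,
// so that ("a", "b") and ("b", "a") never collide.
function trackKey(ref: TrackRef): string {
  return JSON.stringify([ref.stream_id, ref.track_id]);
}

const exampleSelect = buildSelect(
  "cvsiu2893",
  [
    { stream_id: "qegwy64121wqw", track_id: "zbhsbdhwe" },
    { stream_id: "qegwy64121wqw", track_id: "zbhsbdhzs" },
  ],
  [{ stream_id: "suigv372y8378", track_id: "xbhsbdhzs" }],
);

// Serialised form that would go over the data channel.
const exampleSelectPayload = JSON.stringify(exampleSelect);
```

Serialising through `JSON.stringify` also sidesteps hand-written separator
mistakes in the `start` and `stop` arrays.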

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index a420ffde2dd..4259b980fe9 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -233,11 +233,11 @@ To select a stream over this channel, the peer sends:
     "op": "select",
     "conf_id": "cvsiu2893",
     "start": [
-        "zbhsbdhwe",
-        "zbhsbdhzs",
+        { "stream_id": "qegwy64121wqw", "track_id": "zbhsbdhwe" },
+        { "stream_id": "qegwy64121wqw", "track_id": "zbhsbdhzs" }
     ],
     "stop": [
-        "zbhsbdhxz",
+        { "stream_id": "suigv372y8378", "track_id": "xbhsbdhzs" }
     ]    
 }
 ```

From 48526ade65b677474ae703ee39f525c9701f8b57 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Wed, 13 Oct 2021 00:57:43 +0100
Subject: [PATCH 10/24] spell out that you should ignore `m.call.member` state
 events from parted users

---
 proposals/3401-group-voip.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 4259b980fe9..cd609570c16 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -99,7 +99,9 @@ We mandate at most one call per room at any given point to avoid UX nightmares -
 
 ### Call participation
 
-Users who want to participate in the call declare this by publishing a `m.call.member` state event using their matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains an array of `m.calls` object describing which calls the user is participating in within that room.  This array must contain one item (for now)>
+Users who want to participate in the call declare this by publishing a `m.call.member` state event using their matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains an array of `m.calls` object describing which calls the user is participating in within that room.  This array must contain one item (for now).
+
+`m.call.member` state events must be ignored if their user's `m.room.member` event's membership field is not `join`.
 
 The fields within the item in the `m.calls` contents are:
 
@@ -298,4 +300,4 @@ Too many foci will chew bandwidth due to fullmesh between them.  In the worst ca
 
 ## Unstable prefix
 
-...
\ No newline at end of file
+...

From dfd4ffe09ab55c2cabc6fef0875e0a8457067918 Mon Sep 17 00:00:00 2001
From: Robert Long <robert@robertlong.me>
Date: Wed, 9 Mar 2022 14:43:45 -0800
Subject: [PATCH 11/24] Add basic call sequence diagram

---
 proposals/3401-group-voip.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index cd609570c16..c49c55769ed 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -258,6 +258,35 @@ XXX: define how these DC messages muxes with other traffic, and consider what me
 
 TODO: spell out how this works with active speaker detection & associated signalling
 
+## Example Diagrams
+
+**Legend**
+
+| Arrow Style | Description |
+|-------------|-------------|
+| Solid | [State Event](https://spec.matrix.org/latest/client-server-api/#types-of-room-events) |
+| Dashed | [Event (sent as to-device message)](https://spec.matrix.org/latest/client-server-api/#send-to-device-messaging) |
+
+
+### Basic Call
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant Alice
+    participant Room
+    participant Bob
+    Alice->>Room: m.call
+    Alice->>Room: m.call.member
+    Bob->>Room: m.call.member
+    Alice-->>Bob: m.call.invite
+    Alice-->>Bob: m.call.candidates
+    Alice-->>Bob: m.call.candidates
+    Bob-->>Alice: m.call.answer
+    Bob-->>Alice: m.call.candidates
+    Alice-->>Bob: m.call.select_answer
+```
+
 ## Encryption
 
 We get E2EE for 1:1 and full mesh calls automatically in this model.

From 3c306cc5fbeb151a53b938ee85a6ecec2d21bba5 Mon Sep 17 00:00:00 2001
From: Robert Long <robert@robertlong.me>
Date: Wed, 9 Mar 2022 14:47:37 -0800
Subject: [PATCH 12/24] Remove SFU datachannel ping/pong timeout section

---
 proposals/3401-group-voip.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index c49c55769ed..cc7f6e02dda 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -252,8 +252,6 @@ If you are using your SFU in a call, it will need to know how to connect to othe
 
 Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.call` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.call`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
 
-Finally, the DC transport is also used to detect connectivity timeouts more rapidly than webrtc's media timeout would allow, while avoiding clogging up the homeserver with keepalive traffic. This is done by each side sending a `"op": "ping"` packet every few seconds, and timing out the call if an `"op": "pong"` doesn't arrive within 5 seconds.
-
 XXX: define how these DC messages muxes with other traffic, and consider what message encoding to actually use.
 
 TODO: spell out how this works with active speaker detection & associated signalling

From 4d43aae3b105b0bf5193c4670dcba91c093e5458 Mon Sep 17 00:00:00 2001
From: Robert Long <robert@robertlong.me>
Date: Wed, 9 Mar 2022 19:13:40 -0800
Subject: [PATCH 13/24] Update m.call.member and call setup sections

---
 proposals/3401-group-voip.md | 93 +++++++++++-------------------------
 1 file changed, 29 insertions(+), 64 deletions(-)
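
Reviewer note (illustrative only, not part of the patch): a minimal TypeScript
sketch of the two `session_id` rules this commit describes. Incoming to-device
messages whose `dest_session_id` does not match the local session are
discarded, and a changed `session_id` for a known device signals that existing
calls from that device should be terminated. The field names come from the
proposal; `ToDeviceCallContent`, `isForThisSession` and `SessionTracker` are
hypothetical names, and keying the tracker on user id plus device id is an
assumption.

```typescript
// Subset of the to-device m.call.* content described in the proposal.
interface ToDeviceCallContent {
  conf_id: string;
  dest_session_id: string;
  device_id?: string;         // only on m.call.invite
  sender_session_id?: string; // only on m.call.invite
}

// Messages addressed to a different (for example, previous) session are dropped.
function isForThisSession(content: ToDeviceCallContent, ownSessionId: string): boolean {
  return content.dest_session_id === ownSessionId;
}

// Remembers the last session_id seen per (user, device) pair.
class SessionTracker {
  private readonly sessions = new Map<string, string>();

  // Returns true when a known device shows up with a new session_id,
  // i.e. when existing calls from that device should be torn down.
  observe(userId: string, deviceId: string, sessionId: string): boolean {
    const key = JSON.stringify([userId, deviceId]);
    const previous = this.sessions.get(key);
    this.sessions.set(key, sessionId);
    return previous !== undefined && previous !== sessionId;
  }
}
```

A caller would run `observe` for every device in each incoming `m.call.member`
event and hang up the matching call whenever it returns `true`.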

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index cc7f6e02dda..6b4d86aa0e6 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -107,7 +107,11 @@ The fields within the item in the `m.calls` contents are:
 
  * `m.call_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match an unterminated `m.call` event, it should be ignored.
  * `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
- * `m.sources` - Optionally, the user can list the various media streams (and tracks within the streams) they are able to send.  This is important if connecting to an SFU, as it lets the SFU know what simulcast tracks the sender can send.  In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.  If the conference has no SFU, this list defines the devices which other devices should connect to full-mesh in order to participate.
+ * `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
+   * `device_id` - The device id to use for to-device messages when establishing a call
+   * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
+   * `feeds` - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
+     * `purpose` - Either `m.usermedia` or `m.screenshare`; a feed with any other purpose should be ignored.
 
 For instance:
 
@@ -123,65 +127,23 @@ For instance:
                     "@sfu-lon:matrix.org",
                     "@sfu-nyc:matrix.org",
                 ],
-                "m.sources": [
+                "m.devices": [
                     {
-                        "id": "qegwy64121wqw", // WebRTC MediaStream id
-                        "purpose": "m.usermedia",
-                        "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
-                        "device_id": "ASDUHDGFYUW", // just in case people ending up dialing this directly for full mesh or 1:1
-                        "audio": [
+                        "device_id": "ASDUHDGFYUW", // Used to target to-device messages
+                        "session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
+                        "feeds": [
                             {
-                                "id": "zbhsbdhwe", // WebRTC MediaStreamTrack id
-                                "settings": { // WebRTC MediaTrackSettings object
-                                    "channelCount": 2,
-                                    "sampleRate": 48000,
-                                    "m.maxbr": 32000, // Matrix-specific extension to advertise the max bitrate of this track
-                                }
+                                "purpose": "m.usermedia"
+                                // TODO: Add tracks
+                                // TODO: Available bitrates etc. should be listed here
                             },
-                        ],
-                        "video": [
                             {
-                                "id": "zbhsbdhzs", 
-                                "settings": {
-                                    "width": 1280,
-                                    "height": 720,
-                                    "facingMode": "user",
-                                    "frameRate": 30.0,
-                                    "m.maxbr": 512000,
-                                } 
-                            },
-                            { 
-                                "id": "zbhsbdhzx", 
-                                "settings": {
-                                    "width": 320,
-                                    "height": 240,
-                                    "facingMode": "user",
-                                    "frameRate": 15.0,
-                                    "m.maxbr": 64000,
-                                } 
-                            },
-                        ],
-                        "mosaic": {}, // for composited video streams?
-                    },
-                    {
-                        "id": "suigv372y8378",
-                        "name": "Screenshare", // optional
-                        "purpose": "m.screenshare", 
-                        "device_id": "ASDUHDGFYUW",
-                        "video": [
-                            {
-                                "id": "xbhsbdhzs", 
-                                "settings": {
-                                    "width": 3072,
-                                    "height": 1920,
-                                    "cursor": "moving",
-                                    "displaySurface": "monitor",
-                                    "frameRate": 30.0,
-                                    "m.maxbr": 768000,
-                                } 
-                            },
+                                "purpose": "m.screenshare"
+                                // TODO: Add tracks
+                                // TODO: Available bitrates etc. should be listed here
+                            }
                         ]
-                    },
+                    }
                 ]
             }
         ]
@@ -189,22 +151,25 @@ For instance:
 }
 ```
 
-This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but providing the full set of information needed for all devices in the room to know what streams are available in the group call without having to independently discover them from the SFU.
+This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but providing the full set of information needed for all devices in the room to know what feeds are available in the group call without having to independently discover them from the SFU.
 
-It's acceptable to advertise rigid formats here rather than dynamically negotiating resolution, bitrate etc, as in a group call we should just pick plausible desirable formats rather than try to please everyone.
+** TODO: Add tracks field **
+** TODO: Add bitrate/format fields **
 
-If a device loses connectivity, it is not particularly problematic that the membership data will be stale: all that will happen is that calls to the disconnected device will fail due to media or data-channel keepalive timeouts, and then subsequent attempts to call that device will fail.  Therefore (unlike the earlier demos) we don't need to spot timeouts by constantly re-posting the state event.
+Clients should do their best to ensure that calls in `m.call.member` state are removed when the member leaves the call. However, there will be cases where the device loses network connectivity or power, or the application is force-closed or crashes. If the `m.call.member` state has stale device data, call setup will fail. Clients should re-attempt invites up to 3 times before giving up on calling a member.
 
 ### Call setup
 
 Call setup then uses the normal `m.call.*` events, except they are sent over to-device messages to the relevant devices (encrypted via Olm).  This means:
 
- * When initiating a 1:1 call, the `m.call.invite` is sent to `*` devices of the intended target user.
-     * Once the user answers the call from the device, the sender should rescind the other pending to-device messages, ensuring that other devices don't get spammed about long-obsolete 1:1 calls.  XXX: We will need a way to rescind pending to-device msgs.
-     * Subsequent candidates and other events are sent only to the device who answered.
-     * XXX: do we still need MSC2746's `party_id` and `m.call.select_answer`?
- * We will need to include the `m.call_id` and room_id so that peers can map the call to the right room.
- * However, especially for 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
+ * When initiating a 1:1 call, the `m.call.invite` is sent to the devices listed in `m.call.member` event's `m.devices` array using the `device_id` field.
+ * `m.call.*` events sent via to-device messages should also include the following properties in their content:
+   * `conf_id` - The group call id listed in `m.call`
+   * `dest_session_id` - The recipient's session id. Incoming messages with a `dest_session_id` that doesn't match your current session id should be discarded.
+ * In addition to the fields above, `m.call.invite` events sent via to-device messages should include the following properties:
+   * `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
+   * `sender_session_id` - Like the `device_id` the `sender_session_id` is used by the opponent member to filter out messages unrelated to the sender's session even if the `m.call.member` event has not been received yet.
+ * For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
  * When initiating a group call, we need to decide which devices to actually talk to.
      * If the client has no SFU configured, we try to use the `m.foci` in the `m.call` event.
          * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.

From 856ddc75701e1af3dbd51c407efcfdcf1a5c0ea7 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Sat, 28 May 2022 11:49:19 +0100
Subject: [PATCH 14/24] spell out the unstable prefix

---
 proposals/3401-group-voip.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 6b4d86aa0e6..898392c3517 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -292,4 +292,7 @@ Too many foci will chew bandwidth due to fullmesh between them.  In the worst ca
 
 ## Unstable prefix
 
-...
+| stable event type | unstable event type |
+|-------------------|---------------------|
+| m.call            | org.matrix.msc3401.call |
+| m.call.member     | org.matrix.msc3401.call.member |

From d109b5431fd081b10182659112e676a98dce2d66 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Tue, 31 May 2022 00:13:30 +0100
Subject: [PATCH 15/24] add tracks back into m.call.member for SFUs to use

---
 proposals/3401-group-voip.md | 43 +++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 898392c3517..9c32fa59b02 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -133,14 +133,45 @@ For instance:
                         "session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
                         "feeds": [
                             {
-                                "purpose": "m.usermedia"
-                                // TODO: Add tracks
-                                // TODO: Available bitrates etc. should be listed here
+                                "purpose": "m.usermedia",
+                                "tracks": [
+                                    {
+                                        "type": "audio",
+                                        "id": "zvhjiwqsx", // WebRTC MediaStreamTrack id
+                                        "settings": { // WebRTC MediaTrackSettings object
+                                            "channelCount": 2,
+                                            "sampleRate": 48000,
+                                            "m.maxbr": 32000, // Matrix-specific extension to advertise the max bitrate of this track
+                                        }
+                                    },
+                                    {
+                                        "type": "video",
+                                        "id": "zbhsbdhzs",
+                                        "settings": {
+                                            "width": 1280,
+                                            "height": 720,
+                                            "facingMode": "user",
+                                            "frameRate": 30.0,
+                                            "m.maxbr": 512000,
+                                        }
+                                    },
+                                ],
                             },
                             {
-                                "purpose": "m.screenshare"
-                                // TODO: Add tracks
-                                // TODO: Available bitrates etc. should be listed here
+                                "purpose": "m.screenshare",
+                                "tracks": [
+                                    {
+                                        "id": "xbhsbdhzs",
+                                        "settings": {
+                                            "width": 3072,
+                                            "height": 1920,
+                                            "cursor": "moving",
+                                            "displaySurface": "monitor",
+                                            "frameRate": 30.0,
+                                            "m.maxbr": 768000,
+                                        }
+                                    },
+                                ]
                             }
                         ]
                     }

From 07f95477a06dda4891d38ef307fbffa49ead96a6 Mon Sep 17 00:00:00 2001
From: Matthew Hodgson <matthew@matrix.org>
Date: Fri, 3 Jun 2022 18:19:38 +0100
Subject: [PATCH 16/24] add session IDs & labels

---
 proposals/3401-group-voip.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 9c32fa59b02..a9534a4678c 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -134,10 +134,12 @@ For instance:
                         "feeds": [
                             {
                                 "purpose": "m.usermedia",
+                                "id": "qegwy64121wqw", // WebRTC MediaStream id
                                 "tracks": [
                                     {
-                                        "type": "audio",
+                                        "kind": "audio",
                                         "id": "zvhjiwqsx", // WebRTC MediaStreamTrack id
+                                        "label": "Sennheiser Mic",
                                         "settings": { // WebRTC MediaTrackSettings object
                                             "channelCount": 2,
                                             "sampleRate": 48000,
@@ -145,8 +147,9 @@ For instance:
                                         }
                                     },
                                     {
-                                        "type": "video",
+                                        "kind": "video",
                                         "id": "zbhsbdhzs",
+                                        "label": "Logitech Webcam",
                                         "settings": {
                                             "width": 1280,
                                             "height": 720,
@@ -159,9 +162,12 @@ For instance:
                             },
                             {
                                 "purpose": "m.screenshare",
+                                "id": "suigv372y8378",
                                 "tracks": [
                                     {
+                                        "kind": "video",
                                         "id": "xbhsbdhzs",
+                                        "label": "My Screenshare",
                                         "settings": {
                                             "width": 3072,
                                             "height": 1920,

From 7a06ed7897586e364bb55a05e2515bf5cef4083d Mon Sep 17 00:00:00 2001
From: Robin <robin@robin.town>
Date: Thu, 16 Jun 2022 09:36:54 -0400
Subject: [PATCH 17/24] Let call member events expire (#3831)

* Expire m.call.member events after 1 hour

* Allow clients to choose their own timeouts
---
 proposals/3401-group-voip.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
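
Reviewer note (illustrative only, not part of the patch): a minimal TypeScript
sketch of the validity rule this commit adds. An `m.call.member` event is
ignored when `m.expires_ts` has passed or when the sender's `m.room.member`
membership is not `join`. The helper names, the one-hour default lifetime and
the refresh margin are hypothetical; treating a missing or non-numeric
`m.expires_ts` as expired, and treating `now == m.expires_ts` as expired, are
assumptions the proposal does not spell out.

```typescript
// Only the fields this check needs; the rest of the content is omitted.
interface CallMemberContent {
  "m.calls": unknown[];
  "m.expires_ts"?: number; // milliseconds since the Unix epoch
}

// True when the event should be honoured by other clients.
function isCallMemberActive(
  content: CallMemberContent,
  membership: string,
  nowMs: number,
): boolean {
  if (membership !== "join") return false;
  const expires = content["m.expires_ts"];
  if (typeof expires !== "number") return false; // assumption: treat as stale
  return nowMs < expires;
}

// Hypothetical default lifetime; clients may choose their own.
const DEFAULT_LIFETIME_MS = 60 * 60 * 1000;

function nextExpiry(nowMs: number, lifetimeMs: number = DEFAULT_LIFETIME_MS): number {
  return nowMs + lifetimeMs;
}

// True once the event is close enough to expiry that the client should
// re-send it with a fresh m.expires_ts.
function needsRefresh(expiresTs: number, nowMs: number, marginMs: number): boolean {
  return expiresTs - nowMs <= marginMs;
}
```

A connected client would call `needsRefresh` on a timer and, when it returns
`true`, re-publish its state event with `nextExpiry(Date.now())`.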

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index a9534a4678c..14de0ae9a46 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -99,9 +99,11 @@ We mandate at most one call per room at any given point to avoid UX nightmares -
 
 ### Call participation
 
-Users who want to participate in the call declare this by publishing a `m.call.member` state event using their matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains an array of `m.calls` object describing which calls the user is participating in within that room.  This array must contain one item (for now).
+Users who want to participate in the call declare this by publishing a `m.call.member` state event using their matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains a timestamp named `m.expires_ts` describing when this data should be considered stale, and an array `m.calls` of objects describing which calls the user is participating in within that room.  This array must contain one item (for now).
 
-`m.call.member` state events must be ignored if their user's `m.room.member` event's membership field is not `join`.
+When sending an `m.call.member` event, clients should choose a reasonable value for `m.expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp.
+
+`m.call.member` state events must be ignored if the `m.expires_ts` field indicates they have expired, or if their user's `m.room.member` event's membership field is not `join`.
 
 The fields within the item in the `m.calls` contents are:
 
@@ -183,7 +185,8 @@ For instance:
                     }
                 ]
             }
-        ]
+        ],
+        "m.expires_ts":  1654616071686
     }
 }
 ```

From 32f566aa6559a8501ed17ec47f11a5929cf177c8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=A0imon=20Brandner?= <simon.bra.ag@gmail.com>
Date: Fri, 21 Oct 2022 15:51:21 +0200
Subject: [PATCH 18/24] Rip out SFU bits out of MSC3401 (#3897)

---
 proposals/3401-group-voip.md | 103 +++++------------------------------
 1 file changed, 14 insertions(+), 89 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 14de0ae9a46..f1522bc570e 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -1,5 +1,9 @@
 # MSC3401: Native Group VoIP signalling
 
+Note: previously this MSC included SFU signalling which has now been moved to
+[MSC3898](https://github.com/matrix-org/matrix-spec-proposals/pull/3898) to
+avoid making this MSC too large.
+
 ## Problem
 
 VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room.
@@ -15,7 +19,11 @@ Meanwhile we have no native signalling for group calls at all, forcing you to in
 
 ## Proposal
 
-This proposal provides a signalling framework using to-device messages which can be applied to native Matrix 1:1 calls, full-mesh calls, SFU calls, cascaded SFU calls and in future MCU calls, and hybrid SFU/MCU approaches. It replaces the early flawed sketch at [MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).
+This proposal provides a signalling framework using to-device messages which can
+be applied to native Matrix 1:1 calls, full-mesh calls and in the future SFU
+calls, cascaded SFU calls, MCU calls, and hybrid SFU/MCU approaches. It replaces
+the early flawed sketch at
+[MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).
 
 This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.
 
@@ -44,7 +52,6 @@ SFU (aka Focus):
                | 
                |
                C
-
 Where F is an SFU focus
 ```
 
@@ -74,7 +81,6 @@ The user who wants to initiate a call sends a `m.call` state event into the room
  * `m.type` to say whether the initial type of call is voice only (`m.voice`) or video (`m.video`).  This signals the intent of the user when placing the call to the participants (i.e. "i want to have a voice call with you" or "i want to have a video call with you") and warns the receiver whether they may be expected to view video or not, and provide suitable initial UX for displaying that type of call... even if it later gets upgraded to a video call.
  * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).  
  * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
- * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
  * The State key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` is mutable).  If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.
 
 For instance:
@@ -86,11 +92,7 @@ For instance:
     "content": {
         "m.intent": "m.room",
         "m.type": "m.voice",
-        "m.name": "Voice room",
-        "m.foci": [
-            "@sfu-lon:matrix.org",
-            "@sfu-nyc:matrix.org",
-        ],
+        "m.name": "Voice room"
     }
 }
 ```
@@ -108,7 +110,6 @@ When sending an `m.call.member` event, clients should choose a reasonable value
 The fields within the item in the `m.calls` contents are:
 
  * `m.call_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match an unterminated `m.call` event, it should be ignored.
- * `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
  * `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
    * `device_id` - The device id to use for to-device messages when establishing a call
    * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
@@ -125,10 +126,6 @@ For instance:
         "m.calls": [
             {
                 "m.call_id": "cvsiu2893",
-                "m.foci": [
-                    "@sfu-lon:matrix.org",
-                    "@sfu-nyc:matrix.org",
-                ],
                 "m.devices": [
                     {
                         "device_id": "ASDUHDGFYUW", // Used to target to-device messages
@@ -191,8 +188,11 @@ For instance:
 }
 ```
 
-This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but providing the full set of information needed for all devices in the room to know what feeds are available in the group call without having to independently discover them from the SFU.
+This builds on [MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077), which describes streams in `m.call.*` events via a `sdp_stream_metadata` field.
 
+** TODO: Do we need all of this data? Why would we need it? **
+** TODO: This doesn't follow the MSC3077 format very well - can we do something
+about that? **
 ** TODO: Add tracks field **
 ** TODO: Add bitrate/format fields **
 
@@ -210,56 +210,6 @@ Call setup then uses the normal `m.call.*` events, except they are sent over to-
    * `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
    * `sender_session_id` - Like the `device_id` the `sender_session_id` is used by the opponent member to filter out messages unrelated to the sender's session even if the `m.call.member` event has not been received yet.
  * For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
- * When initiating a group call, we need to decide which devices to actually talk to.
-     * If the client has no SFU configured, we try to use the `m.foci` in the `m.call` event.
-         * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
-         * If there are no `m.foci` in the `m.call` event, then we look at which foci in `m.call.member` that are already in use by existing participants, and select the most common one.  (If the foci is overloaded it can reject us and we should then try the next most populous one, etc).
-         * If there are no `m.foci` in the `m.call.member`, then we connect full mesh.
-         * If subsequently `m.foci` are introduced into the conference, then we should transfer the call to them (effectively doing a 1:1->group call upgrade).
-     * If the client does have an SFU configured, then we decide whether to use it. 
-         * If other conf participants are already using it, then we use it.
-         * If there are other users from our homeserver in the conference, then we use it (as presumably they should be using it too)
-         * If there are no other `m.foci` (either in the `m.call` or in the participant state) then we use it.
-         * Otherwise, we save bandwidth on our SFU by not cascading and instead behaving as if we had no SFU configured.
- * We do not recommend that users utilise an SFU to hide behind for privacy, but instead use a TURN server, only providing relay candidates, rather than consuming SFU resources and unnecessarily mandating the presence of an SFU.
-
-TODO: spec how clients discover their homeserver's preferred SFU foci
-
-Originally this proposal suggested that foci should be identified by their `(user_id, device_id)` rather than just their user_id, in order to ensure convergence on the same device.  In practice, this is unnecessary complication if we make it the SFU implementor's problem to ensure that either only one device is logged in per SFU user - or instead you cluster the SFU devices together for the same user.  It's important to note that when calling an SFU you should call `*` devices.
-
-### SFU control
-
-SFUs are Selective Forwarding Units - a server which forwarding WebRTC streams between peers (which could be clients or SFUs or both).  To make use of them effectively, peers need to be able to tell the SFU which streams they want to receive, and the SFU must tell the peers which streams it wants to be sent.  We also need a way of telling SFUs which other SFUs to connect ("cascade") to.
-
-The client does this by establishing an optional datachannel connection to the SFU using normal `m.call.invite`, in order to perform low-latency signalling to rapidly select streams.
-
-To select a stream over this channel, the peer sends:
-
-```jsonc
-{
-    "op": "select",
-    "conf_id": "cvsiu2893",
-    "start": [
-        { "stream_id": "qegwy64121wqw", "track_id": "zbhsbdhwe" }
-        { "stream_id": "qegwy64121wqw", "track_id": "zbhsbdhzs" }
-    ],
-    "stop": [
-        { "stream_id": "suigv372y8378", "track_id": "xbhsbdhzs" }
-    ]    
-}
-```
-
-All streams are sent within a single media session (rather than us having multiple sessions or calls), and there is no difference between a peer sending simulcast streams from a webcam versus two SFUs trunking together.
-
-If no DC is established, then 1:1 calls should send all streams without prompting, and SFUs should send no streams by default.
-
-If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the fullmesh of SFU traffic (if any).  One option here is for SFUs to act as an AS and sniff the `m.call.member` traffic of their associated server, and automatically call any other `m.foci` which appear.  (They don't need to make outbound calls to clients, as clients always dial in).  Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs.  Much better to trust the server.
-
-Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.call` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.call`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
-
-XXX: define how these DC messages muxes with other traffic, and consider what message encoding to actually use.
-
-TODO: spell out how this works with active speaker detection & associated signalling
 
 ## Example Diagrams
 
@@ -270,7 +220,6 @@ TODO: spell out how this works with active speaker detection & associated signal
 | Solid | [State Event](https://spec.matrix.org/latest/client-server-api/#types-of-room-events) |
 | Dashed | [Event (sent as to-device message)](https://spec.matrix.org/latest/client-server-api/#send-to-device-messaging) |
 
-
 ### Basic Call
 
 ```mermaid
@@ -290,46 +239,22 @@ sequenceDiagram
     Alice-->>Bob: m.call.select_answer
 ```
 
-## Encryption
-
-We get E2EE for 1:1 and full mesh calls automatically in this model.
-
-However, when SFUs are on the media path, the SFU will necessarily terminate the SRTP traffic from the peer, breaking E2EE.  To address this, we apply an additional end-to-end layer of encryption to the media using [WebRTC Encoded Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md) (formerly Insertable Streams) via [SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).
-
-In order to provide PFS, The symmetric key used for these stream from a given participating device is a megolm key. Unlike a normal megolm key, this is shared via `m.room_key` over Olm to the devices participating in the conference including an `m.call_id` and `m.room_id` field on the key to correlate it to the conference traffic, rather than using the `session_id` event field to correlate (given the encrypted traffic is SRTP rather than events, and we don't want to have to send fake events from all senders every time the megolm session is replaced).
-
-The megolm key is ratcheted forward for every SFrame, and shared with new participants at the current index via `m.room_key` over Olm as per above.  When participants leave, a new megolm session is created and shared with all participants over Olm.  The new session is only used once all participants have received it.
-
 ## Potential issues
 
 To-device messages are point-to-point between servers, whereas today's `m.call.*` messages can transitively traverse servers via the room DAG, thus working around federation problems.  In practice if you are relying on that behaviour, you're already in a bad place.
 
-The SFUs participating in a conference end up in a full mesh.  Rather than inventing our own spanning-tree system for SFUs however, we should fix it for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or similar to decide what better-than-full-mesh topology to use.  In practice, full mesh cascade between SFUs is probably not that bad (especially if SFUs only request the streams over the trunk their clients care about) - and on aggregate will be less obnoxious than all the clients hitting a single SFU.
-
-SFrame mandates its own ratchet currently which is almost the same as megolm but not quite.  Switching it out for megolm seems reasonable right now (at least until MLS comes along)
-
 ## Alternatives
 
 There are many many different ways to do this.  The main other alternative considered was not to use state events to track membership, but instead gossip it via either to-device or DC messages between participants.  This fell apart however due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC.  So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.
 
-Another option is to treat 1:1 (and full mesh) entirely differently to SFU based calling rather than trying to unify them.  Also, it's debatable whether supporting full mesh is useful at all.  In the end, it feels like unifying 1:1 and SFU calling is for the best though, as it then gives you the ability to trivially upgrade 1:1 calls to group calls and vice versa, and avoids maintaining two separate hunks of spec.  It also forces 1:1 calls to take multi-stream calls seriously, which is useful for more exotic capture devices (stereo cameras; 3D cameras; surround sound; audio fields etc).
-
 An alternative to to-device messages is to use DMs.  You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls.  It does make debugging easier though, rather than having to track encrypted ephemeral to-device msgs.
 
 ## Security considerations
 
 State events are not encrypted currently, and so this leaks that a call is happening, and who is participating in it, and from which devices.
 
-Malicious users could try to DoS SFUs by specifying them as their foci.
-
-SFrame E2EE may go horribly wrong if we can't send the new megolm session fast enough to all the participants when a participant leave (and meanwhile if we keep using the old session, we're technically leaking call media to the parted participant until we manage to rotate).
-
-Need to ensure there's no scope for media forwarding loops through SFUs.
-
 Malicious users in a room could try to sabotage a conference by overwriting the `m.call` state event of the current ongoing call.
 
-Too many foci will chew bandwidth due to fullmesh between them.  In the worst case, if every use is on their own HS and picks a different foci, it degenerates to a fullmesh call (just serverside rather than clientside).  Hopefully this shouldn't happen as you will converge on using a single SFU with the most clients, but need to check how this works in practice.
-
 ## Unstable prefix
 
 | stable event type | unstable event type |

From 3fde32b16dfbb8c37494fcb0dfdd57c4d6afc3ce Mon Sep 17 00:00:00 2001
From: Robin <robin@robin.town>
Date: Wed, 30 Nov 2022 11:07:51 -0500
Subject: [PATCH 19/24] Move expiration timestamps to be per-device (#3941)

As discussed with Robert Long
---
 proposals/3401-group-voip.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)
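
Reviewer note (illustrative only, not part of the patch): with `expires_ts` moved onto each device, a receiving client could filter stale devices roughly as sketched below. The helper name `active_devices` and its parameters are hypothetical. The sketch assumes `expires_ts` is in milliseconds, as the example value in this patch suggests, and that a missing or non-numeric `expires_ts` counts as expired; the proposal text does not specify either point.

```python
import time
from typing import Any, Dict, List, Optional


def active_devices(
    member_content: Dict[str, Any],
    membership: str,
    now_ms: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Return the devices of one m.call.member event that are not stale.

    `member_content` is the event content, `membership` is the membership
    field of the same user's m.room.member event.
    """
    # Devices are ignored entirely unless the user is joined to the room.
    if membership != "join":
        return []
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    live: List[Dict[str, Any]] = []
    for call in member_content.get("m.calls", []):
        for device in call.get("m.devices", []):
            expires_ts = device.get("expires_ts")
            # Assumption: missing or malformed expiry is treated as expired.
            if isinstance(expires_ts, (int, float)) and expires_ts > now_ms:
                live.append(device)
    return live
```

Passing `now_ms` explicitly keeps the check deterministic in tests; a client would normally omit it and use the current time.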

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index f1522bc570e..2eb96d9c3e2 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -101,11 +101,7 @@ We mandate at most one call per room at any given point to avoid UX nightmares -
 
 ### Call participation
 
-Users who want to participate in the call declare this by publishing a `m.call.member` state event using their matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains a timestamp named `m.expires_ts` describing when this data should be considered stale, and an array `m.calls` of objects describing which calls the user is participating in within that room.  This array must contain one item (for now).
-
-When sending an `m.call.member` event, clients should choose a reasonable value for `m.expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp.
-
-`m.call.member` state events must be ignored if the `m.expires_ts` field indicates they have expired, or if their user's `m.room.member` event's membership field is not `join`.
+Users who want to participate in the call declare this by publishing an `m.call.member` state event using their Matrix ID as the state key (thus ensuring other users cannot edit it).  The event contains an array `m.calls` of objects describing which calls the user is participating in within that room.  This array must contain one item (for now).
 
 The fields within the item in the `m.calls` contents are:
 
@@ -113,6 +109,7 @@ The fields within the item in the `m.calls` contents are:
  * `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
    * `device_id` - The device id to use for to-device messages when establishing a call
    * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
+   * `expires_ts` -  A timestamp describing when this device data should be considered stale. When updating their own device state, clients should choose a reasonable value for `expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp. A device must be ignored if the `expires_ts` field indicates it has expired, or if the user's `m.room.member` event's membership field is not `join`.
    * `feeds` - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
      * `purpose` - Either `m.usermedia` or `m.screenshare` otherwise the feed should be ignored.
 
@@ -130,6 +127,7 @@ For instance:
                     {
                         "device_id": "ASDUHDGFYUW", // Used to target to-device messages
                         "session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
+                        "expires_ts": 1654616071686,
                         "feeds": [
                             {
                                 "purpose": "m.usermedia",
@@ -183,7 +181,6 @@ For instance:
                 ]
             }
         ],
-        "m.expires_ts":  1654616071686
     }
 }
 ```

From 05b5db22d830d3ee2c3f230fba1bc3560af7912f Mon Sep 17 00:00:00 2001
From: Robin <robin@robin.town>
Date: Wed, 30 Nov 2022 11:08:01 -0500
Subject: [PATCH 20/24] Specify who calls who (#3942)

---
 proposals/3401-group-voip.md | 2 ++
 1 file changed, 2 insertions(+)
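
Reviewer note (illustrative only, not part of the patch): the tie-break rule added here amounts to comparing `(user_id, device_id)` pairs. A minimal sketch follows, with the hypothetical helper name `should_call`. It assumes "lexicographically lower" means ordinary code-point string ordering, which the proposal text does not spell out.

```python
from typing import Tuple

# (user_id, device_id) identifies one participant device in the call.
Participant = Tuple[str, str]


def should_call(local: Participant, remote: Participant) -> bool:
    """Return True if the local device is responsible for calling the remote one."""
    if local == remote:
        raise ValueError("a device cannot call itself")
    # Tuple comparison checks user IDs first and only falls back to
    # device IDs when both user IDs are equal.
    # Assumption: plain code-point ordering of the strings.
    return local < remote
```

Because both sides evaluate the same comparison with the arguments swapped, exactly one of any two distinct devices ends up placing the call.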

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 2eb96d9c3e2..911d78b8aad 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -197,6 +197,8 @@ Clients should do their best to ensure that calls in `m.call.member` state are r
 
 ### Call setup
 
+In a full mesh call, for any two participants, the one with the lexicographically lower user ID is responsible for calling the other. If two participants share the same user ID (that is, if a user has joined the call from multiple devices), then the one with the lexicographically lower device ID is responsible for calling the other.
+
 Call setup then uses the normal `m.call.*` events, except they are sent over to-device messages to the relevant devices (encrypted via Olm).  This means:
 
  * When initiating a 1:1 call, the `m.call.invite` is sent to the devices listed in `m.call.member` event's `m.devices` array using the `device_id` field.

From 43dc42fd84251f171527fe13246b68a5d808014d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=A0imon=20Brandner?= <simon.bra.ag@gmail.com>
Date: Sat, 3 Dec 2022 12:25:59 +0100
Subject: [PATCH 21/24] Clarify `expires_ts`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
---
 proposals/3401-group-voip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 911d78b8aad..2f0cf734666 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -109,7 +109,7 @@ The fields within the item in the `m.calls` contents are:
  * `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
    * `device_id` - The device id to use for to-device messages when establishing a call
    * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
-   * `expires_ts` -  A timestamp describing when this device data should be considered stale. When updating their own device state, clients should choose a reasonable value for `expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp. A device must be ignored if the `expires_ts` field indicates it has expired, or if the user's `m.room.member` event's membership field is not `join`.
+   * `expires_ts` - A POSIX timestamp in milliseconds describing when this device data should be considered stale. When updating their own device state, clients should choose a reasonable value for `expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp. A device must be ignored if the `expires_ts` field indicates it has expired, or if the user's `m.room.member` event's membership field is not `join`.
    * `feeds` - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
      * `purpose` - Either `m.usermedia` or `m.screenshare` otherwise the feed should be ignored.
 

From 5635cee60ce8b5bcffc050a246a0752291536b2b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=A0imon=20Brandner?= <simon.bra.ag@gmail.com>
Date: Sat, 3 Dec 2022 12:33:12 +0100
Subject: [PATCH 22/24] Add `seq`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
---
 proposals/3401-group-voip.md | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)
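
Reviewer note (illustrative only, not part of the patch): one way a receiver might use `seq` is a small reorder buffer that releases messages strictly in sequence. The class name `SeqReorderBuffer` is hypothetical. The sketch assumes one buffer per sender device and session, and that duplicates are dropped; the proposal does not define the scope of the counter or how gaps should time out.

```python
from typing import Any, Dict, List


class SeqReorderBuffer:
    """Release to-device message contents in increasing `seq` order."""

    def __init__(self) -> None:
        self._next_seq = 0  # the proposal says numbering starts at 0
        self._pending: Dict[int, Dict[str, Any]] = {}

    def push(self, content: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Store one message and return every message that is now deliverable."""
        seq = content["seq"]
        # Assumption: already-delivered or duplicate sequence numbers are dropped.
        if seq < self._next_seq or seq in self._pending:
            return []
        self._pending[seq] = content

        ready: List[Dict[str, Any]] = []
        # Drain the contiguous run starting at the next expected number.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

A real client would also need a policy for a message that never arrives, for example a timeout after which the call is torn down; that is left out here.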

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index 2f0cf734666..e0512c6b947 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -201,14 +201,21 @@ In a full mesh call, for any two participants, the one with the lexicographicall
 
 Call setup then uses the normal `m.call.*` events, except they are sent over to-device messages to the relevant devices (encrypted via Olm).  This means:
 
- * When initiating a 1:1 call, the `m.call.invite` is sent to the devices listed in `m.call.member` event's `m.devices` array using the `device_id` field.
- * `m.call.*` events sent via to-device messages should also include the following properties in their content:
-   * `conf_id` - The group call id listed in `m.call`
-   * `dest_session_id` - The recipient's session id. Incoming messages with a `dest_session_id` that doesn't match your current session id should be discarded.
- * In addition to the fields above `m.call.invite` events sent via to-device messages should include the following properties  :
-   * `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
-   * `sender_session_id` - Like the `device_id` the `sender_session_id` is used by the opponent member to filter out messages unrelated to the sender's session even if the `m.call.member` event has not been received yet.
- * For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
+* When initiating a 1:1 call, the `m.call.invite` is sent to the devices listed in the `m.call.member` event's `m.devices` array using the `device_id` field.
+* `m.call.*` events sent via to-device messages should also include the following properties in their content:
+  * `conf_id` - The group call id listed in `m.call`
+  * `dest_session_id` - The recipient's session id. Incoming messages with a
+    `dest_session_id` that doesn't match your current session id should be
+    discarded.
+  * `seq` - The sequence number of the to-device message. This is needed because
+    the order of to-device messages is not guaranteed. The number starts at `0`
+    and is incremented by `1` with each new to-device message.
+* In addition to the fields above, `m.call.invite` events sent via to-device messages should include the following properties:
+  * `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
+  * `sender_session_id` - Like the `device_id`, the `sender_session_id` is used
+    by the opponent member to filter out messages unrelated to the sender's
+    session even if the `m.call.member` event has not been received yet.
+* For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
 
 ## Example Diagrams
 

From b8ebe275b20f9f1b9652aa54a26ba8258d9641cb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=A0imon=20Brandner?= <simon.bra.ag@gmail.com>
Date: Sat, 3 Dec 2022 12:33:39 +0100
Subject: [PATCH 23/24] Use heading for Legend
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
---
 proposals/3401-group-voip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index e0512c6b947..ff79e78bc27 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -219,7 +219,7 @@ Call setup then uses the normal `m.call.*` events, except they are sent over to-
 
 ## Example Diagrams
 
-**Legend**
+### Legend
 
 | Arrow Style | Description |
 |-------------|-------------|

From 6b98d667cf634f78c6604151276d5ef25d305aac Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=A0imon=20Brandner?= <simon.bra.ag@gmail.com>
Date: Sat, 3 Dec 2022 12:36:10 +0100
Subject: [PATCH 24/24] Fix-up some formatting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
---
 proposals/3401-group-voip.md | 101 ++++++++++++++++++++++++-----------
 1 file changed, 71 insertions(+), 30 deletions(-)

diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
index ff79e78bc27..742ce5c6048 100644
--- a/proposals/3401-group-voip.md
+++ b/proposals/3401-group-voip.md
@@ -9,11 +9,19 @@ avoid making this MSC too large.
 VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room.
 This has some limitations, especially if you try to broaden the approach to multiparty VoIP calls:
 
- * VoIP signalling can generate a lot of events as candidates are incrementally discovered, and for rapid call setup these need to be relayed as rapidly as possible.
-   * Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
-   * Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
- * VoIP signalling leaks IP addresses.  There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
- * Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.
+* VoIP signalling can generate a lot of events as candidates are incrementally
+  discovered, and for rapid call setup these need to be relayed as rapidly as
+  possible.
+  * Putting these into the room timeline means that if the client has a gappy
+    sync, for VoIP to be reliable it will need to go back and fill in the gap
+    before it can process any VoIP events, slowing things down badly.
+  * Timeline events are (currently) subject to harsh rate limiting, as they are
+    assumed to be a spam vector.
+* VoIP signalling leaks IP addresses.  There is no reason to keep these around
+  for posterity, and they should only be exposed to the devices which care about
+  them.
+* Candidates are ephemeral data, and there is no reason to keep them around for
+  posterity - they're just clogging up the DAG.
 
 Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.
 
@@ -27,15 +35,17 @@ the early flawed sketch at
 
 This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.
 
-Diagramatically, this looks like:
+Diagrammatically, this looks like:
 
 1:1:
-```
+
+```diagram
           A -------- B
 ```
 
 Full mesh between clients
-```
+
+```diagram
           A -------- B
            \       /
             \     /
@@ -45,7 +55,8 @@ Full mesh between clients
 ```
 
 SFU (aka Focus):
-```
+
+```diagram
           A __    __ B
               \  /   
                F 
@@ -56,7 +67,8 @@ Where F is an SFU focus
 ```
 
 Cascaded decentralised SFU:
-```
+
+```diagram
      A1 --.           .-- B1
      A2 ---Fa ----- Fb--- B2
            \       /
@@ -74,14 +86,28 @@ Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.
 
 The user who wants to initiate a call sends a `m.call` state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using `m.terminated`). Its body has the following fields:
 
- * `m.intent` to describe the intended UX for handling the call.  One of:
-     * `m.ring` if the call is meant to cause the room participants devices to ring (e.g. 1:1 call or group call)
-     * `m.prompt` is the call should be presented as a conference call which users in the room are prompted to connect to
-     * `m.room` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
- * `m.type` to say whether the initial type of call is voice only (`m.voice`) or video (`m.video`).  This signals the intent of the user when placing the call to the participants (i.e. "i want to have a voice call with you" or "i want to have a video call with you") and warns the receiver whether they may be expected to view video or not, and provide suitable initial UX for displaying that type of call... even if it later gets upgraded to a video call.
- * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).  
- * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
- * The State key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` is mutable).  If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.
+* `m.intent` to describe the intended UX for handling the call.  One of:
+  * `m.ring` if the call is meant to cause the room participants' devices to
+    ring (e.g. 1:1 call or group call)
+  * `m.prompt` if the call should be presented as a conference call which users
+    in the room are prompted to connect to
+  * `m.room` if the call should be presented as a voice/video channel in which
+    the user is immediately immersed on selecting the room.
+* `m.type` to say whether the initial type of call is voice only (`m.voice`) or
+  video (`m.video`).  This signals the intent of the user when placing the call
+  to the participants (i.e. "I want to have a voice call with you" or "I want to
+  have a video call with you"), warns the receiver whether they may be expected
+  to view video, and lets clients provide suitable initial UX for displaying
+  that type of call, even if it later gets upgraded to a video call.
+* `m.terminated` if this event indicates that the call in question has finished,
+  including the reason why. (A voice/video room will never terminate.) (do we
+  need a duration, or can we figure that out from the previous state event?).  
+* `m.name` as an optional human-visible label for the call (e.g. "Conference
+  call").
+* The state key is a unique ID for that call. (We can't use the event ID, given
+  that `m.type` and `m.terminated` are mutable.) If there are multiple
+  non-terminated conf ID state events in the room, the client should display the
+  most recently edited event.
 
 For instance:
 
@@ -105,13 +131,30 @@ Users who want to participate in the call declare this by publishing a `m.call.m
 
 The fields within the item in the `m.calls` contents are:
 
- * `m.call_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match an unterminated `m.call` event, it should be ignored.
- * `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
-   * `device_id` - The device id to use for to-device messages when establishing a call
-   * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
-   * `expires_ts` - A POSIX timestamp in milliseconds describing when this device data should be considered stale. When updating their own device state, clients should choose a reasonable value for `expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp. A device must be ignored if the `expires_ts` field indicates it has expired, or if the user's `m.room.member` event's membership field is not `join`.
-   * `feeds` - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
-     * `purpose` - Either `m.usermedia` or `m.screenshare` otherwise the feed should be ignored.
+* `m.call_id` - the ID of the conference the user is claiming to participate in.
+  If this doesn't match an unterminated `m.call` event, it should be ignored.
+* `m.devices` - The list of the member's active devices in the call. A member
+  may join from one or more devices at a time, but they may not have two active
+  sessions from the same device. Each device contains the following properties:
+  * `device_id` - The device ID to use for to-device messages when establishing
+    a call.
+  * `session_id` - A unique identifier used for resolving duplicate sessions
+    from a given device. When an incoming `m.call.member` event carries a new
+    `session_id` for a device, any existing calls from that device in this call
+    should be terminated. `session_id` should be generated once per client
+    session on application load.
+  * `expires_ts` - A POSIX timestamp in milliseconds describing when this device
+    data should be considered stale. When updating their own device state,
+    clients should choose a reasonable value for `expires_ts` in case they go
+    offline unexpectedly. If the user stays connected for longer than this time,
+    the client must actively update the state event with a new expiration
+    timestamp. A device must be ignored if the `expires_ts` field indicates it
+    has expired, or if the user's `m.room.member` event's membership field is
+    not `join`.
+  * `feeds` - An array of feeds that the member is sharing and that the opponent
+    member may reference when setting up their WebRTC connection.
+    * `purpose` - Either `m.usermedia` or `m.screenshare`; otherwise the feed
+      should be ignored.
 
 For instance:
 
@@ -187,11 +230,9 @@ For instance:
 
 This builds on [MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077), which describes streams in `m.call.*` events via a `sdp_stream_metadata` field.
 
-** TODO: Do we need all of this data? Why would we need it? **
-** TODO: This doesn't follow the MSC3077 format very well - can we do something
-about that? **
-** TODO: Add tracks field **
-** TODO: Add bitrate/format fields **
+**TODO: Do we need all of this data? Why would we need it?** **TODO: This
+doesn't follow the MSC3077 format very well - can we do something about that?**
+**TODO: Add tracks field** **TODO: Add bitrate/format fields**
 
 Clients should do their best to ensure that calls in `m.call.member` state are removed when the member leaves the call. However, there will be cases where the device loses network connectivity, power, the application is forced closed, or it crashes. If the `m.call.member` state has stale device data the call setup will fail. Clients should re-attempt invites up to 3 times before giving up on calling a member.
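
The staleness rules for `m.devices` described above could be applied on the receiving side roughly as follows. This is only a sketch: the interface names, `activeDevices`, and `sessionChanged` are hypothetical helpers rather than part of this proposal, and treating `expires_ts <= now` as expired is an assumption, since the text does not define the boundary case.

```typescript
// Hypothetical shapes mirroring the fields listed above; not a normative schema.
interface CallFeed {
  purpose: string;
}

interface CallDevice {
  device_id: string;
  session_id: string;
  expires_ts: number; // POSIX timestamp in milliseconds
  feeds: CallFeed[];
}

interface CallMembership {
  "m.call_id": string;
  "m.devices": CallDevice[];
}

const KNOWN_PURPOSES = new Set(["m.usermedia", "m.screenshare"]);

// Returns the devices that should still be treated as participating.
// Devices are dropped when the user is not joined to the room or when
// expires_ts has passed (assumption: expires_ts <= nowMs counts as expired).
// Feeds with an unrecognised purpose are dropped as well.
function activeDevices(
  call: CallMembership,
  roomMembership: string, // membership field of the user's m.room.member event
  nowMs: number,
): CallDevice[] {
  if (roomMembership !== "join") {
    return [];
  }
  return call["m.devices"]
    .filter((device) => device.expires_ts > nowMs)
    .map((device) => ({
      ...device,
      feeds: device.feeds.filter((feed) => KNOWN_PURPOSES.has(feed.purpose)),
    }));
}

// True when the same device reports a different session_id, in which case
// any existing calls from that device in this call should be terminated.
function sessionChanged(
  previous: CallDevice | undefined,
  incoming: CallDevice,
): boolean {
  return (
    previous !== undefined &&
    previous.device_id === incoming.device_id &&
    previous.session_id !== incoming.session_id
  );
}
```

A client might run `activeDevices` each time an `m.call.member` state event arrives, and call `sessionChanged` against the device entry it previously stored before sending new invites.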