Pick a unit for describing string length #1001

turt2live · 2022-03-24T19:40:27Z

We talk about unicode codepoints, bytes, characters, and just the word "length" to describe how long a string is allowed to be. This is confusing at best, so ideally we pick a unit and stick to it.

Some areas may require an MSC to make uniform.

An example of where this is a problem is state keys: They're limited to 255 bytes but user IDs are limited to 255 characters, so by extension 1 character == 1 byte because otherwise membership wouldn't work.

turt2live · 2022-08-28T08:20:22Z

It appears that synapse is actually using codepoints rather than bytes for user IDs, so it's already possible to create users with >255 bytes to them.

This particular piece would be a spec clarification, I think.

deepbluev7 · 2022-08-28T08:22:57Z

My client hardcoded the byte limit, since that is what I found in the client side spec, when I asked for it.

deepbluev7 · 2022-08-28T08:28:02Z

And the database used in some clients is limited 255 UTF-16 code units too: https:/hivedb/hive/blob/master/hive/lib/src/binary/frame.dart#L55

tulir · 2022-08-28T08:31:46Z

Synapse also uses python str len() (i.e. unicode codepoints) for measuring room IDs, state keys and event types: https:/matrix-org/synapse/blob/v1.65.0/synapse/event_auth.py#L327-L339

PDUs are not affected and are actually limited to 65536 bytes, thanks to encode_canonical_json encoding the result as utf-8 and returning bytes instead of str.

Room aliases can also be created with lots of bytes, but I'm not sure if those are actually limited at all (it's not a part of event auth, so it might even allow more than 255 codepoints).

deepbluev7 · 2022-08-28T08:38:19Z

so it's already possible to create users with >255 bytes to them.

Only if you patch your Synapse. Unicode characters are not allowed in mxids by default, iirc.

…epoints rather than bytes (matrix-org/matrix-spec#1001)

…epoints rather than bytes (update to matrix-org/gomatrixserverlib@5f66df0, matrix-org/matrix-spec#1001)

…es rather than codepoints (matrix-org/matrix-spec#1001)

In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec. fixes matrix-org#1001 Signed-off-by: Nicolas Werner <[email protected]>

reivilibre · 2022-09-09T11:41:44Z

On the subject of user_id/sender_id, room_id, state_key, type, event_id:

In favour of bytes:

Client Size limits mentions bytes.
- Changing the definition for clients is likely to break clients and it will be difficult to get them all fixed.
  - Strict limits are especially important in Embedded (and/or 'IoT') use cases.
Bytes are a less ambiguous term than 'characters'.
https://spec.matrix.org/v1.3/appendices/#user-identifiers mentions '255 characters' for user ID, but also imposes the restriction of using ASCII, even in the historical User ID section
The number 255 is most sensible in units of bytes, especially for use in clients' storage engines. Database systems don't usually count storage size in codepoints.
- LMDB, a widely-known embedded database that you may choose to use in your client, has a key size limit of 511 bytes. This means that 255 codepoints can't generally be stored in the key of LMDB; it's an easy footgun since a client may not test such awkward cases. Bytes are simpler here.
'Counting' bytes is easier than counting Unicode codepoints — you don't have to scan through UTF-8
Until recently, Dendrite and Conduit applied these limits in bytes.

Against bytes:

Synapse's behaviour was speculated to have changed when Python 3 started to be used, but it appears that this isn't true — json.decode returned Python 2 u-strings, whose len counts in codepoints.
- In effect, it seems like it has historically been counted in codepoints rather than bytes.
- (Tested back to Synapse 0.21.0/May 2017 by Nyaaori)
- Edit: as of Fix event size checks synapse#13710, it now uses bytes for most checks.
It may be possible to imagine use cases for state keys where having a different length limit depending on whether you use multi-byte 'characters' is awkward for users (but what's a character? Codepoint? Grapheme? I think it would be awkward anyway)

Remarks:

2 rooms have been created which make homeservers diverge in their behaviour and appear 'defederated' from each other, however this was intentionally caused by room admins.
It would be good to make sure we don't write any new parts of spec with such loose terminology in the future; especially noting that 'character' is quite widely considered a meaningless or ambiguous term.
I think there should be an 'Event Specification' part of the spec rather than defining events in the CS-API or SS-API, especially since we may end up with conflicting definitions. The current split makes it hard to find definitions and easy for things to fall out of sync.

FSG-Cat · 2022-09-11T12:27:36Z

Please note that this is a highly theoretical thing that i never got a clear answer on when we talked about it so im just putting it here to make sure we dont forget it.

There might be the possibility of for Synapse atleast causing a room to be Bricked if we do merge a fix like matrix-org/synapse#13710 because of one simple theoretical attack that i will mention here because of a simple fact. All the details are public and this should be a non issue if we just accept this has to be a room version level thing like String PL. All the 3 implementations that support v10 already supports this behavior and i susspect conduit also supports it but cant state that as fact.

You can attack Synapse theoretically by pushing a join event for a overflow user and having another Synapse issue a Control event in close proximity to that overflow join. Because Synapse would fail the correct events because of the invalid parent. Dendrite as far as i understand will chug along happily and not break in this failure mode.

Going with the Retroactive fixing path literally opens up an attack on rooms if i am correct and i hope i am wrong. I very much hope so. The attack is probably avoidable if you push your control event onto a state branch that is not the branch that has the in the future possibily invalid event in it.

Also i want to add to this that there are concerns that the databases of Catalyst servers in these rooms will be harmed irreparbly as stated by Nyaaori the Catalyst Dev. I cant say if this also applies for Conduit but it would not shock me.

As a last point i want to add that Nyaaoris research into when this problem came about has found it literally always existed or it was created literally in the begining days of Synapse.

Edit: Clarification. The attack is not technically in need of multiple homeservers its just as an example if we assume that your under attack and your own homeserver is not going to send overflow events. But yes the attack can be executed using a single homeserver if the attack actually is possible.

FSG-Cat · 2022-09-11T13:12:48Z

Im putting this separately as i dont think its suitable to put this in an edit.

Merging a fix to this issue that is not v11 fixes this would set a precedent where Spec clarifications are allowed to Brick rooms. Nyaaoris testing seems to imply that literally every single synapse version shares this flaw. String PL didnt brick rooms by being merged it unbricked rooms for servers who updated to be compliant. But this case we are instead asking to brick rooms that contain events accepted by all Synapse versions as valid. And in the string PLs case the same argument of only a limited amount of rooms affected could be made but we didnt brick those rooms that time.

deepbluev7 · 2022-09-11T20:40:30Z

How would this brick a room? To brick a room, it would need to affect event authorization. And from what I can tell, this only bricks rooms where a server was intentionally patched to remove client side limits and admin in that room and intentionally abused this. Which is very far in the fuck around and find out territory. In the worst case it makes a few events vanish in rooms for unpatched servers.

But let's look deeper into what this could affect. To "brick" a room, such an event would need to be in the auth chain.

So what events are in the auth chain? m.room.create, m.room.power_levels and m.room.member (of the sender of some events). The only way to pull in a member event of a different member B into the auth chain of user A, is when an invite is sent or when a restricted join is happening through that server. To then break a room, member A needs to send a powerlevel event. Any history after that point will be reset, after this is fixed in a server.

Just a different server sending a control event will not break the room. Yes, it will make the prev events for that event invalid, but prev events don't matter for the auth chain.

I don't think it is bad to brick rooms, that intentionally had admins sending spec violating events.

If this isn't fixed, then some clients won't be able to interact with those rooms for quite a while, while currently the count of broken rooms with this fixed (in the one server implementation implementing this wrong for federation side checks) it so far only breaks rooms that intentionally messed around with this.

On a personal note, I do find it quite challenging how many people advocate for keeping broken behaviour in the spec and increasing its complexity to not break some theoretical rooms, that don't seem to exist in practice, while this bug completely broke at least one client when abused, because it relied on what the spec said and defederated rooms (that were intentionally using broken users) between 3 spec following implementations and synapse on the other hand. Also, the spec says nowhere, that state keys of this size are allowed. The only place where one could argue this is the case, is the identifier grammar, which says 255 characters, but also restricts those characters to ASCII only, so this is still equal to 255 bytes. Sure, there is the historical id section, but that is to grandfather existing users in from old Synapse instances, that never allowed emojis in usernames.

…epoints rather than bytes (update to matrix-org/gomatrixserverlib@5f66df0, matrix-org/matrix-spec#1001)

FSG-Cat · 2022-09-17T18:32:35Z

Just wanted to provide an update on my own research on this topic so its documented here in case its not already.

State keys and State event types that are made up of 242 🐈 emojies are allowed to be sent by a stock Synapse tested via feline.support that is deployed via Ansible playbook found at https:/spantaleev/matrix-docker-ansible-deploy. This playbook as of time of writing deploys the official matrix.org docker image for Synapse 1.67.0. Client utilised in the tests todays copy of Element Nightly Desktop. So the copy from 2022-09-17.

I tested sending a state event with the type of 242 🐈 emojies and with the same as the key. And also regular event of the same type. State event was sent in a room created using the standard template for public rooms by me a 100 PL user. I replicated the sending of the non state event version in multiple other rooms during my recent testing without issue and in rooms where i am at default PL.

This comment is just intended to document that current synapse does not verify state keys as not containing emoji or event types not containing them at all. Or if this check is done having the homeserver admin flag nullifies it.

erikjohnston · 2022-10-20T09:13:22Z

In the interest of moving this forwards, I think we should land this as bytes. Reading through the comments it feels like the main reasons for not using bytes is for backwards compatibility, and in my mind that is not a compelling enough reason to keep that behaviour in the long term.

FWIW my view is that using anything other than bytes is fraught with difficulty, e.g. limiting to 255 unicode code points doesn't restrict the length of the rendered string, and many languages don't have in built ways of determining the "true" length of a unicode string.

I also don't think its worth keeping the current behaviour for existing rooms, given the failure mode isn't truly terrible and there are very few reported incidents about this.

cc. @neilalexander & @timokoesters as other HS developers that may want to chime in. Unless I hear objections soon then the Synapse team is going to merge matrix-org/synapse#13710

neilalexander · 2022-10-21T07:46:04Z

I'm happy to go with bytes — matrix-org/gomatrixserverlib#338 is ready and waiting.

jplatte · 2022-10-21T08:00:23Z

Ruma and by extension Conduit is already enforcing 255 bytes. It took a long time until we even learned Synapse did something else, and since it became clear that it's probably feasible for Synapse to do a small breaking change there, I've been pushing back against one person's infrequent asks for Synapse compatibility on this.

@reivilibre

…es rather than codepoints (#338) This effectively reverts the change made in [5f66df0](5f66df0) to bytes instead of codepoints, since Synapse will now enforce the same after matrix-org/synapse#13710. History here is that Synapse originally calculated bytes in Python 2.x, started counting codepoints in Python 3.x pretty much by accident and then the spec was ambiguous after the fact (hence matrix-org/matrix-spec#1001). Rationale is that bytes are probably easier for implementations to manage and less likely to generate huge indexes for client-side databases (especially where limits might exist like LMDB). cc @reivilibre

richvdh · 2024-05-24T09:30:27Z

In the interest of moving this forwards, I think we should land this as bytes

Here we are two years later, having failed to move this forward in any material way :(.

I don't see it mentioned above, but https://spec.matrix.org/v1.10/client-server-api/#size-limits already specfies that the limits for a bunch of identifiers are bytes. IMHO this information should be moved to the appendices.

…larify that the length is to be measured in bytes Fixes: matrix-org#1826 Relates to: matrix-org#1001 Signed-off-by: Johannes Marbach <[email protected]>

Johennes · 2024-06-07T19:30:40Z

I took a stab at this in #1850.

Are there any places in the spec other than user IDs where a standardization on bytes would be required?

The remaining places that use "characters" (e.g. this or this) appear to also restrict the character set to ASCII (or even less than that). So I suppose the term "character" should be unambiguous there?

…larify that the length is to be measured in bytes (#1850) Fixes: #1826 Relates to: #1001 Signed-off-by: Johannes Marbach <[email protected]>

richvdh · 2024-06-12T08:14:02Z

I think this is fixed by #1850 - thanks @Johennes.

There might be other places in the spec that need clarification, but I can't immediately see any. I suggest if anyone knows of any, they open a new issue with a more precise description

turt2live added the clarification An area where the expected behaviour is understood, but the spec could do with being more explicit label Mar 24, 2022

neilalexander added a commit to matrix-org/gomatrixserverlib that referenced this issue Aug 28, 2022

Enforce type, sender, state_key and room_id lengths using cod…

5f66df0

…epoints rather than bytes (matrix-org/matrix-spec#1001)

neilalexander added a commit to matrix-org/dendrite that referenced this issue Aug 28, 2022

Enforce type, sender, state_key and room_id lengths using cod…

a3eb4e5

…epoints rather than bytes (update to matrix-org/gomatrixserverlib@5f66df0, matrix-org/matrix-spec#1001)

deepbluev7 mentioned this issue Sep 4, 2022

Fix event size checks matrix-org/synapse#13710

Merged

4 tasks

neilalexander added a commit to matrix-org/gomatrixserverlib that referenced this issue Sep 9, 2022

Enforce type, sender, state_key and room_id lengths using byt…

c294a99

…es rather than codepoints (matrix-org/matrix-spec#1001)

neilalexander mentioned this issue Sep 9, 2022

Enforce type, sender, state_key and room_id lengths using bytes rather than codepoints matrix-org/gomatrixserverlib#338

Merged

deepbluev7 mentioned this issue Sep 9, 2022

Clarify that size limits are in bytes #1234

Closed

turt2live added the blocked Something needs to be done before action can be taken on this PR/issue. label Sep 11, 2022

0x1a8510f2 pushed a commit to 0x1a8510f2/dendrite that referenced this issue Sep 12, 2022

Enforce type, sender, state_key and room_id lengths using cod…

0fd37aa

…epoints rather than bytes (update to matrix-org/gomatrixserverlib@5f66df0, matrix-org/matrix-spec#1001)

ShadowJonathan mentioned this issue Oct 21, 2022

Test unit length compliance (255 bytes vs unicode points) matrix-org/complement#531

Open

DMRobertson mentioned this issue Aug 29, 2023

Add test for state_key size matrix-org/complement#642

Open

richvdh mentioned this issue Feb 27, 2024

Better specify how to validate events #365

Open

richvdh removed the blocked Something needs to be done before action can be taken on this PR/issue. label Feb 27, 2024

richvdh mentioned this issue May 23, 2024

Clarify Room ID requirements #1826

Closed

Johennes mentioned this issue Jun 7, 2024

Move size limits for user, room and event IDs into the appendix and clarify that the length is to be measured in bytes #1850

Merged

3 tasks

richvdh pushed a commit that referenced this issue Jun 12, 2024

Move size limits for user, room and event IDs into the appendix and c…

7916032

…larify that the length is to be measured in bytes (#1850) Fixes: #1826 Relates to: #1001 Signed-off-by: Johannes Marbach <[email protected]>

richvdh closed this as completed Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pick a unit for describing string length #1001

Pick a unit for describing string length #1001

turt2live commented Mar 24, 2022

turt2live commented Aug 28, 2022

deepbluev7 commented Aug 28, 2022

deepbluev7 commented Aug 28, 2022

tulir commented Aug 28, 2022

deepbluev7 commented Aug 28, 2022

reivilibre commented Sep 9, 2022 •

edited by richvdh

Loading

FSG-Cat commented Sep 11, 2022 •

edited

Loading

FSG-Cat commented Sep 11, 2022 •

edited

Loading

deepbluev7 commented Sep 11, 2022

FSG-Cat commented Sep 17, 2022

erikjohnston commented Oct 20, 2022

neilalexander commented Oct 21, 2022

jplatte commented Oct 21, 2022

richvdh commented May 24, 2024

Johennes commented Jun 7, 2024

richvdh commented Jun 12, 2024

Pick a unit for describing string length #1001

Pick a unit for describing string length #1001

Comments

turt2live commented Mar 24, 2022

turt2live commented Aug 28, 2022

deepbluev7 commented Aug 28, 2022

deepbluev7 commented Aug 28, 2022

tulir commented Aug 28, 2022

deepbluev7 commented Aug 28, 2022

reivilibre commented Sep 9, 2022 • edited by richvdh Loading

FSG-Cat commented Sep 11, 2022 • edited Loading

FSG-Cat commented Sep 11, 2022 • edited Loading

deepbluev7 commented Sep 11, 2022

FSG-Cat commented Sep 17, 2022

erikjohnston commented Oct 20, 2022

neilalexander commented Oct 21, 2022

jplatte commented Oct 21, 2022

richvdh commented May 24, 2024

Johennes commented Jun 7, 2024

richvdh commented Jun 12, 2024

reivilibre commented Sep 9, 2022 •

edited by richvdh

Loading

FSG-Cat commented Sep 11, 2022 •

edited

Loading

FSG-Cat commented Sep 11, 2022 •

edited

Loading