From 36b1b39b008d87bcec8ddb3d3392ccb20e6e6175 Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 15:35:59 +0100 Subject: [PATCH 1/7] Write about the chain cover a little. --- docs/auth_chain_difference_algorithm.md | 40 ++++++++++++++++++++----- 1 file changed, 33 insertions(+), 7 deletions(-) diff --git a/docs/auth_chain_difference_algorithm.md b/docs/auth_chain_difference_algorithm.md index 30f72a70dae1..7cd49b0b40ba 100644 --- a/docs/auth_chain_difference_algorithm.md +++ b/docs/auth_chain_difference_algorithm.md @@ -34,13 +34,38 @@ the process of indexing it). ## Chain Cover Index Synapse computes auth chain differences by pre-computing a "chain cover" index -for the auth chain in a room, allowing efficient reachability queries like "is -event A in the auth chain of event B". This is done by assigning every event a -*chain ID* and *sequence number* (e.g. `(5,3)`), and having a map of *links* -between chains (e.g. `(5,3) -> (2,4)`) such that A is reachable by B (i.e. `A` +for the auth chain in a room, allowing us to efficiently make reachability queries +like "is event `A` in the auth chain of event `B`?". We could do this with an index +that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However this +would be prohibitively large, scaling poorly as the room accumulates more state +events. + +Instead, we break down the graph into *chains*. A chain is a subset of a DAG +with the following property: for any pair of events `E` and `F` in the chain, +the chain contains a path `E -> F` or a path `F -> E`. If we ensure that each +persisted event belongs to exactly one chain, we can keep answer reachability +queries by tracking of how the chains are connected to one another. Doing so +uses less storage than tracking this on an event-by-event basis, particularly +when we have fewer and longer chains. See + +> Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944). 
+> *ACM Transactions on Database Systems (TODS)*, 15*(4)*, 558-598. + +for the original idea or + +> Y. Chen, Y. Chen, [An efficient algorithm for answering graph +> reachability queries](https://doi.org/10.1109/ICDE.2008.4497498), +> in: 2008 IEEE 24th International Conference on Data Engineering, April 2008, +> pp. 893–902. (PDF available via [Google Scholar](https://scholar.google.com/scholar?q=Y.%20Chen,%20Y.%20Chen,%20An%20efficient%20algorithm%20for%20answering%20graph%20reachability%20queries,%20in:%202008%20IEEE%2024th%20International%20Conference%20on%20Data%20Engineering,%20April%202008,%20pp.%20893902.).) + +for a more modern take. + +In practical terms, the chain cover assigns every event a +*chain ID* and *sequence number* (e.g. `(5,3)`), and maintains a map of *links* +between chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable by `B` (i.e. `A` is in the auth chain of `B`) if and only if either: -1. A and B have the same chain ID and `A`'s sequence number is less than `B`'s +1. `A` and `B` have the same chain ID and `A`'s sequence number is less than `B`'s sequence number; or 2. there is a link `L` between `B`'s chain ID and `A`'s chain ID such that `L.start_seq_no` <= `B.seq_no` and `A.seq_no` <= `L.end_seq_no`. @@ -49,8 +74,9 @@ There are actually two potential implementations, one where we store links from each chain to every other reachable chain (the transitive closure of the links graph), and one where we remove redundant links (the transitive reduction of the links graph) e.g. if we have chains `C3 -> C2 -> C1` then the link `C3 -> C1` -would not be stored. Synapse uses the former implementations so that it doesn't -need to recurse to test reachability between chains. +would not be stored. Synapse uses the former implementation so that it doesn't +need to recurse to test reachability between chains. This trade-offs extra storage +in order to save CPU cycles and DB queries. 
### Example From 238d9b114052f196ea220ce4ed3ba54366ae60a9 Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 15:38:15 +0100 Subject: [PATCH 2/7] Changelog --- changelog.d/13602.doc | 1 + 1 file changed, 1 insertion(+) create mode 100644 changelog.d/13602.doc diff --git a/changelog.d/13602.doc b/changelog.d/13602.doc new file mode 100644 index 000000000000..dbba08216321 --- /dev/null +++ b/changelog.d/13602.doc @@ -0,0 +1 @@ +Improve the description of the ["chain cover index"](https://matrix-org.github.io/synapse/latest/auth_chain_difference_algorithm.html) used internally by Synapse. From 40cfdcd95a6544aae04f59ba20b9080da7f04c5b Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 16:13:31 +0100 Subject: [PATCH 3/7] Batch of suggestions from review, thanks Sean Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> --- docs/auth_chain_difference_algorithm.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/auth_chain_difference_algorithm.md b/docs/auth_chain_difference_algorithm.md index 7cd49b0b40ba..a27f8a793c49 100644 --- a/docs/auth_chain_difference_algorithm.md +++ b/docs/auth_chain_difference_algorithm.md @@ -42,9 +42,10 @@ events. Instead, we break down the graph into *chains*. A chain is a subset of a DAG with the following property: for any pair of events `E` and `F` in the chain, -the chain contains a path `E -> F` or a path `F -> E`. If we ensure that each -persisted event belongs to exactly one chain, we can keep answer reachability -queries by tracking of how the chains are connected to one another. Doing so +the chain contains a path `E -> F` or a path `F -> E`. Synapse ensures that each +persisted event belongs to exactly one chain, and tracks how the chains are +connected to one another. This allows us to efficiently answer reachability +queries. 
Doing so uses less storage than tracking this on an event-by-event basis, particularly when we have fewer and longer chains. See @@ -62,8 +63,8 @@ for a more modern take. In practical terms, the chain cover assigns every event a *chain ID* and *sequence number* (e.g. `(5,3)`), and maintains a map of *links* -between chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable by `B` (i.e. `A` -is in the auth chain of `B`) if and only if either: +between events in chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable by `B` +(i.e. `A` is in the auth chain of `B`) if and only if either: 1. `A` and `B` have the same chain ID and `A`'s sequence number is less than `B`'s sequence number; or @@ -75,7 +76,7 @@ each chain to every other reachable chain (the transitive closure of the links graph), and one where we remove redundant links (the transitive reduction of the links graph) e.g. if we have chains `C3 -> C2 -> C1` then the link `C3 -> C1` would not be stored. Synapse uses the former implementation so that it doesn't -need to recurse to test reachability between chains. This trade-offs extra storage +need to recurse to test reachability between chains. This trades-off extra storage in order to save CPU cycles and DB queries. ### Example From 1261cec2ec7f7922c7bc650fe055fd121d88ad39 Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 16:21:27 +0100 Subject: [PATCH 4/7] Rewrap Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> --- docs/auth_chain_difference_algorithm.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/auth_chain_difference_algorithm.md b/docs/auth_chain_difference_algorithm.md index a27f8a793c49..201524115cf9 100644 --- a/docs/auth_chain_difference_algorithm.md +++ b/docs/auth_chain_difference_algorithm.md @@ -45,9 +45,8 @@ with the following property: for any pair of events `E` and `F` in the chain, the chain contains a path `E -> F` or a path `F -> E`. 
Synapse ensures that each persisted event belongs to exactly one chain, and tracks how the chains are connected to one another. This allows us to efficiently answer reachability -queries. Doing so -uses less storage than tracking this on an event-by-event basis, particularly -when we have fewer and longer chains. See +queries. Doing so uses less storage than tracking this on an event-by-event +basis, particularly when we have fewer and longer chains. See > Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944). > *ACM Transactions on Database Systems (TODS)*, 15*(4)*, 558-598. From 9b181bf70afd7c18c15fea1a02e0edf51cd740a7 Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 16:34:52 +0100 Subject: [PATCH 5/7] Expand on sequence numbers a little --- docs/auth_chain_difference_algorithm.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/docs/auth_chain_difference_algorithm.md b/docs/auth_chain_difference_algorithm.md index 201524115cf9..1054cdd3d6fa 100644 --- a/docs/auth_chain_difference_algorithm.md +++ b/docs/auth_chain_difference_algorithm.md @@ -36,17 +36,23 @@ the process of indexing it). Synapse computes auth chain differences by pre-computing a "chain cover" index for the auth chain in a room, allowing us to efficiently make reachability queries like "is event `A` in the auth chain of event `B`?". We could do this with an index -that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However this +that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However, this would be prohibitively large, scaling poorly as the room accumulates more state events. Instead, we break down the graph into *chains*. A chain is a subset of a DAG with the following property: for any pair of events `E` and `F` in the chain, -the chain contains a path `E -> F` or a path `F -> E`. 
Synapse ensures that each -persisted event belongs to exactly one chain, and tracks how the chains are -connected to one another. This allows us to efficiently answer reachability -queries. Doing so uses less storage than tracking this on an event-by-event -basis, particularly when we have fewer and longer chains. See +the chain contains a path `E -> F` or a path `F -> E`. Each event in the chain +is given a *sequence number* local to that chain. The oldest event `E` in the +chain has sequence number 1. If `E` has a child in the chain, the child has +sequence number 2; if `E` has a grandchild, the grandchild has sequence number +3; and so on. + +Synapse ensures that each persisted event belongs to exactly one chain, and +tracks how the chains are connected to one another. This allows us to +efficiently answer reachability queries. Doing so uses less storage than +tracking reachability on an event-by-event basis, particularly when we have +fewer and longer chains. See > Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944). > *ACM Transactions on Database Systems (TODS)*, 15*(4)*, 558-598. From 2b1216ac4f201340ac1f8cb45d07b6f76dad2e84 Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 17:00:48 +0100 Subject: [PATCH 6/7] One last tweak --- docs/auth_chain_difference_algorithm.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/auth_chain_difference_algorithm.md b/docs/auth_chain_difference_algorithm.md index 1054cdd3d6fa..93de4e81fbed 100644 --- a/docs/auth_chain_difference_algorithm.md +++ b/docs/auth_chain_difference_algorithm.md @@ -42,11 +42,12 @@ events. Instead, we break down the graph into *chains*. A chain is a subset of a DAG with the following property: for any pair of events `E` and `F` in the chain, -the chain contains a path `E -> F` or a path `F -> E`. Each event in the chain +the chain contains a path `E -> F` or a path `F -> E`. 
This forces a chain to be +linear (without forks) e.g. `E -> F -> G -> ... -> H`. Each event in the chain is given a *sequence number* local to that chain. The oldest event `E` in the -chain has sequence number 1. If `E` has a child in the chain, the child has -sequence number 2; if `E` has a grandchild, the grandchild has sequence number -3; and so on. +chain has sequence number 1. If `E` has a child `F` in the chain, then `F` has +sequence number 2. If `E` has a grandchild `G` in the chain, then `G` has +sequence number 3; and so on. Synapse ensures that each persisted event belongs to exactly one chain, and tracks how the chains are connected to one another. This allows us to From 3bc9c3d14f66d1051a088d7c825a780b984e8480 Mon Sep 17 00:00:00 2001 From: David Robertson Date: Tue, 23 Aug 2022 17:46:39 +0100 Subject: [PATCH 7/7] Update docs/auth_chain_difference_algorithm.md Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> --- docs/auth_chain_difference_algorithm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/auth_chain_difference_algorithm.md b/docs/auth_chain_difference_algorithm.md index 93de4e81fbed..ebc9de25b880 100644 --- a/docs/auth_chain_difference_algorithm.md +++ b/docs/auth_chain_difference_algorithm.md @@ -43,7 +43,7 @@ events. Instead, we break down the graph into *chains*. A chain is a subset of a DAG with the following property: for any pair of events `E` and `F` in the chain, the chain contains a path `E -> F` or a path `F -> E`. This forces a chain to be -linear (without forks) e.g. `E -> F -> G -> ... -> H`. Each event in the chain +linear (without forks), e.g. `E -> F -> G -> ... -> H`. Each event in the chain is given a *sequence number* local to that chain. The oldest event `E` in the chain has sequence number 1. If `E` has a child `F` in the chain, then `F` has sequence number 2. If `E` has a grandchild `G` in the chain, then `G` has
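
The two reachability conditions described in the patched documentation can be illustrated with a minimal sketch. This is not Synapse's code or schema: `Position`, `Link`, `Links` and `is_in_auth_chain` are hypothetical names chosen for illustration, and the links map is an in-memory dict standing in for whatever storage is actually used. It assumes the transitive closure of the links graph is stored, as the text says Synapse does, so a single lookup suffices and no recursion between chains is needed.

```python
from typing import Dict, List, NamedTuple, Tuple


class Position(NamedTuple):
    """An event's place in the chain cover: (chain ID, sequence number)."""

    chain_id: int
    seq_no: int


class Link(NamedTuple):
    """A link from seq `start_seq_no` in one chain to seq `end_seq_no` in another."""

    start_seq_no: int
    end_seq_no: int


# Hypothetical stand-in for the links map, keyed by
# (origin chain ID, target chain ID).
Links = Dict[Tuple[int, int], List[Link]]


def is_in_auth_chain(a: Position, b: Position, links: Links) -> bool:
    """Return True if event `a` is in the auth chain of event `b`."""
    # Condition 1: same chain, and `a` comes strictly earlier than `b`.
    same_chain = a.chain_id == b.chain_id and a.seq_no < b.seq_no

    # Condition 2: some link from b's chain to a's chain starts at or
    # before `b` and ends at or after `a`.
    via_link = any(
        link.start_seq_no <= b.seq_no and a.seq_no <= link.end_seq_no
        for link in links.get((b.chain_id, a.chain_id), [])
    )
    return same_chain or via_link


# The documentation's example link `(5,3) -> (2,4)`.
example_links: Links = {(5, 2): [Link(start_seq_no=3, end_seq_no=4)]}
```

With `example_links`, event `(2,4)` is in the auth chain of `(5,3)` and of any later event in chain 5, while `(2,5)` is not reachable through this link because its sequence number exceeds the link's end.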