
Reforming dynamic lists in SSZ #1160

Closed · vbuterin opened this issue Jun 10, 2019 · 10 comments

Labels: milestone:June 30 freeze 🥶 (Phase 0 spec freeze for long-lived cross-client testnet) · scope:SSZ (Simple Serialize)

Comments

@vbuterin (Contributor) commented Jun 10, 2019

Background reading: #1115

From my experience implementing SSZ partials (https://github.com/ethereum/eth2.0-specs/blob/ssz-impl-rework/test_libs/pyspec/eth2spec/utils/ssz/ssz_partials.py), I've come to the conclusion that the fact that paths do not correspond to specific generalized indices (generalized indices depend on the depth of the tree, which for dynamic-sized lists is itself dynamic) leads to a large amount of complexity in the SSZ partials implementation. `append` and `pop`, and the complexity of rebalancing lists around powers of 2, are particularly problematic.

My proposed solution to this is as follows. From the perspective of SSZ hashing, we remove lists, so the only data types are (i) base types, (ii) fixed-size vectors, (iii) containers (we could also add unions by hashing a Union[A, B... Z] identically to Container(a=Vector[A, 1], b=Vector[B, 1] ... z=Vector[Z, 1])).

All existing lists in the spec (validator list in the state, transaction lists in blocks...) get converted to fixed-size lists of some size (for lists in blocks, we know the maximums already; for the validator set we can pick a very generous limit, eg. 2**40). Hence, from the perspective of hashing, all data types become constant-sized types, and so the generalized index that a given path (eg. state -> state.validator_registry[123514].pubkey) corresponds to is always the same value.
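
For illustration, here is a minimal sketch of that property (hypothetical helper names, not spec code), using the usual convention that the root of the tree has generalized index 1 and the children of node g are 2g and 2g+1:

```python
def leaf_gindex(depth: int, index: int) -> int:
    """Generalized index of leaf `index` in a subtree `depth` levels deep."""
    return (1 << depth) + index

def concat_gindices(outer: int, inner: int) -> int:
    """Follow generalized index `inner` inside the subtree rooted at `outer`."""
    depth = inner.bit_length() - 1
    return outer * (1 << depth) + (inner - (1 << depth))

# With the validator registry fixed at 2**40 entries, the position of
# state.validator_registry[123514] in the tree never changes, so the
# generalized index can be computed once (even at compile time):
gi = leaf_gindex(40, 123514)  # same value for every state
```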

However, to the SSZ vector data type we now add a flag, the "serialization type". To start off, the serialization types are FULL and DYNAMIC; a third, SPARSE, can be added later. A FULL vector is serialized like vectors are today. A DYNAMIC vector is serialized by serializing the items in the vector up to the highest nonzero item, the same way a list is serialized today; this makes DYNAMIC vectors essentially equivalent to current lists in terms of serialization, except that a maximum length must be specified. A SPARSE vector would be serialized by prepending the list of items with an (item, position) table of all the nonzero items. Current vectors would become vectors with the FULL serialization type; current lists would become vectors with the DYNAMIC (or possibly SPARSE) serialization type. If DYNAMIC is used, it may make sense to add a validation rule verifying that no non-empty item appears after any empty item in the list.
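
A rough sketch of the three serialization types, assuming fixed-size elements whose individual serializations are already in hand (the function names, and the uint32 position encoding for SPARSE, are my guesses rather than settled spec):

```python
def serialize_full(elements: list) -> bytes:
    # FULL: every slot is written out, exactly like today's vectors.
    return b"".join(elements)

def serialize_dynamic(elements: list, empty: bytes) -> bytes:
    # DYNAMIC: write elements only up to (and including) the highest nonzero
    # item, like today's lists; the maximum length lives in the type.
    last = max((i for i, e in enumerate(elements) if e != empty), default=-1)
    return b"".join(elements[: last + 1])

def serialize_sparse(elements: list, empty: bytes) -> bytes:
    # SPARSE: a table of the nonzero items with their positions
    # (position encoded here as uint32, purely for illustration).
    return b"".join(
        i.to_bytes(4, "little") + e
        for i, e in enumerate(elements) if e != empty
    )
```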

@dankrad (Contributor) commented Jun 10, 2019

I would like to make an alternative proposal: Replace list with a type List[A, max_length], which is still a dynamic list type, but with maximum length max_length.

  • Merkleization would be like a Vector, but with the length mixed in (in other words: it is the hash tree root of a node whose left child is the root of a static zero-padded vector and whose right child is the length as a uint64)
  • Serialization is just like a List's at the moment

The reason I would propose this is that I think it is preferable for the type to be aware of its current dynamic length. The code for this is much easier to write, reason about, and check statically and dynamically. It is also consistent with the other changes we are making to SSZ (like unions and bitlists), which go in the direction of externalising typing work into SSZ and making it work "naturally", as you would expect from a modern programming language.
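
For concreteness, a minimal sketch of that merkleization, assuming a power-of-2 max_length and SHA-256 as the hash (a real implementation would use precomputed zero-subtree hashes rather than materializing the padding):

```python
from hashlib import sha256

ZERO_CHUNK = b"\x00" * 32

def hash_(a: bytes, b: bytes) -> bytes:
    return sha256(a + b).digest()

def merkleize(chunks: list, limit: int) -> bytes:
    # Zero-pad up to the (power-of-2) limit, then hash pairwise up to the root.
    nodes = chunks + [ZERO_CHUNK] * (limit - len(chunks))
    while len(nodes) > 1:
        nodes = [hash_(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def hash_tree_root_list(chunks: list, length: int, max_length: int) -> bytes:
    # Zero-padded vector root on the left; the length, as a uint64 placed in
    # a little-endian 32-byte chunk, mixed in on the right.
    return hash_(merkleize(chunks, max_length), length.to_bytes(32, "little"))
```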

@vbuterin (Contributor, Author)

> Merkleization would be like a Vector, but with the length mixed in (in other words: it is the hash tree root of a node whose left child is the root of a static zero-padded vector and whose right child is the length as a uint64)

I can see how this can be valuable. But there is one question: do we want the length mixed in as an item at the top, or as an element of the list? Huffman theory would say that mixing the length in at the top is only optimal if half of all queries to the list are queries of its length, but that seems implausibly high. It would also say that reserving the first element of the array to represent the length is only optimal if a length query is only as common as a query to a randomly selected element, which seems implausibly low. Which of these two extremes is less bad? The former is better from the PoV of changing things as little as possible, the latter is better from the PoV of simplicity.
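
As a toy illustration of the Huffman-style heuristic behind these numbers (an access-optimal tree puts a leaf accessed with probability p at depth about -log2(p); the function name is illustrative):

```python
import math

def optimal_depth(p: float) -> float:
    # Huffman/entropy heuristic: a leaf accessed with probability p belongs
    # at depth roughly -log2(p) in an access-optimal tree.
    return -math.log2(p)

print(optimal_depth(0.5))      # 1.0  -> length mixed in at the very top
print(optimal_depth(2 ** -40)) # 40.0 -> length as one ordinary slot of a 2**40 list
```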

@dankrad (Contributor) commented Jun 10, 2019

> Huffman theory would say that mixing the length in at the top is only optimal if half of all queries to the list are queries of its length, but that seems implausibly high.

I'm not sure, but if we assume that Merkle proofs typically access one element, and every proof has to come with a proof of the length of the list, then actually half the accesses are to the length.
(We could skip this if we used zero-checking logic instead of checking the length, but that might need nasty special cases for lists where zero is a valid value; not what I would prefer from a simplicity point of view.)

@vbuterin (Contributor, Author)

> and every proof has to come with a proof of the length of the list

Ah, I suppose if we want our list get proofs to return what list[i] would actually return, which is an IndexError if the index is out of bounds, then you're right. OK, I can see the rationale for mixing in at the top, as in the status quo.

@protolambda (Collaborator) commented Jun 12, 2019

Let's see if we can describe the new extended SSZ better/more minimally (or at least my idea of it; this is all just an idea):


Any collection is defined as <header><body>

<body> is always defined as the concatenation of the serialized elements in the collection.

Bitfields require the <body> to be exactly bit-length bits, padded to the next multiple of 8, with the right-most byte left-padded to a full byte.
BigInts are the same as bitfields, but interpreted as a little-endian integer.

There are 4 different types, specialized in <header> and hash-tree-root definition:

  • Known, header: `` (zero bytes), root: merkle(pad_right(elements, M))
    Body size and elements location are known on compile time.
    • fixed-size element containers
    • fixed-size element full vectors
  • Dynamic, root: H(merkle(pad_right(elements, M)), len)
    • Implicit, header: `` (zero bytes)
      Body size known from context (decoders use read bound)
      • fixed-size element dynamic vectors (a "list")
      • Bitfield (new)
    • Explicit: <offset>*
      • variable-size element containers
      • variable-size element full vectors
      • dynamic vectors (a "list")
  • Sparse, root: merkle(pad_sparse(elements, indices, M)):
    • Indirectly, header: <<index><offset>>*
      • Sparse variable-size element vector (new)
    • Directly, header: <index>*
      • Sparse fixed-size element vector (new)
  • Selected: header: <index>, root: H(H(element), index):
    • Option/Union

* indicates zero or more elements, concatenated.

<index>, <offset> and <length> are always 4 byte little endian integers (i.e. uint32).

Offsets always count the bytes up to (but not including) the element.

M is the power of 2 given as the collection limit.

merkle chunkifies the input (partitions elements into 32-byte chunks) if it is of a basic type or a bitfield, then computes the binary Merkle root of the given chunks.

pad_right(elements, M) pads the elements with zeroes to the given power of 2. **
pad_sparse pads the gaps and ends to match the given power of 2. **

**: Optimized away within merkleization in real-world implementations.


  • note on Sparse-Indirectly: we could do <index>*<offset>* too, if we want to merkleize just part of the header (<index>*) efficiently for basic presence proofs (still not quite full inclusion proofs) constructed in limited contexts (contracts?).
  • note on <body>: encoders for bodies can be remixed with different header types to implement all the types easily:
    • read by offsets
    • read by stride (fixed increment per element)
    • read by element size (variable increment per element).
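
To make the Explicit <offset>* form concrete, a rough sketch under my reading above (uint32 offsets counting the bytes of <header><body> up to, but not including, each element; function names are illustrative):

```python
def serialize_explicit(parts: list) -> bytes:
    # <header> = one uint32 offset per element, <body> = concatenated elements.
    header_size = 4 * len(parts)
    offsets, body = [], bytearray()
    for part in parts:
        offsets.append((header_size + len(body)).to_bytes(4, "little"))
        body += part
    return b"".join(offsets) + bytes(body)

def deserialize_explicit(data: bytes) -> list:
    # "Read by offsets": the first offset tells us how big the header is,
    # and hence how many elements there are.
    if not data:
        return []
    count = int.from_bytes(data[:4], "little") // 4
    offs = [int.from_bytes(data[4 * i : 4 * i + 4], "little") for i in range(count)]
    return [data[offs[i] : offs[i + 1] if i + 1 < count else len(data)]
            for i in range(count)]
```

The other two body readers in the note (by stride, by element size) would reuse the same <body> encoding under a different, or empty, header.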

@dankrad (Contributor) commented Jun 12, 2019

I'm very confused by your post... at least it does not feel like a simplification to me ;)
Have we defined sparse objects yet?
Also, that's not how bitfields work (we pad them on the left)

> Dynamic, root: H(len, merkle(pad_right(elements, M)))

Mixing in the length is always on the right

@protolambda (Collaborator)

> Have we defined sparse objects yet?

No, but I posted this serialization earlier in the chat, as it is consistent with offsets, and simple. But improvements are very welcome. (also see note on ordering of indices)

> Also, that's not how bitfields work (we pad them on the left)

See edits; that is what I did previously. But that's not consistent with the current verify-bitfield behavior, which is like a little-endian int. Also, a big little-endian integer sounds more consistent to me. But I considered big-endian for formatting/reading reasons (and prefer it too, as there's no "gap" in the bits when you align a 9-bit integer in bytes).

> Mixing in the length is always on the right

Whoops, wrote it a bit quickly, but you get the idea.

@dankrad (Contributor) commented Jun 13, 2019

> But that's not consistent with the current verify-bitfield behavior, which is like a little-endian int.

Oh yes, I forgot it's little-endian. Then it should be on the right.

@JustinDrake added the milestone:June 30 freeze 🥶 label on Jun 15, 2019
@ethers (Member) commented Jun 20, 2019

Let me add some information here to help others who are just arriving.

What is an SSZ Partial?

An SSZ partial is an object that can stand in for an SSZ object in any function, but which only contains some of the elements in the SSZ object. It also contains Merkle proofs that prove that the values included in the SSZ partial actually are the values from the original SSZ object; this can be verified by computing the Merkle root of the SSZ partial and verifying that it matches the root of the original SSZ object.

Source: https://github.com/ethereum/research/tree/master/ssz_research/partials
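
A minimal sketch of the verification step described above (not the actual partials code), using the generalized-index convention that the root is 1 and the children of node g are 2g and 2g+1, with SHA-256 assumed as the hash:

```python
from hashlib import sha256

def hash_(a: bytes, b: bytes) -> bytes:
    return sha256(a + b).digest()

def verify_branch(leaf: bytes, branch: list, gindex: int, root: bytes) -> bool:
    # `branch` holds the sibling hashes from the leaf's level up to the root.
    node = leaf
    for sibling in branch:
        # The parity of the generalized index says whether the current node
        # is a right child (odd) or a left child (even) of its parent.
        node = hash_(sibling, node) if gindex % 2 else hash_(node, sibling)
        gindex //= 2
    return gindex == 1 and node == root
```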

@JustinDrake (Collaborator)

Addressed in #1180
