
SSZ encoding/decoding for bit lists, bit vectors, recursive types. #1630

Closed
franck44 opened this issue Feb 26, 2020 · 12 comments
Labels
scope:SSZ Simple Serialize

Comments

@franck44

I have tried to understand how bit vectors and bit lists (and other vectors and lists) are serialised but I am afraid I am missing a few steps.
Forgive me if these are silly questions and feel free to ignore and close if not relevant.

First, I had a look at "the implementations" py-ssz and cava and found a few differences compared to the simple-serialize.md guidelines:

  1. for bit vectors, the py-ssz tests seem to consider bit vectors whose size is a multiple of 8, which makes the deserialisation easier. The cava implementation does not seem to deal with bit vectors (nor with any vectors).
    How is the actual length of a bit vector encoded so that we can deserialise correctly?
    For instance, Bitvector(true, true) is encoded as 0x03 in py-ssz and we probably want to decode it as (true, true), but the length is missing from the encoding.

  2. what is the status of lists? According to issue Reforming dynamic lists in SSZ #1160, there was a discussion about removing lists from the SSZ legal types.

  3. lists should have homogeneous types. This does not seem to rule out lists of containers, and containers can have fields that are themselves containers. simple-serialize.md refers to a recursive encoding but I could not find any example in py-ssz nor cava.

  4. Cava provides an encoding for String, but it does not seem to be a type defined in simple-serialize.md. Is String a legal (basic) SSZ type?

  5. it is unclear to me why we need is_fixed_size, and get_size is not defined for variable-sized types. When serialising, every list or vector has a given size. What is the purpose of is_fixed_size?

Where can we get more details about the efficient algorithms to encode these datatypes as sequences of bytes?

Thanks
Franck

@ralexstokes
Member

How is the actual length of a bit vector encoded so that we can deserialise correctly?

The length of the given ssz object is known from the type definition. Refer to the definitions in the phase 0 spec for examples of ssz objects. If you expect a bit vector, you also know its length, so you can determine which bits are set or unset in the encoded data.
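
For instance, here is a minimal sketch (my own illustration, not py-ssz code) of decoding a Bitvector[N] when N comes from the type, using the little-endian bit order described in simple-serialize.md:

```python
def decode_bitvector(data: bytes, n: int) -> list:
    # A Bitvector[N] occupies exactly ceil(N / 8) bytes; N is type
    # information supplied by the reader, not stored in the encoding.
    assert len(data) == (n + 7) // 8
    # Bit i lives in byte i // 8 at bit position i % 8 (little-endian).
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(n)]

# The 0x03 example from the question decodes, as Bitvector[2], to (true, true):
assert decode_bitvector(b"\x03", 2) == [True, True]
```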

what is the status of lists?

we still have lists, the linked issue was resolved by adding a notion of "max length" which helps w/ stable merkleization.

lists should have homogeneous types.

they do; in general you have a List[T] where T is your "homogeneous" type. T can be a container w/ different types of fields, but at the level of the List type constructor you have elements all of one type (namely T).

simple-serialize.md refers to a recursive encoding but I could not find any example in py-ssz nor cava.
are you looking for something like this? https://github.com/ethereum/py-ssz/blob/master/ssz/tools/dump.py

Is String a legal (basic) SSZ type?

no String type in the spec; I'm not familiar enough with cava to speak to what they are doing...

What is the purpose of is_fixed_size?

These functions are part of the offset encoding, where we first encode the fixed parts, then the offsets of the variable parts within the remainder of the encoding, and then the actual data of the variable parts themselves. This allows for efficient retrieval of various parts of the encoding w/o having to decode the entire blob of ssz data.
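
As an illustration, a minimal sketch (a hypothetical helper, not spec code) of that layout for a container with one variable-size field sitting between two fixed-size fields:

```python
import struct

def serialize_container(fixed_a: bytes, variable_b: bytes, fixed_c: bytes) -> bytes:
    # Fixed part: field a, then a 4-byte offset standing in for field b, then field c.
    fixed_part_len = len(fixed_a) + 4 + len(fixed_c)
    # b's offset points just past the fixed part, where b's bytes begin.
    offset_b = struct.pack("<I", fixed_part_len)  # little-endian uint32
    return fixed_a + offset_b + fixed_c + variable_b
```

A reader that only needs field c can jump straight to its known position inside the fixed part without touching b at all.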

@franck44
Author

Thanks a lot @ralexstokes for the detailed explanation.

I have just checked some of "the implementations" in more detail.
Py-ssz seems to follow the spec to encode BitList, but I am a bit concerned that Cava/Tuweni does not. They seem to encode a BitList as the length of the list as an Int32, followed by one Byte per Boolean value in the list.

Isn't the purpose of the SSZ spec to make sure clients can talk to each other
(irrespective of whether they are written in Python or Java)?

Regarding the fixed_size part, do I understand correctly:

  1. the fixed parts and the variable size parts are encoded separately?
  2. where is the actual algorithm in the spec that provides a clear description and examples of how that works?

@ralexstokes
Member

the fixed parts and the variable size parts are encoded separately?

a given ssz object is encoded as a single binary blob, the important bit is just their relative ordering so that it can be efficient to find variable size (sub)elements of some object

where is the actual algorithm in the spec that provides a clear description and examples of how that works?

the serialization is here: https://github.com/ethereum/eth2.0-specs/blob/dev/ssz/simple-serialize.md#vectors-containers-lists-unions

re: cava -- i'd open an issue w/ that repo if something seems off. that discussion does not belong in this repo which should contain client-agnostic concerns.

@franck44
Author

franck44 commented Feb 27, 2020

thanks @ralexstokes.

a given ssz object is encoded as a single binary blob, the important bit is just their relative ordering so that it can be efficient to find variable size (sub)elements of some object

where is the actual algorithm in the spec that provides a clear description and examples of how that works?

the serialization is here: https://github.com/ethereum/eth2.0-specs/blob/dev/ssz/simple-serialize.md#vectors-containers-lists-unions

yes, I had a look at that, but I must admit it is not crystal clear.
First, it seems to generate a list of (lists of bytes) rather than a list of bytes.
Second, it is not clear to me how it deals with recursive objects.
It seems to deal with sequences of objects, which is a different thing.
As a matter of fact, I am unable to "interpolate" from that description how to encode a recursive type.

It is also unclear to me what the deserialiser is supposed to do.
Does it take as input a seq<bytes> AND a target type to deserialise into, or just a seq<bytes>?

re: cava -- i'd open an issue w/ that repo if something seems off. that discussion does not belong in this repo which should contain client-agnostic concerns.

sure. But simple-serialize.md references "the implementations" (including Cava) to find efficient algorithms that can serialise. And as not much information is available in the simple-serialize.md file, "the implementations" are the next port of call for help.
I have reported the issue to the Tuweni repo (Cava now seems to be included in Tuweni).

The specs may benefit from clearer definitions for the recursive types to make sure that every client implements the proper spec.

@franck44
Author

@ralexstokes @MrChico
Found a bit more information on SSZ at
https://github.com/protolambda/eth2-docs/blob/master/eth2-ssz.png

It is still a bit dry. I am trying to write a formal specification (by formal I mean formal methods) of the specs and I'd like to capture the actual specification properly.

@franck44
Author

franck44 commented Mar 23, 2020

@ralexstokes @protolambda
I am trying to understand how to encode say List[T] but I must admit all the information I have managed to access did not help me to understand how it works (and this includes the implementations like py-ssz).

First, is there a proper definition (e.g. an algebraic datatype) of what a fixed-size type is?
Is it a type built from basic types and nesting vectors/containers with only fixed-size type?

Second, is there any example how the encoding works for say List[uint8]?
I have used protolambda's diagrams as a starting point, but I am unable to understand how it works.

How is a List[uint8] encoded? For instance List[8, 111]?
How is a list of lists encoded? For instance List[List[1], List[0,45]]? It seems from the py-ssz implementation that the hash/merkleisation of the elements in the top-level list is used.
Some real simple examples would help to understand how the SSZ machinery works.

Again I have tried to follow some links to get a better understanding, but to no avail.
For instance the go-ssz references seem to provide encodings for "slice, array, struct and pointer data types." These datatypes are not in the SSZ legal types.

Any help would be much appreciated.

@protolambda
Collaborator

@franck44, hey, just got a notification, sorry I missed this issue earlier.

For some more reference on SSZ, check my repo (draft, but mostly complete): https://github.com/protolambda/eth2.0-ssz/

Regarding encoding of lists: SSZ combines two things to derive lengths: type information and scope in bytes. By scope I mean the total byte count that is deserialized, or produced by serializing. And you can recurse deeper: the difference between two offsets (or between the last offset and the end of the parent scope) is itself a scope.

Also, List[8, 111] is confusing: we use List[E, N] as notation for a list with element type E and limit N.

Now, to encode a byte list: [8, 11] -> 080b
Or a list of uint16: 08000b00 (zero bytes last since it is little endian)

Now a bigger list: [8, 11, 3] -> 080b03
There are no delimiters, just 1 total scope of 3 bytes. Or 6 for E = uint16, etc.
Then simply divide the byte length by the element type's length, and you get the element count. This works for every list with a fixed-size element type.
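
In code, that count derivation is just (a sketch, assuming a well-formed scope):

```python
def decode_fixed_list(data: bytes, elem_size: int) -> list:
    # No stored length: the element count falls out of the scope.
    assert len(data) % elem_size == 0
    count = len(data) // elem_size
    return [data[i * elem_size:(i + 1) * elem_size] for i in range(count)]

# [8, 11, 3] as a byte list -> 080b03: scope 3, element size 1, so 3 elements.
assert decode_fixed_list(bytes.fromhex("080b03"), 1) == [b"\x08", b"\x0b", b"\x03"]
```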

For dynamic elements, such as a list of lists, offsets are used: [[1, 2, 3], [4, 5]]

  • List[List[uint8, M], N] as type: 0800 0000 0b00 0000 0102 0304 05
    • 0800 0000: first offset, 4 bytes, a little-endian uint32 with value 8, pointing to the first element ([1, 2, 3]). Note that the offset includes the byte count used for the offsets themselves.
      • from the first offset, the length of the list is derived: offsets themselves are fixed length, so element_count = first_offset // 4
    • 0b00 0000: second offset, 4 bytes, a little-endian uint32 with value 11, pointing to the second element ([4, 5])
    • 0102 0304 05: the contents
  • 010203 is sliced by the two offsets
  • 0405 is sliced by the second offset + the end of scope.
  • Note that the inner lists don't have offsets: their element type is fixed size (uint8), so the scope is sufficient information to determine the length, and element i is located at element_size * i (see the sketch below)
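
Putting that together, a sketch of the offset-based decoding (assuming well-formed input; a real decoder also validates that offsets are in bounds and monotonically increasing):

```python
import struct

def decode_list_of_byte_lists(data: bytes) -> list:
    if not data:
        return []  # an empty scope is an empty list
    first_offset = struct.unpack("<I", data[:4])[0]
    count = first_offset // 4  # offsets are fixed-size, 4 bytes each
    offsets = [struct.unpack("<I", data[4 * i:4 * i + 4])[0] for i in range(count)]
    offsets.append(len(data))  # the end of scope closes the last element
    return [data[offsets[i]:offsets[i + 1]] for i in range(count)]

# [[1, 2, 3], [4, 5]] from the example above:
encoded = bytes.fromhex("08000000" "0b000000" "010203" "0405")
assert decode_list_of_byte_lists(encoded) == [bytes([1, 2, 3]), bytes([4, 5])]
```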

Hope that explains the type info + offsets. Generally we do not nest small dynamic items much, so the offsets are fine and speed up lookups by a lot. Other than offsets (and selectors for the Union type), SSZ is completely free of runtime information, and everything can be derived from the types.

If you need more practical examples, there are some human-readable test vectors here: https://github.com/protolambda/remerkleable/blob/master/remerkleable/test_impl.py
And then a whole much bigger generated set in the Eth2 tests vectors repo.

For more low level code, you can also look at my ZSSZ library: github.com/protolambda/zssz
Or if you prefer safety and more type elegance, the Rust and Nim implementations (of Lighthouse and Nimbus respectively) are good resources.

@protolambda
Collaborator

protolambda commented Mar 23, 2020

Also, regarding Cava, I think it is outdated. We did not have offsets until some time in April 2019 or so.
Here's a more complete table with SSZ implementations: https://github.com/protolambda/eth2.0-ssz/#implementations
And missing there is https://github.com/protolambda/remerkleable
And note that the JS and Java (of Artemis, or now "Teku") have been changing a lot recently to move to an approach similar to remerkleable.

@franck44
Author

@protolambda Thanks a lot for your quick reply. Very much appreciated.

Just to provide you with some context, we are trying to write a formal specification of SSZ/Merkleisation, some implementation of it and prove that it is correct (wrt spec).
At the moment, we have the basic types and bitlist specified and proved, and also have discovered a few bugs in the spec and the implementations.
I'll try to extend our current spec to Lists using your explanations.

The main problem we are facing is navigating the information ... there is a lot out there and it is sometimes inconsistent (for instance the specs refer to the implementations for "efficient algorithms", but the implementations do not seem to agree on how to serialise/deserialise).
And most of the implementations are incomplete (they partially implement the spec).

To write a formal spec, we really need a precise description of what a function should do (e.g. deserialise) rather than how it does it.
I have been trying to find out what deserialise should take as input, but it is still unclear to me.
Is it a sequence of Bytes AND a target type (to deserialise into), or just a sequence of bytes (with the target type to be inferred)?
There are quite a lot of very basic similar questions that do not seem to have obvious answers in the specs (I am not an Eth2 expert, so I may be incorrect).

@protolambda
Collaborator

but the implementations do not seem to agree on how to serialise/deserialise

They definitely should, but only on the surface of serialize/deserialize/hash-tree-root outputs. Internally they can organize however suits them best. There's no such thing as a "canonical representation" of data in memory, only a local understanding of the type structure that it expresses. The type is the API and usage contract, not the implementation. (The spec sets a bad example here; implementation is assumed to be done by clients their own way, meeting the same outputs.)

A client can choose how to deserialize, e.g. directly into a binary tree structure (see remerkleable) or into native data-types of the given language (e.g. Prysm has methods to deserialize into Go structs that also conform to protobuf). And the implementations differ because they target different use-cases and niches, e.g. fast reads/writes of small objects (ZSSZ and the Lighthouse SSZ are very fast here), or re-use of earlier computed merkle work when the state only changes slightly (e.g. remerkleable).

Most implementations (except the archived/outdated ones) specify what is necessary to stay in consensus with the eth2 beacon chain spec. Union for example is not implemented anywhere yet, since it's merely there, unused.

The current spec should be deterministic and unambiguous, although understandably hard to read and understand the finer details from. If you are looking for more description, the diagrams on that eth2-docs repo, and the eth2.0-ssz draft repo I linked should help.

I have been trying to find out what deserialise should take as input

  • The SSZ type is assumed, i.e. it's compile-time, something the reader of the data provides, not the data.
  • The byte-length (also described as scope) of the bytes to deserialize.
  • The actual bytes themselves.
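
In signature form, that amounts to something like (names assumed, just a sketch):

```python
from typing import Type, TypeVar

T = TypeVar("T")

def deserialize(typ: Type[T], data: bytes) -> T:
    # `typ` is compile-time information supplied by the reader, and the
    # scope is simply len(data); neither is stored in the data itself.
    scope = len(data)
    ...  # dispatch on typ: basic type, vector, list, container, union
```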

Regarding usage of these primitives:

  • If you know the byte length, you can even stream-decode the bytes into the in-memory structure. E.g. in the RPC the length of the SSZ bytes is encoded separately.
  • And encoding can be streamed too, since the length can be derived first, and very efficiently so, from all the compile-time information (we have lots of fixed-length types, and the dynamic ones are not deep, so their lengths are easily computed by multiplying the run-time length with the compile-time fixed element size). And as soon as that total byte length is available, the data can be traversed in one go to write to some output. (See ZSSZ for an example, and the sketch after this list.)
  • Unlike JSON, SSZ is very static and can be optimized for during compile-time, and expected input sizes etc. can be computed in advance with exact bounds (dynamic length objects have limits). This is great for optimizing functions, and baking in protections against invalid inputs before actually reading any of the data.
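
For instance, deriving the total byte length up front, for a list one level deep (a sketch under the assumption that elements are either all fixed-size or all variable-size):

```python
OFFSET_BYTES = 4  # offsets are serialized as uint32

def list_serialized_length(elem_count, fixed_elem_size=None, var_elem_lengths=None):
    if fixed_elem_size is not None:
        # run-time length * compile-time element size
        return elem_count * fixed_elem_size
    # one offset per element, plus the variable parts themselves
    return OFFSET_BYTES * elem_count + sum(var_elem_lengths)

assert list_serialized_length(3, fixed_elem_size=2) == 6          # List[uint16, N], 3 elements
assert list_serialized_length(2, var_elem_lengths=[3, 2]) == 13   # [[1, 2, 3], [4, 5]] from above
```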

do not seem to have obvious answers in the specs

Unfortunately the specs are focused more on being unambiguous yet practical than on being explanatory. If you want to read background on choices in the eth2-specs in general, I recommend https://notes.ethereum.org/@djrtwo/Bkn3zpwxB and https://benjaminion.xyz/eth2-annotated-spec/

Also, if you have any link of your work-in-progress (if open source?) please share :)

@protolambda
Collaborator

@franck44 any questions left? Can I close this issue?

@franck44
Author

@protolambda
yes thanks again, you may close this one.
I'll try to use the info to work out how to specify and implement serialise/deserialise lists of fixed size and variable size.
