From 33e48ff720790ea2be5187c8323d1ed2ad40a3ae Mon Sep 17 00:00:00 2001
From: Mildred Ki'Lya
Date: Fri, 8 Jan 2016 20:40:51 +0100
Subject: [PATCH 1/6] Describe CBOR tagging

---
 merkledag/ipld.md | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/merkledag/ipld.md b/merkledag/ipld.md
index bb30bbd63..8f3adba35 100644
--- a/merkledag/ipld.md
+++ b/merkledag/ipld.md
@@ -308,11 +308,37 @@ On the subject of integers, there exist a variety of formats which represent int

 IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes.

+## Serialised CBOR with tags
+
+IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4):
+
+- ``: **[If key escaping is necessary]** The string that follows (major type 2 or 3) is interpreted as an escaped string (of the same major type). Every occurrences of `\` are considered to be `\\`, and every occurrences of `@` are considered to be `\@`.
+
+- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string.
+
+- ``: the text string (major type 2) that follows (or the byte string tagged with ``) is to be interpreted with "`/ipfs/`" added in front of the string.
+
+- ``: an array (major type 4) must follow. The array must have two elements: a text string (or a byte string tagged using ``) followed by a map (major type 5). This whole must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string contained in the array as value.
+
+**FIXME:** register tags with IANA.
+
+When encoding an IPLD node to CBOR with tags, these tags must be included whenever possible, and avoided if not necessary. This will ensure a unique encoding across implementations. More specifically (and in this order):
+
+- If map key is a text string `s` and `escape(unescape(s)) == s`, then this string is transformed to `unescape(s)` and tagged with ``
+
+- if a text string starts with "`/ipfs/`", this prefix is removed and the string is tagged with ``.
+
+- If a text string contains a valid base58 encoded value, it is converted to a binary string and tagged with ``
+
+- If a map contains an entry which key is the text string "`link`", this entry is removed from the map, an array is created containing the entry value and the map, and this array is prefixed by the tag ``. The result is used in place of the map.
+
+When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags.
+
 ### Canonical Format

 In order to preserve merkle-linking's power, we must ensure that there is a single **_canonical_** serialized representation of an IPLD document. This ensures that applications arrive at the same cryptographic hashes. It should be noted --though-- that this is a system-wide parameter. Future systems might change it to evolve representations. However we estimate this would need to be done no more than once per decade.

-**The IPLD Canonical format is _canonicalized CBOR_.**
+**The IPLD Canonical format is _canonicalized CBOR with tags_.**

 The legacy canonical format is protocol buffers.


From 9c22d9a1005ae65a61ce423eae613c73490a8dd4 Mon Sep 17 00:00:00 2001
From: Mildred Ki'Lya
Date: Sun, 10 Jan 2016 15:44:53 +0100
Subject: [PATCH 2/6] CBOR tagging: simplify tagging and remove key escapes management

---
 merkledag/ipld.md | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/merkledag/ipld.md b/merkledag/ipld.md
index 8f3adba35..bb7a7bef1 100644
--- a/merkledag/ipld.md
+++ b/merkledag/ipld.md
@@ -312,25 +312,38 @@ IPLD supports a variety of serialized data formats through [multicodec](https://

 IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4):

-- ``: **[If key escaping is necessary]** The string that follows (major type 2 or 3) is interpreted as an escaped string (of the same major type). Every occurrences of `\` are considered to be `\\`, and every occurrences of `@` are considered to be `\@`.
+- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string (using the IPFS alphabet).

-- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string.
+- ``: an array (major type 4) of two or three elements (link prefix, link hash (optional) and map) must follow:

-- ``: the text string (major type 2) that follows (or the byte string tagged with ``) is to be interpreted with "`/ipfs/`" added in front of the string.
+  - The link prefix must be an integer representing the first path of the link, or a text string appended at the beginning of the link. Available integers are:
+    - `1`: represents the prefix `/ipfs/`

-- ``: an array (major type 4) must follow. The array must have two elements: a text string (or a byte string tagged using ``) followed by a map (major type 5). This whole must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string contained in the array as value.
+  - The link hash, either a text string to be appended after the link prefix, or a tag `` followed by the binary string representing the hash digest.

-**FIXME:** register tags with IANA.
+  - a map

-When encoding an IPLD node to CBOR with tags, these tags must be included whenever possible, and avoided if not necessary. This will ensure a unique encoding across implementations. More specifically (and in this order):
+  The link value is constructed by concatenating the link prefix and the link hash (if present) in its text form.

-- If map key is a text string `s` and `escape(unescape(s)) == s`, then this string is transformed to `unescape(s)` and tagged with ``
+  This must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string representing the link formed by the first two elements of the array. When iterating over the map, this entry must appear first.

-- if a text string starts with "`/ipfs/`", this prefix is removed and the string is tagged with ``.
+**TODO:**

-- If a text string contains a valid base58 encoded value, it is converted to a binary string and tagged with ``
+- [ ] register tags with IANA.
+- [ ] specify tags we use for escaping (if we want to store escaped string in unescaped form)

-- If a map contains an entry which key is the text string "`link`", this entry is removed from the map, an array is created containing the entry value and the map, and this array is prefixed by the tag ``. The result is used in place of the map.
+When encoding an IPLD node to CBOR with tags, some conversion steps are necessary (in this order):
+
+- if a map contains an entry which key is the text string `link` and the value is a text string, the map is converted to a link object:
+
+  - the `link` entry is removed from the map
+  - if the link value cannot be split in a prefix and a base58 suffix, an array is created with the link value (a text string) and the transformed map.
+  - else, the link is split in a textual prefix and a base58 binary digest (the base58 value is decoded) and an array with the prefix, the `` followed by the binary hash, and the transformed map is created
+  - the original map is transformed to a `` followed by the array just created
+
+- if a text string is a canonical base58 representation of a binary string, the text string is converted to binary and `` is added at the beginning
+
+- if a text string is a canonical base64 representation (with no stray characters) of a binary string, the text string is converted to binary and the tag `22` (defined in RFC7049) is added at the beginning

 When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags.

From ccadf1d3f023a178c0813e029d9def2bb5baffb5 Mon Sep 17 00:00:00 2001
From: Mildred Ki'Lya
Date: Tue, 9 Feb 2016 23:02:42 +0100
Subject: [PATCH 3/6] Simple CBOR format

---
 merkledag/ipld.md | 38 +++++++++++++-------------------------
 1 file changed, 13 insertions(+), 25 deletions(-)

diff --git a/merkledag/ipld.md b/merkledag/ipld.md
index bb7a7bef1..cca5526aa 100644
--- a/merkledag/ipld.md
+++ b/merkledag/ipld.md
@@ -310,42 +310,28 @@ IPLD supports a variety of serialized data formats through [multicodec](https://

 ## Serialised CBOR with tags

-IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4):
+IPLD objects can be represented using cbor using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4).

-- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string (using the IPFS alphabet).
+A tag `` is defined. This tag must be followed by an array (major type 4) containing two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5).

-- ``: an array (major type 4) of two or three elements (link prefix, link hash (optional) and map) must follow:
+When encoding an IPLD object to CBOR, every map that contain a link key is transformed to a `` followed by the array containing the link and then containing the CBOR version of the map without the link key.

-  - The link prefix must be an integer representing the first path of the link, or a text string appended at the beginning of the link. Available integers are:
-    - `1`: represents the prefix `/ipfs/`
+- if the link key is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is stored as a binary multiaddress as the first array item.
+- else, the link is stored as text as the first array item.

-  - The link hash, either a text string to be appended after the link prefix, or a tag `` followed by the binary string representing the hash digest.
+When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed.

-  - a map
+- If the first array item is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly.
+- The map that follows is augmented with a new pair. The key is the standard IPLD link property, the value is the link in its textual format.
+- When iterating over this augmented map, the link property must come first and not in any other order. This guarantee a consistent ordering.
+- This augmented map is used instead of the `` in the IPLD output.

-  The link value is constructed by concatenating the link prefix and the link hash (if present) in its text form.
-
-  This must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string representing the link formed by the first two elements of the array. When iterating over the map, this entry must appear first.
+When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags.

 **TODO:**

 - [ ] register tags with IANA.
-- [ ] specify tags we use for escaping (if we want to store escaped string in unescaped form)
-
-When encoding an IPLD node to CBOR with tags, some conversion steps are necessary (in this order):
-
-- if a map contains an entry which key is the text string `link` and the value is a text string, the map is converted to a link object:
-
-  - the `link` entry is removed from the map
-  - if the link value cannot be split in a prefix and a base58 suffix, an array is created with the link value (a text string) and the transformed map.
-  - else, the link is split in a textual prefix and a base58 binary digest (the base58 value is decoded) and an array with the prefix, the `` followed by the binary hash, and the transformed map is created
-  - the original map is transformed to a `` followed by the array just created

-- if a text string is a canonical base58 representation of a binary string, the text string is converted to binary and `` is added at the beginning
-
-- if a text string is a canonical base64 representation (with no stray characters) of a binary string, the text string is converted to binary and the tag `22` (defined in RFC7049) is added at the beginning
-
-When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags.

 ### Canonical Format

@@ -353,6 +339,8 @@ In order to preserve merkle-linking's power, we must ensure that there is a sing

 **The IPLD Canonical format is _canonicalized CBOR with tags_.**

+Users of this format should not expect any specific ordering of the keys, as the keys might be ordered differently in non canonical formats.
+
 The legacy canonical format is protocol buffers.

 This canonical format is used to decide which format to use when creating the object for the first time and computing its hash. Once the format is decided for an IPLD object, it must be used in all communications so senders and receivers can check the data against the hash.

From 7619b64e5305ef74893c144bbe02a1a39f365c07 Mon Sep 17 00:00:00 2001
From: Mildred Ki'Lya
Date: Wed, 10 Feb 2016 12:55:43 +0100
Subject: [PATCH 4/6] Change wording and don't store an empty map when links have no attributes

---
 merkledag/ipld.md | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/merkledag/ipld.md b/merkledag/ipld.md
index cca5526aa..2c8d7bf40 100644
--- a/merkledag/ipld.md
+++ b/merkledag/ipld.md
@@ -310,27 +310,35 @@ IPLD supports a variety of serialized data formats through [multicodec](https://

 ## Serialised CBOR with tags

-IPLD objects can be represented using cbor using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4).
+IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4).

-A tag `` is defined. This tag must be followed by an array (major type 4) containing two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5).
+A tag `` is defined. This tag must be followed by an array (major type 4) containing one or two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5) and can be omitted if the map is empty. The canonical format is to omit this map if it is empty.

-When encoding an IPLD object to CBOR, every map that contain a link key is transformed to a `` followed by the array containing the link and then containing the CBOR version of the map without the link key.
+When encoding an IPLD object to CBOR, every IPLD object can be considered to be encoded using `` using this algorithm:

-- if the link key is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is stored as a binary multiaddress as the first array item.
-- else, the link is stored as text as the first array item.
+- If the IPLD object doesn't contain a link property, it is encoded in CBOR as a map.
+- If the IPLD object contain a link property but it is not a string, it is encoded in CBOR as a map.
+- The link property is extracted and the object is converted to a map that don't contain the link.
+- If the link is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2).
+- Else, the link is stored as text (major type 3)
+- A CBOR array is constructed containing the link as first item
+- If the map created earlier is not empty, the map is added to the array as its second item
+- The array is prefixed by the ``, this is the final CBOR representation of a link.

-When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed.
+When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed by the following algorithm:

 - If the first array item is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly.
-- The map that follows is augmented with a new pair. The key is the standard IPLD link property, the value is the link in its textual format.
-- When iterating over this augmented map, the link property must come first and not in any other order. This guarantee a consistent ordering.
-- This augmented map is used instead of the `` in the IPLD output.
+- If the array contains a second item (which should be a map), it is extracted. Else an empty map is created.
+- The map is augmented with a new key value pair. The key is the standard IPLD link property, the valus is the string containing the link.
+- This map should be interpreted as an IPLD object instead of the tag.
+- When iterating over the map in its canonical form, the link must be come before every other key even if the canonical CBOR order says otherwise.
+
+When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers should be able to use an optimized reading process to detect links using these tags.

-When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags.

 **TODO:**

-- [ ] register tags with IANA.
+- [ ] register tag with IANA.


 ### Canonical Format
@@ -339,6 +347,8 @@ In order to preserve merkle-linking's power, we must ensure that there is a sing

 **The IPLD Canonical format is _canonicalized CBOR with tags_.**

+The canonical CBOR format must follow rules defines in [RFC 7049 section 3.9](http://tools.ietf.org/html/rfc7049#section-3.9) in addition to the rules defined here.
+
 Users of this format should not expect any specific ordering of the keys, as the keys might be ordered differently in non canonical formats.

 The legacy canonical format is protocol buffers.
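
Reviewer note on the ordering rule in PATCH 4/6: the patch asks for the RFC 7049 section 3.9 key order, except that the link key must come first. The sketch below is only an illustration of that rule for text-string keys, not code from this series. The names `encode_text_key` and `canonical_key_order` are hypothetical, and the link key name `@link` is an assumption taken from the surrounding spec text.

```python
import struct


def encode_text_key(key: str) -> bytes:
    """Encode a text string as a CBOR major type 3 item (definite length)."""
    data = key.encode("utf-8")
    n = len(data)
    if n < 24:
        head = bytes([0x60 | n])
    elif n < 0x100:
        head = b"\x78" + struct.pack(">B", n)
    elif n < 0x10000:
        head = b"\x79" + struct.pack(">H", n)
    elif n < 0x100000000:
        head = b"\x7a" + struct.pack(">I", n)
    else:
        head = b"\x7b" + struct.pack(">Q", n)
    return head + data


def canonical_key_order(keys, link_key="@link"):
    """Order keys as RFC 7049 section 3.9 does, but force the link key first.

    RFC 7049 canonical order compares the encoded keys: shorter encodings
    sort earlier, equal lengths are compared bytewise.
    `link_key` is an assumed name for the IPLD link property.
    """
    def sort_key(key):
        encoded = encode_text_key(key)
        return (key != link_key, len(encoded), encoded)

    return sorted(keys, key=sort_key)


print(canonical_key_order(["name", "b", "@link", "aa", "a"]))
```

Without the exception, `@link` would sort last in this example because its encoding is the longest, which is why the patch states the rule explicitly.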
From 1a3f4a99cfb85d047741892c5554e16d897f9ba5 Mon Sep 17 00:00:00 2001
From: Mildred Ki'Lya
Date: Wed, 10 Feb 2016 13:15:13 +0100
Subject: [PATCH 5/6] Don't require the tag to be followed by an array if there are no properties

---
 merkledag/ipld.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/merkledag/ipld.md b/merkledag/ipld.md
index 2c8d7bf40..4acdc42ce 100644
--- a/merkledag/ipld.md
+++ b/merkledag/ipld.md
@@ -312,7 +312,10 @@ IPLD supports a variety of serialized data formats through [multicodec](https://

 IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4).

-A tag `` is defined. This tag must be followed by an array (major type 4) containing one or two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5) and can be omitted if the map is empty. The canonical format is to omit this map if it is empty.
+A tag `` is defined. This tag can be followed by:
+
+- a text string (major type 3) or byte string (major type 2) corresponding to the link target. This is the canonical format for links with no link properties.
+- an array (major type 4) containing as first element the link target (text or binary string) and as optional second argument the link properties (a map, major type 5)

 When encoding an IPLD object to CBOR, every IPLD object can be considered to be encoded using `` using this algorithm:

@@ -321,15 +324,15 @@ When encoding an IPLD object to CBOR, every IPLD object can be considered to be
 - The link property is extracted and the object is converted to a map that don't contain the link.
 - If the link is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2).
 - Else, the link is stored as text (major type 3)
-- A CBOR array is constructed containing the link as first item
-- If the map created earlier is not empty, the map is added to the array as its second item
-- The array is prefixed by the ``, this is the final CBOR representation of a link.
+- If the map created earlier is empty, the resulting encoding is the `` followed by the CBOR representation of the link
+- If the map is not empty, the resulting encoding is the `` followed by an array of two elements containing the link followed by the map

-When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed by the following algorithm:
+When decoding CBOR and converting it to IPLD, each occurences of `` is transformed by the following algorithm:

-- If the first array item is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly.
-- If the array contains a second item (which should be a map), it is extracted. Else an empty map is created.
-- The map is augmented with a new key value pair. The key is the standard IPLD link property, the valus is the string containing the link.
+- If the following value is an array, its elements are extracted. First the link followed by the link properties. If there are no link properties, an empty map is used instead.
+- Else, the following value must be the link, which is extracted. The link properties are created as an empty map.
+- If the link is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly.
+- The map of the link properties is augmented with a new key value pair. The key is the standard IPLD link property, the value is the textual string containing the link.
 - This map should be interpreted as an IPLD object instead of the tag.
 - When iterating over the map in its canonical form, the link must be come before every other key even if the canonical CBOR order says otherwise.


From dfb8903b182586a6e8ebdf764bbd4694ccc31161 Mon Sep 17 00:00:00 2001
From: Mildred Ki'Lya
Date: Thu, 11 Feb 2016 21:53:22 +0100
Subject: [PATCH 6/6] Minor edits

---
 merkledag/ipld.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/merkledag/ipld.md b/merkledag/ipld.md
index 4acdc42ce..9e14a378f 100644
--- a/merkledag/ipld.md
+++ b/merkledag/ipld.md
@@ -308,7 +308,7 @@ On the subject of integers, there exist a variety of formats which represent int

 IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes.

-## Serialised CBOR with tags
+### Serialised CBOR with tags

 IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4).

@@ -320,8 +320,8 @@ A tag `` is defined. This tag can be followed by:
 When encoding an IPLD object to CBOR, every IPLD object can be considered to be encoded using `` using this algorithm:

 - If the IPLD object doesn't contain a link property, it is encoded in CBOR as a map.
-- If the IPLD object contain a link property but it is not a string, it is encoded in CBOR as a map.
-- The link property is extracted and the object is converted to a map that don't contain the link.
+- If the IPLD object contains a link property but it is not a string, it is encoded in CBOR as a map.
+- The link property is extracted and the object is converted to a map that doesn't contain the link.
 - If the link is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2).
 - Else, the link is stored as text (major type 3)
 - If the map created earlier is empty, the resulting encoding is the `` followed by the CBOR representation of the link
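
Reviewer note on the series as a whole: the encoding and decoding steps that PATCH 5/6 and 6/6 end up with can be sketched as below. This is a minimal illustration under stated assumptions, not an implementation of the spec. The tag number is not given in the text and is not registered, so `LINK_TAG` is a placeholder. The key name `@link` is assumed. `demo_to_binary` and `demo_to_text` are toy stand-ins for a multiaddress codec and do not produce the real multiaddress wire format. `Tagged` is a hypothetical model of a CBOR tagged item, and nested objects are not processed recursively.

```python
from dataclasses import dataclass
from typing import Any, Optional

LINK_KEY = "@link"  # assumed name of the IPLD link property
LINK_TAG = 0xFFFF   # placeholder: the patch leaves the tag number undefined


@dataclass
class Tagged:
    """Hypothetical stand-in for a CBOR tagged item (tag number plus content)."""
    tag: int
    value: Any


def demo_to_text(binary: bytes) -> str:
    """Toy inverse of demo_to_binary, NOT the real multiaddress format."""
    return "/ipfs/" + binary[1:].decode("utf-8")


def demo_to_binary(text: str) -> Optional[bytes]:
    """Toy codec: return bytes only if the text round-trips exactly."""
    if not text.startswith("/ipfs/"):
        return None
    binary = b"\x01" + text[len("/ipfs/"):].encode("utf-8")
    return binary if demo_to_text(binary) == text else None


def encode_link_node(node: dict, to_binary=demo_to_binary):
    """Apply the encoding steps to one IPLD object (no recursion)."""
    link = node.get(LINK_KEY)
    if not isinstance(link, str):
        return node  # no link property, or not a string: plain map
    props = {k: v for k, v in node.items() if k != LINK_KEY}
    binary = to_binary(link)
    target = binary if binary is not None else link
    # Empty properties: tag followed by the link alone; else a 2-item array.
    return Tagged(LINK_TAG, [target, props] if props else target)


def decode_link_node(item, to_text=demo_to_text):
    """Apply the decoding steps to one item (no recursion)."""
    if not isinstance(item, Tagged) or item.tag != LINK_TAG:
        return item
    if isinstance(item.value, list):
        target = item.value[0]
        props = item.value[1] if len(item.value) > 1 else {}
    else:
        target, props = item.value, {}
    text = to_text(target) if isinstance(target, bytes) else target
    # Python dicts keep insertion order, so the link key iterates first.
    return {LINK_KEY: text, **props}


node = {LINK_KEY: "/ipfs/QmExample", "name": "child"}  # made-up link value
encoded = encode_link_node(node)
print(encoded)
print(decode_link_node(encoded))
```

The round-trip check inside `demo_to_binary` mirrors the condition in the patch that the binary form is used only when converting back yields exactly the same text; otherwise the link stays a text string.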