Relationship with Protocol Buffers legacy IPFS node format #59

mildred · 2016-01-07T09:32:24Z

In PR #37, we left aside an important part of the spec: the relationship with Protocol Buffers serialization. This ought to be described, as its effects may be far-reaching.

TODO items:

Decide if we choose a format that requires path component escaping: no escaping needed
Decide which special key to use to avoid conflict with path component (@attrs in one proposition, with @ escaping, . in the other proposition): not needed

mildred · 2016-01-07T09:35:32Z

What is added:

Relationship with Protocol Buffers legacy IPFS node format

IPLD has a known conversion with the legacy Protocol Buffers format. This format is defined with the Protocol Buffers syntax as:

message PBLink {
    optional bytes  Hash = 1;
    optional string Name = 2;
    optional uint64 Tsize = 3;
}

message PBNode {
    repeated PBLink Links = 2;
    optional bytes  Data = 1;
}

The conversion to the IPLD data model must have the following properties:

- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way.
- When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names.
- Link names should not conflict with other keys.
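
To illustrate why the first property rules out a plain map keyed by link name, here is a minimal Python sketch. The link tuples and hash strings are made up for illustration and are not real multihashes.

```python
# Hypothetical stand-in for PBNode.Links: (name, hash, tsize) tuples.
links = [
    ("a", "QmHash1", 10),
    ("b", "QmHash2", 20),
    ("a", "QmHash3", 30),  # duplicate name, which the protobuf format allows
]

# Naive conversion: key the node by link name only.
naive = {name: {"mlink": h, "tsize": size} for name, h, size in links}

# Rebuilding the link list from the naive map loses information:
# the first "a" link has been overwritten by the second one.
rebuilt = [(name, v["mlink"], v["tsize"]) for name, v in naive.items()]

print(len(links), len(rebuilt))  # 3 2
print(rebuilt == links)          # False
```

Since the rebuilt list differs from the original, re-serializing it could not give back the same bytes, so the hash would change. Any conversion therefore has to carry the full ordered link list somewhere.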

There are multiple ways to do this, which are described next.

Current encoding in go-ipld

go-ipld implements the following conversion:

{
  "<Links[0].Name.(escaped)>": {
    "hash": "<Links[0].Hash>",
    "name": "<Links[0].Name>",
    "size": <Links[0].Tsize>
  },
  "<Links[1].Name.(escaped)>": {
    "hash": "<Links[1].Hash>",
    "name": "<Links[1].Name>",
    "size": <Links[1].Tsize>
  },
  ...
  "@attrs": {
    "data": "<Data>",
    "links": [
      {
        "hash": "<Links[0].Hash>",
        "name": "<Links[0].Name>",
        "size": <Links[0].Tsize>
      },
      {
        "hash": "<Links[1].Hash>",
        "name": "<Links[1].Name>",
        "size": <Links[1].Tsize>
      }
    ]
  }
}

Notes:

- The links array in the @attrs section is there to preserve order and duplicate links, so that the node hashes back to the exact same protocol buffer object.
- The link names are escaped to prevent clashing with the @attrs key.
- The escaping mechanism transforms the @ character into \@. This mechanism also implies a modification of the path algorithm. When a path component contains the @ character, it must be escaped to look it up in the IPLD Node object.

  For example, a path /root/first@component/second@component/third_component would look for object root["first\@component"]["second\@component"]["third_component"] (following mlinks when necessary).
- Links are represented using the hash key instead of mlink as used in this specification. This must be changed.
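
The escaping rule and the modified path lookup described in the notes above could be sketched as follows. This is only an illustration in Python, not the go-ipld code: the function names `escape_component` and `resolve` are hypothetical, mlinks are not followed (every level is assumed to be inlined), and the handling of a literal backslash in a link name is left out because it is not specified here.

```python
def escape_component(component: str) -> str:
    """Escape "@" with a leading backslash, as the notes describe."""
    return component.replace("@", "\\@")


def resolve(node: dict, path: str):
    """Walk a slash-separated path through nested dicts.

    Each path component is escaped before it is used as a key.
    Following mlinks is deliberately left out of this sketch.
    """
    current = node
    for component in path.strip("/").split("/"):
        current = current[escape_component(component)]
    return current


# Hypothetical node tree whose link names contain "@".
node = {
    "root": {
        "first\\@component": {
            "second\\@component": {"third_component": "leaf"},
        },
    },
    "@attrs": {"data": "", "links": []},
}

print(resolve(node, "/root/first@component/second@component/third_component"))
```

With this rule, a link that happens to be named @attrs is stored under the key \@attrs, so it cannot collide with the reserved @attrs key.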

Other proposition that does away with escaping

We can imagine another transformation where the link names are not escaped. For example:

{
  "<Links[0].Name>": {
    "mlink": "<Links[0].Hash>",
    "tsize": <Links[0].Tsize>
  },
  "<Links[2].Name>": {
    "mlink": "<Links[2].Hash>",
    "tsize": <Links[2].Tsize>
  },
  ...
  ".": {
    "data": "<Data>",
    "links": [
      "<Links[0].Name>",
      {
        "name": "<Links[1].Name>",
        "mlink": "<Links[1].Hash>",
        "tsize": <Links[1].Tsize>
      },
      "<Links[2].Name>",
      ...
    ]
  }
}

Notes:

- Very conveniently, we use the key "." to represent data for the current node, and any other key can represent a link. This means that we forbid links from being named ".". This is in any case a good thing to do, as the "." element in paths can always be removed (the same way ".." can be replaced by the parent directory).
- No escaping is needed, and no modification to the path algorithm is needed.
- Link order is kept by using the link names for links present in the top node. This avoids repeating identical data (even though it is probably generated on the fly).
- Links that cannot be present in the top node (the case for the link named ".", which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object.
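
A minimal Python sketch of this second proposition, working on plain dicts instead of real protocol buffer messages. The function names and the input shape (a list of dicts with "name", "hash" and "tsize") are hypothetical. It also assumes that, when a name is repeated, the first occurrence goes in the top node and later ones are written in full, which the description above does not state explicitly.

```python
def to_dot_node(data, links):
    """Build the "."-keyed representation from a data string and a link list."""
    node = {}
    order = []
    for link in links:
        name = link["name"]
        if name == "." or name in node:
            # Cannot live at the top level: keep the link in full.
            order.append({"name": name, "mlink": link["hash"], "tsize": link["tsize"]})
        else:
            node[name] = {"mlink": link["hash"], "tsize": link["tsize"]}
            order.append(name)  # a bare name refers to the top-level entry
    node["."] = {"data": data, "links": order}
    return node


def from_dot_node(node):
    """Recover the data string and the ordered link list."""
    attrs = node["."]
    links = []
    for entry in attrs["links"]:
        if isinstance(entry, str):
            top = node[entry]
            links.append({"name": entry, "hash": top["mlink"], "tsize": top["tsize"]})
        else:
            links.append({"name": entry["name"], "hash": entry["mlink"], "tsize": entry["tsize"]})
    return attrs["data"], links


original = [
    {"name": "a", "hash": "QmHash1", "tsize": 10},
    {"name": "a", "hash": "QmHash2", "tsize": 20},  # duplicate name
    {"name": ".", "hash": "QmHash3", "tsize": 30},  # forbidden at top level
    {"name": "b", "hash": "QmHash4", "tsize": 40},
]
node = to_dot_node("payload", original)
print(list(node.keys()))                            # ['a', 'b', '.']
print(from_dot_node(node) == ("payload", original)) # True
```

The sketch only checks that the link list survives a round trip. Producing an identical byte stream additionally depends on the protocol buffer serializer, which is outside its scope.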

Other encodings

Speak up in comments if you have other encoding suggestions, or you have another implementation with another encoding.

jbenet · 2016-01-07T18:49:59Z

merkledag/ipld.md

+
+### Other proposition that does away with escaping
+
+We can imagine another transformation where the link names are not escaped. For example:


i dont like this one (using .) as much. i prefer the @attrs with escaping

Even considering there is no escaping?

jbenet · 2016-01-07T19:02:10Z

i think this should go in a separate doc that extends the spec, or as an example of a special case format.

we can have it inform (and modify) the proper spec, but i think including this format in the main document will confuse people? we could link to another doc.

jbenet · 2016-01-07T19:02:24Z

@mildred thanks for this!! 👍

mildred · 2016-01-08T07:34:44Z

I think this should go in a separate doc that extends the spec, or as an example of a special case format.

If we go the route of escaping, this needs to be linked from other parts of the spec, as it has an impact on them.

mildred · 2016-01-08T08:19:29Z

Added considerations about escaping in the path section. See also PR #60 for similar changes.

the8472 · 2016-01-08T21:01:34Z

Based on IRC discussion and reading #60, it seems to me that traversing the IPLD structure should not be shoehorned into IP*S paths. It's more of a reflection / metadata thing for the paths themselves. Equivalent to stat-ing a file.

Putting arbitrary keys at the top level of an associative array (hash) also scratches me the wrong way.

The metadata could look like this { attrs: {...}, namedlinks: {foo: ..., bar: ...}} while the file path still only contains "foo" without "namedlinks". If you want to access the metadata you could go through a separate namespace, e.g. /ipld/ipfs/<hash>/namedLinks/foo/attrs/

mildred · 2016-01-10T15:22:45Z

Putting arbitrary keys at the top level of an associative array (hash) also scratches me the wrong way.

Especially considering that unixfs paths are separate from IPLD paths according to the latest state of PR #62. There is nothing preventing the data from looking like this, as @the8472 suggested:

{
  "data": "<Data>",
  "named-links": {
    "<Links[0].Name>": {
      "mlink": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
    "<Links[1].Name>": {
      "mlink": "<Links[1].Hash.(base58)>",
      "name": "<Links[1].Name>",
      "size": <Links[1].Tsize>
    },
    ...
  },
  "ordered-links": [
    {
      "mlink": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
    {
      "mlink": "<Links[1].Hash.(base58)>",
      "name": "<Links[1].Name>",
      "size": <Links[1].Tsize>
    }
  ]
}

Is there a reason why we couldn't do it this way, @jbenet?

Links need not be present at the top level. Having them in a separate map removes all the complexity of key escaping.
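
As a rough Python sketch of that layout, again on plain dicts: the function names are hypothetical, hashes are assumed to be base58 strings already, and keeping the first link for a repeated name in named-links is an assumption (the JSON above does not say which one wins).

```python
def to_named_links_node(data, links):
    """Build a node with separate "named-links" and "ordered-links" sections."""
    ordered = [
        {"mlink": link["hash"], "name": link["name"], "size": link["tsize"]}
        for link in links
    ]
    named = {}
    for entry in ordered:
        # Assumption: for duplicate names, the first link is the one exposed.
        named.setdefault(entry["name"], entry)
    return {"data": data, "named-links": named, "ordered-links": ordered}


def from_named_links_node(node):
    """Only "ordered-links" is needed to rebuild the original link list."""
    links = [
        {"hash": e["mlink"], "name": e["name"], "tsize": e["size"]}
        for e in node["ordered-links"]
    ]
    return node["data"], links


original = [
    {"name": "@attrs", "hash": "QmHash1", "tsize": 10},  # no escaping required
    {"name": "data", "hash": "QmHash2", "tsize": 20},    # no clash with "data"
    {"name": "data", "hash": "QmHash3", "tsize": 30},    # duplicate name
]
node = to_named_links_node("payload", original)
print(node["named-links"]["data"]["mlink"])                  # QmHash2
print(from_named_links_node(node) == ("payload", original))  # True
```

Because link names only ever appear inside named-links, a link called data or @attrs cannot collide with the fixed top-level keys.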

mildred · 2016-01-10T15:43:25Z

Updated the conversion format.

mildred · 2016-01-20T20:52:11Z

merkledag/ipld-compat-protobuf.md

+The format is encapsulated after a multicodec header that tells which codec to use. In addition, older applications that do not yet use the multicodec header will transmit a protocol buffer stream. This can be detected by looking at the first byte:
+
+- if the first byte is between 0 and 127, it is a multicodec header
+- if the first byte is between 128 and 255, it is a protocol buffer stream


I believe I am wrong here. The multicodec header length is not limited to one byte, but can be encoded in multiple bytes if it is above 127. Setting the MSB to 1 is just the way varint works.

I also assumed that protocol buffers always start with a byte with the MSB set to 1, but I don't know if that's true. Probably not.

So it's probably not possible to detect in such an easy way whether we are transmitting a multicodec header or a protocol buffer message. I'm currently rewriting this part (as I am implementing it in go-ipld).
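
A small Python sketch supports this doubt. It uses the standard protocol buffer wire format, where each field starts with a varint tag equal to (field_number << 3) | wire_type and length-delimited fields (bytes, strings, embedded messages) use wire type 2. The helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("varints in this sketch are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the MSB
        else:
            out.append(byte)         # last byte: MSB clear
            return bytes(out)


def field_tag(field_number: int, wire_type: int) -> bytes:
    """Tag that precedes every field in a serialized protobuf message."""
    return encode_varint((field_number << 3) | wire_type)


LENGTH_DELIMITED = 2
data_tag = field_tag(1, LENGTH_DELIMITED)   # PBNode.Data
links_tag = field_tag(2, LENGTH_DELIMITED)  # PBNode.Links

print(data_tag.hex(), links_tag.hex())                     # 0a 12
print(encode_varint(127).hex(), encode_varint(128).hex())  # 7f 8001
```

A non-empty PBNode therefore starts with 0x0a or 0x12, both below 128, while a varint only has its MSB set on the first byte when the value exceeds 127. The first byte alone does not separate the two cases.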

Fix the paragraph about the first byte that is able to determine if the data is prefixed by a multicodec or is a protocol buffer object.

jbenet · 2016-02-11T12:03:41Z

merkledag/ipld-compat-protobuf.md

+
+The conversion to the IPLD data model must have the following properties:
+
+- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way.


"- It MUST be ..."

jbenet · 2016-02-11T12:06:12Z

minor things, otherwise LGTM

mildred · 2016-02-12T09:24:11Z

@jbenet I pushed some fixes. Also, what do you think of the ordered-links vs named-links issue?

jbenet · 2016-02-12T10:46:39Z

#59 (comment)

mildred · 2016-02-12T11:53:21Z

@jbenet I removed the named links section and made it an option for the implementations as it is not an important part of the spec.

jbenet · 2016-02-12T12:25:18Z

SGTM! 👍 thanks

jbenet added the backlog label Jan 7, 2016

mildred mentioned this pull request Jan 7, 2016

WIP: IPLD spec #37

Merged

5 tasks

jbenet reviewed Jan 7, 2016

mildred force-pushed the ipld-spec branch from 4a7d062 to e7e5c51 Compare January 8, 2016 08:18

mildred mentioned this pull request Jan 8, 2016

Separate filesystem merkle-path from IPLD merkle-path #60

Closed

3 tasks

mildred force-pushed the ipld-spec branch from e3c101c to f903dca Compare January 8, 2016 18:47

mildred mentioned this pull request Jan 9, 2016

IPLD CBOR tagging #61

Merged

1 task

mildred added 4 commits January 10, 2016 16:24

Relationship with Protocol Buffers legacy IPFS node format

0381888

Talk about escaping keys in merkle-paths

bc2050c

Move (and update) section about protobuf compat to separate file

89dd82d

Change protocol buffer compatibility format.

5b97e14

Links needs not to be present at the top level. having them in a separate map removes all complexity of key escaping.

mildred force-pushed the ipld-spec branch from f903dca to 5b97e14 Compare January 10, 2016 15:42

mildred reviewed Jan 20, 2016

mildred mentioned this pull request Jan 24, 2016

Implement the IPLD spec ipld/go-ipld-deprecated#20

Open

8 tasks

IPLD Protocol Buffer compatibility: fix errors

33ca56e

Fix the paragraph about the first byte that is able to determine if the data is prefixed by a multicodec or is a protocol buffer object.

mildred force-pushed the ipld-spec branch from 86205a8 to ae592e3 Compare February 9, 2016 21:07

Only keep first alternative.

1ab421b

Do not make use of escaping

d1ceeb3

mildred force-pushed the ipld-spec branch from ae592e3 to d1ceeb3 Compare February 9, 2016 21:10

jbenet reviewed Feb 11, 2016

Minor wording tweaks

c845223

Remove named links section (but leave the possibility open)

ffa001e

jbenet added a commit that referenced this pull request Feb 12, 2016

Merge pull request #59 from mildred/ipld-spec

fc38955

Relationship with Protocol Buffers legacy IPFS node format

jbenet merged commit fc38955 into ipfs:ipld-spec Feb 12, 2016

jbenet removed the backlog label Feb 12, 2016

daviddias added the IPLD label Mar 14, 2016
