Relationship with Protocol Buffers legacy IPFS node format #59

Merged
merged 9 commits into from
Feb 12, 2016

Conversation

mildred
Contributor

@mildred mildred commented Jan 7, 2016

In PR #37, we left an important part of the spec aside: the relationship with protocol buffer serialization. This ought to be described, as its effects may be far-reaching.

TODO items:

  • Decide if we choose a format that requires path component escaping: no escaping needed
  • Decide which special key to use to avoid conflict with path component (@attrs in one proposition, with @ escaping, . in the other proposition): not needed

@jbenet jbenet added the backlog label Jan 7, 2016
@mildred mildred mentioned this pull request Jan 7, 2016
5 tasks
@mildred
Contributor Author

mildred commented Jan 7, 2016

What is added:

Relationship with Protocol Buffers legacy IPFS node format

IPLD has a known conversion with the legacy Protocol Buffers format. This format is defined with the Protocol Buffers syntax as:

message PBLink {
    optional bytes  Hash = 1;
    optional string Name = 2;
    optional uint64 Tsize = 3;
}

message PBNode {
    repeated PBLink Links = 2;
    optional bytes  Data = 1;
}
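
To make the byte-identity requirement concrete, here is a minimal hand-rolled sketch of how such a node could be serialized with the Protocol Buffers wire format. The function names are hypothetical, every optional field is assumed to be present, and writing Links before Data is an assumption taken from the declaration order above, not something this thread confirms.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # MSB set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_field(number: int, wire_type: int, payload: bytes) -> bytes:
    """Prefix a payload with its tag, (field_number << 3) | wire_type."""
    return encode_varint((number << 3) | wire_type) + payload


def encode_bytes(number: int, value: bytes) -> bytes:
    """Wire type 2: length-delimited (bytes, strings, embedded messages)."""
    return encode_field(number, 2, encode_varint(len(value)) + value)


def encode_pblink(hash_: bytes, name: str, tsize: int) -> bytes:
    """Serialize one PBLink; all three fields are assumed to be set."""
    return (
        encode_bytes(1, hash_)
        + encode_bytes(2, name.encode("utf-8"))
        + encode_field(3, 0, encode_varint(tsize))  # wire type 0: varint
    )


def encode_pbnode(links, data: bytes) -> bytes:
    """Serialize a PBNode from (hash, name, tsize) tuples and a data payload.

    ASSUMPTION: Links (field 2) are written before Data (field 1).
    """
    out = b"".join(encode_bytes(2, encode_pblink(*link)) for link in links)
    return out + encode_bytes(1, data)
```

Because the links are concatenated in the order given, swapping two links yields different bytes and therefore a different hash, which is why the conversion has to preserve ordering and duplicates.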

The conversion to the IPLD data model must have the following properties:

  • It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way.
  • When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names.
  • Link names should not conflict with other keys.

There are multiple ways to do this; they are described next.

Current encoding in go-ipld

go-ipld implements the following conversion:

{
  "<Links[0].Name.(escaped)>": {
    "hash": "<Links[0].Hash>",
    "name": "<Links[0].Name>",
    "size": <Links[0].Tsize>
  },
  "<Links[1].Name.(escaped)>": {
    "hash": "<Links[1].Hash>",
    "name": "<Links[1].Name>",
    "size": <Links[1].Tsize>
  },
  ...
  "@attrs": {
    "data": "<Data>",
    "links": [
      {
        "hash": "<Links[0].Hash>",
        "name": "<Links[0].Name>",
        "size": <Links[0].Tsize>
      },
      {
        "hash": "<Links[1].Hash>",
        "name": "<Links[1].Name>",
        "size": <Links[1].Tsize>
      }
    ]
  }
}

Notes:

  • The links array in the @attrs section preserves order and duplicate links, so that the node can be hashed back to the exact same protocol buffer object.

  • The link names are escaped to prevent clashing with the @attrs key.

  • The escaping mechanism transforms the @ character into \@. This mechanism also implies a modification of the path algorithm. When a path component contains the @ character, it must be escaped before looking it up in the IPLD Node object.

    For example, a path /root/first@component/second@component/third_component would look for object root["first\@component"]["second\@component"]["third_component"] (following mlinks when necessary).

  • Links are represented using the hash key instead of mlink as used in this specification. This must be changed.
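
The escaping scheme and the modified path lookup described in these notes could be sketched as follows. This is not the go-ipld code: the function names and the in-memory link dictionaries are hypothetical, mlinks are not followed, and letting the last duplicate name win at the top level is an assumption.

```python
ATTRS_KEY = "@attrs"


def escape_name(name: str) -> str:
    # Prefix every '@' with a backslash so that no link name can equal '@attrs'.
    return name.replace("@", "\\@")


def pb_to_attrs_form(links, data):
    """Build the '@attrs' representation from a list of link dicts.

    Each link is a dict with 'hash', 'name' and 'size' keys (hypothetical shape).
    """
    node = {}
    for link in links:
        # ASSUMPTION: when names repeat, the last link wins at the top level.
        node[escape_name(link["name"])] = dict(link)
    # The full ordered list keeps order and duplicates for the round trip.
    node[ATTRS_KEY] = {"data": data, "links": [dict(link) for link in links]}
    return node


def resolve(node, path: str):
    """Walk a '/'-separated path, escaping each component before lookup."""
    current = node
    for component in path.strip("/").split("/"):
        current = current[escape_name(component)]
    return current
```

With this lookup, the path /@attrs/hash reaches a link that happens to be named @attrs (stored under the escaped key) rather than the metadata section, which is the clash the escaping is meant to prevent.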

Other proposition that does away with escaping

We can imagine another transformation where the link names are not escaped. For example:

{
  "<Links[0].Name>": {
    "mlink": "<Links[0].Hash>",
    "tsize": <Links[0].Tsize>
  },
  "<Links[2].Name>": {
    "mlink": "<Links[2].Hash>",
    "tsize": <Links[2].Tsize>
  },
  ...
  ".": {
    "data": "<Data>",
    "links": [
      "<Links[0].Name>",
      {
        "name": "<Links[1].Name>",
        "mlink": "<Links[1].Hash>",
        "tsize": <Links[1].Tsize>
      },
      "<Links[2].Name>",
      ...
    ]
  }
}

Notes:

  • Very conveniently, we use the key "." to represent data for the current node, and any other key can represent a link. This means that we forbid links from being named "." (a single dot). This is in any case a good thing to do, as the "." element in paths can always be removed (the same way ".." can be replaced by the parent directory).

  • No escaping is needed, and no modification to the path algorithm is needed.

  • Link order is kept by using the link names for links present in the top node. This avoids repeating identical data (even though it is probably generated on the fly).

    Links that cannot be present in the top node (the case for the link named ., which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object.
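
A rough sketch of this proposal and its inverse, under stated assumptions: the function names and the link dictionaries ('hash', 'name', 'tsize' keys) are hypothetical, and when a name is repeated every occurrence is stored in full. The text above leaves open whether the first occurrence could instead stay in the top node.

```python
def pb_to_dot_form(links, data):
    """Convert a list of link dicts into the '.' representation."""
    names = [link["name"] for link in links]
    node, order = {}, []
    for link in links:
        name = link["name"]
        if name != "." and names.count(name) == 1:
            # Unique, allowed name: store at the top level, reference by name.
            node[name] = {"mlink": link["hash"], "tsize": link["tsize"]}
            order.append(name)
        else:
            # Forbidden name '.' or repeated name: keep the link in full.
            order.append(
                {"name": name, "mlink": link["hash"], "tsize": link["tsize"]}
            )
    node["."] = {"data": data, "links": order}
    return node


def dot_form_to_pb(node):
    """Rebuild the ordered link list and data from the '.' representation."""
    links = []
    for entry in node["."]["links"]:
        if isinstance(entry, str):
            # A bare name refers to the top-level entry with that key.
            entry = {"name": entry, **node[entry]}
        links.append(
            {"hash": entry["mlink"], "name": entry["name"], "tsize": entry["tsize"]}
        )
    return links, node["."]["data"]
```

The round trip returns the links in their original order, including duplicates and a link named ".", which is what is needed to regenerate an identical protocol buffer object.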

Other encodings

Speak up in comments if you have other encoding suggestions, or you have another implementation with another encoding.


### Other proposition that does away with escaping

We can imagine another transformation where the link names are not escaped. For example:
Member

i dont like this one (using .) as much. i prefer the @attrs with escaping

Contributor Author

Even considering there is no escaping?

@jbenet
Member

jbenet commented Jan 7, 2016

i think this should go in a separate doc that extends the spec, or as an example of a special case format.

we can have it inform (and modify) the proper spec, but i think including this format in the main document will confuse people? we could link to another doc.

@jbenet
Member

jbenet commented Jan 7, 2016

@mildred thanks for this!! 👍

@mildred
Contributor Author

mildred commented Jan 8, 2016

I think this should go in a separate doc that extends the spec, or as an example of a special case format.

In case we go the route of escaping, this needs to be linked in other parts of the spec as it has impact on it.

@mildred
Contributor Author

mildred commented Jan 8, 2016

Added considerations about escaping in the path section. See also PR #60 for similar changes.

@the8472

the8472 commented Jan 8, 2016

Based on IRC discussion and reading #60, it seems to me that traversing the IPLD structure should not be shoehorned into IP*S paths. It's more of a reflection / metadata thing for the paths themselves, equivalent to calling stat on a file.

Putting arbitrary keys at the top level of an associative array (hash) also scratches me the wrong way.

The metadata could look like this { attrs: {...}, namedlinks: {foo: ..., bar: ...}} while the file path still only contains "foo" without "namedlinks". If you want to access the metadata you could go through a separate namespace, e.g. /ipld/ipfs/<hash>/namedLinks/foo/attrs/

@mildred mildred mentioned this pull request Jan 9, 2016
1 task
@mildred
Contributor Author

mildred commented Jan 10, 2016

Putting arbitrary keys at the top level of an associative array (hash) also scratches me the wrong way.

Especially considering that unixfs paths are separate from IPLD paths according to the latest state of PR #62. Nothing prevents the data from looking like this, as @the8472 suggested:

{
  "data": "<Data>",
  "named-links": {
    "<Links[0].Name>": {
      "mlink": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
    "<Links[1].Name>": {
      "mlink": "<Links[1].Hash.(base58)>",
      "name": "<Links[1].Name>",
      "size": <Links[1].Tsize>
    },
    ...
  },
  "ordered-links": [
    {
      "mlink": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
    {
      "mlink": "<Links[1].Hash.(base58)>",
      "name": "<Links[1].Name>",
      "size": <Links[1].Tsize>
    }
  ]
}

Is there a reason why we couldn't do it this way, @jbenet?
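
For illustration, the layout above could be produced and reversed roughly like this. The function names are hypothetical, hashes are assumed to be base58 strings already, and keeping the first link when a name repeats in named-links is an assumption.

```python
def pb_to_named_ordered(links, data):
    """Convert link dicts ('hash', 'name', 'tsize') into the layout above."""
    ordered = [
        {"mlink": link["hash"], "name": link["name"], "size": link["tsize"]}
        for link in links
    ]
    named = {}
    for entry in ordered:
        # ASSUMPTION: the first link with a given name is the one exposed.
        named.setdefault(entry["name"], entry)
    return {"data": data, "named-links": named, "ordered-links": ordered}


def named_ordered_to_pb(node):
    """Only 'ordered-links' is needed to rebuild the original link list."""
    links = [
        {"hash": entry["mlink"], "name": entry["name"], "tsize": entry["size"]}
        for entry in node["ordered-links"]
    ]
    return links, node["data"]
```

Since link names live one level down, a link named "data" or "ordered-links" cannot collide with the top-level keys, so no escaping is needed.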

@mildred
Contributor Author

mildred commented Jan 10, 2016

Updated the conversion format.

The format is encapsulated after a multicodec header that tells which codec to use. In addition, older applications that do not yet use the multicodec header will transmit a protocol buffer stream. This can be detected by looking at the first byte:

- if the first byte is between 0 and 127, it is a multicodec header
- if the first byte is between 128 and 255, it is a protocol buffer stream
Contributor Author

I believe I am wrong here. The multicodec header length is not limited to one byte, but can be encoded in multiple bytes if it is above 127. Setting the MSB to 1 is just the way varint works.

I also assumed that protocol buffers always started with a byte with MSB set to 1, but I don't know if that is true. Probably not.

So, it's probably not possible to detect in such an easy way whether we are transmitting a multicodec header or a protocol buffer message. I'm currently rewriting this part (as I am implementing it in go-ipld).
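
A small sketch of why the first-byte rule fails in both directions. The helper names are hypothetical, and the header layout used here (varint length, then a path ending in a newline) is an assumption for illustration only. The protobuf tag bytes follow from the wire format: a length-delimited field 1 starts with 0x0a and a length-delimited field 2 starts with 0x12, both below 128.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # MSB set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def naive_is_multicodec(stream: bytes) -> bool:
    """The first-byte rule from the draft: 0..127 means multicodec header."""
    return stream[0] < 128


# A PBNode stream begins with the tag of its first field. Both fields are
# length-delimited (wire type 2), so the tag is (field_number << 3) | 2.
pb_starting_with_data = bytes([(1 << 3) | 2, 2]) + b"hi"   # first byte 0x0a
pb_starting_with_links = bytes([(2 << 3) | 2, 0])          # first byte 0x12

# ASSUMPTION: a header is a varint length followed by a path ending in "\n".
# A hypothetical 200-byte path needs a two-byte varint whose first byte has
# its MSB set.
long_path = b"/" + b"x" * 198 + b"\n"
long_header = encode_varint(len(long_path)) + long_path
```

Under these assumptions the rule classifies both protocol buffer streams as multicodec headers and the long header as a protocol buffer stream, so the first byte alone cannot distinguish the two.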

Fix the paragraph about the first byte that is able to determine if the
data is prefixed by a multicodec or is a protocol buffer object.

The conversion to the IPLD data model must have the following properties:

- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way.
Member

  • "- It MUST be ..."

@jbenet
Member

jbenet commented Feb 11, 2016

minor things, otherwise LGTM

@mildred
Contributor Author

mildred commented Feb 12, 2016

@jbenet I pushed some fixes, also, what do you think of the ordered-links vs named-links issue ?

@jbenet
Member

jbenet commented Feb 12, 2016

#59 (comment)

@mildred
Contributor Author

mildred commented Feb 12, 2016

@jbenet I removed the named links section and made it an option for the implementations as it is not an important part of the spec.

@jbenet
Member

jbenet commented Feb 12, 2016

SGTM! 👍 thanks

jbenet added a commit that referenced this pull request Feb 12, 2016
Relationship with Protocol Buffers legacy IPFS node format
@jbenet jbenet merged commit fc38955 into ipfs:ipld-spec Feb 12, 2016
@jbenet jbenet removed the backlog label Feb 12, 2016
@daviddias daviddias added the IPLD label Mar 14, 2016