Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Spec refining: specify valid merklepath segments and encoding #67

Closed
warpfork opened this issue Jul 17, 2018 · 4 comments · Fixed by #171
Closed

Spec refining: specify valid merklepath segments and encoding #67

warpfork opened this issue Jul 17, 2018 · 4 comments · Fixed by #171

Comments

@warpfork
Copy link
Contributor

warpfork commented Jul 17, 2018

Mission

It's important to specify precisely what is a valid merklepath segement in IPLD.

The spec currently contains a "TODO: list path resolving restrictions" and this could be improved :)


Why

First, a quick clarification: "merklepath segments" are a distinct concept from "IPLD Selectors". Merklepaths are a specific and limited implementation of IPLD Selectors; they can only specify a traversal to a single object; and importantly, we want them to be serializable in a way that's easy for humans to operate. To quote the current spec for motivations:

IPLD paths MUST layer cleanly over UNIX and The Web (use /, have deterministic transforms for ASCII systems).

(Perhaps "ASCII" is a little over-constrained there. The spec also says "
IPLD paths MUST be universal and avoid oppressing non-english societies (e.g. use UTF-8, not ASCII)" -- we might want to refine those two lines after we tackle the rest of this issue.)

Second of all, just a list of other issues that are fairly closely related to a need for clarity on this subject:

As this list makes evident... we really need to get this nailed down.


Mission, refined

Okay, motivations and intro done. What do we need to do?

(1) Update the spec to be consistently clear about IPLD keys versus valid merklepath segments. This distinction seems to exist already, but it's tricky, so we should hammer it.

(2) Define normal character encoding. (I think it's now well established that this is necessary -- merklepath segments are absolutely for direct human use, so we're certainly speaking of chars rather than bytes; and also unicode is complex and ignoring normalization is not viable.)

(3) Define any blacklisting of any further byte sequences which are valid normalized characters but we nonetheless don't want to see in merklepath segments.

(4) Ensure we're clear about what happens when an IPLD key is valid but as a key but not a merklepath segment (e.g. it's unpathable).

(And one more quick note: a lot of this has been in discussion already as part of sussing out the unixfsv2 spec. In unixfsv2, we've come to the conclusion that some of our path handling rules are quantum-entangled with the IPLD spec for merklepaths. Unixfsv2 may apply more blacklistings of byte sequences which are problematic than IPLD merklepath segements, so we don't have to worry about everything up here; but we do want to spec this first, so we can make sure the Unixfsv2 behavior normalizers are a nice subset of the IPLD merklepath rules.)


Progress

Regarding (1): "just a small matter of writing" once we nail the rest...

Regarding (2): We have an answer and the answer is "NFC". (At least, I think we have an answer with reasonable consensus.) We had a long thread about this in the context of unixfsv2, but entirely applicable here in general. Everyone seems to agree that UTF8 is a sensible place to be and NFC encoding is a sensible, already-well-specified normalization to use. And importantly, in practice, NFC is the encoding seen in practically all documents everywhere, so choosing NFC means we accept the majority of strings unchanged. Whew. dusts off hands

Regarding (3): Lots of example choices in ipfs/kubo#1710 . We need to reify that into a list in the spec.

Regarding (4): Open field?

(I'll update this "progress" section as discussion... progresses.)

@whyrusleeping
Copy link
Contributor

👏 so, it we agree on NFC. The next step is to figure out the list of bad byte sequences for keys? And then to agree on something like "parsing an ipld object with bad keys should error", right?

@Stebalien
Copy link
Contributor

Note: While I agree that UTF8-NFC should be the canonical encoding, some formats, languages, and operating systems my force us to use different encodings. We just have to make sure to use UTF8-NFC when comparing.

@warpfork
Copy link
Contributor Author

agree on something like "parsing an ipld object with bad keys should error", right?

In the unixfsv2 discussions, I actually proposed our implementation should probably take a func(rejected []byte, reason error) (replacement []byte, orHalt error) (exact signature TBD) parameter which defines how we respond to that situation. In some cases, when creating an object and finding a non-joyful key string, we'll want to halt and warn the user. In some cases, we'll have been told by the user to go ahead, but do some normalization. And in some cases, perhaps will simply want to log (especially in early drafts of this, to see how widely occurring tricky scenarios actually are in in the wild). How exactly to type that function might need a little sussing out, but I think its the right approach, in that it doesn't require us to pick just one approach.

@Stebalien
Copy link
Contributor

And then to agree on something like "parsing an ipld object with bad keys should error", right?

IIRC, the consensus is to just not allow them in paths. Just like /.

vmx added a commit that referenced this issue Aug 20, 2019
This report was originally issue #67.

Closes #67.
vmx added a commit that referenced this issue Aug 20, 2019
This report was originally issue #67.

Closes #67.
rvagg pushed a commit that referenced this issue Sep 2, 2019
This report was originally issue #67.

Closes #67.
rvagg pushed a commit that referenced this issue Sep 9, 2019
This report was originally issue #67.

Closes #67.
rvagg pushed a commit that referenced this issue Sep 9, 2019
This report was originally issue #67.

Closes #67.
rvagg pushed a commit that referenced this issue Sep 10, 2019
This report was originally issue #67.

Closes #67.
rvagg pushed a commit that referenced this issue Sep 10, 2019
This report was originally issue #67.

Closes #67.
prataprc pushed a commit to iprs-dev/ipld-specs that referenced this issue Oct 13, 2020
This report was originally issue ipld#67.

Closes ipld#67.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants