Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

dag-json: cleanup and fully specify reserved space #356

Merged
merged 4 commits into from
May 19, 2021

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jan 25, 2021

I've been working ok figuring out how to codify the rules that we discussed sometime last meeting .. during some meeting. I can't find the particular meeting by looking at the minutes in https:/ipld/team-mgmt so I'm doing this from memory. But the main points were:

  • Strictly only allow certain forms in the {"/":...} namespace so that tokenizing parsers didn't have to buffer and unroll too much data but can quickly reject form without getting too deep.
  • Remove the multibase prefix from the encoded bytes.

After working through some of this, I've flipped on my opinion of the second point and would rather it stay as proper multibase, even if it's still strictly base64. It adds clarity and there's scope for using this for further validation if we want to. It's a single byte and I think it increases the clarity and reduces the number of edge cases. So I haven't included that change here, but discussion is welcome if you feel strongly!

The bit to pay most attention to below is "The Reserved Namespace" where I've tried to enumerate the rules. It turns out to be quite hard to explain it with crystal clarity, but that could be me so suggestions welcome!

Along with this, as exploration, I've been working on a version of JS DAG-JSON that does tokenized decoding and uses the same base encoder/decoder as the new JS DAG-CBOR codec, just by swapping out the backend. There's some consistency benefits here in having the same code paths and same affordances for applying data model strictness. Still uncertain whether I'll actually propose that we merge this, although I did uncover some bugs in the current implementation in the process, but the code is here: https:/ipld/js-dag-json/compare/rvagg/cborg (see the innards of DagJsonTokenizer for the token parsing rules that arise from the reserved space rules).

block-layer/codecs/dag-json.md Outdated Show resolved Hide resolved
block-layer/codecs/dag-json.md Outdated Show resolved Hide resolved
block-layer/codecs/dag-json.md Outdated Show resolved Hide resolved
Comment on lines 71 to 73
* Maps with more than one key, where the first key is `"/"` and its value is a string. e.g. `{"/":"foo","bar":"baz"}`.
- Where a key exists that sorts before `"/"`, the map is valid, e.g. `{"0bar":"baz","/":"foo"}`.
- Where the value of the `"/"` entry is not a string, the map is valid, e.g. `{"/":true,"bar":"baz"}`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me awhile to grasp that the main item is the rule to reject something and that the sub-items are rules that would not be rejected.

Perhaps it could be expended and we give the rules even names. Example:

Multiple keys

Rejection criteria: Maps with more than one key, where the first key is "/" and its value is a string.

Examples for invald DAG-JSON:

  • {"/":"foo","bar":"baz"}.
  • {"/":"foo","bar":true}.

Examples for valid DAG-JSON:

  • {"0bar":"baz","/":"foo"}: There is a key that sorts before "/". The value following the "/" will not be interpreted as link.
  • {"/":true,"bar":"baz"}: There value after the "/" is not a string.
  • {"/":"foo"}: There is only a the single map key "/"`. The value is expected to be a valid base encoded CID.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've addressed this, but worth a review

block-layer/codecs/dag-json.md Outdated Show resolved Hide resolved
@warpfork
Copy link
Contributor

I went digging for our discussion notes (and merged some stale PRs) and found them here:

https:/ipld/team-mgmt/blob/master/meeting-notes/2020/2020-11-05--ipld-specs.md#dag-json-bytes-and-links-discuss-parsing-rules-edge-cases

@rvagg rvagg force-pushed the rvagg/dag-json-cleanup branch 2 times, most recently from 881cad3 to 0911186 Compare May 8, 2021 08:12
@rvagg rvagg marked this pull request as ready for review May 8, 2021 08:13
@rvagg
Copy link
Member Author

rvagg commented May 8, 2021

I've addressed the reviews so far and I think this is ready for landing. The reserved namespace rules are hard to be clear about but I've done the shuffle recommended by @vmx and separated things out a little more to hopefully make it easier to digest.

I went with the .0 rule for floats with no fractional component - although this is no help to JavaScript but Go can implement it.

I also have a working JavaScript version that I'm pretty much ready to release. It uses the same backend as the DAG-CBOR codec but with JSON semantics. We've lost some speed in the process but we get all the reserved namespace rules and all the strictness.

@warpfork care to review the changes?

Copy link
Member

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewrite of the "Parse rejection modes" reads great now!

JavaScript itself that has trouble with them. This means JS implementations
of DAG-JSON can't use the native JSON parser and serializer if integers bigger
than `2^53 - 1` should be supported.
Data Model floats that do not have a fractional component should be encoded **with** a decimal point, and will therefore be distinguishable from an integer during round-trip. (Note that since JavaScript still cannot distinguish a float from an integer where the number has no fractional component, this rule will not impact JavaScript encoding or decoding).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the note about JavaScript is needed as it does exactly that when encoding/decoding JSON:

> JSON.stringify(2.1)
'2.1'
> JSON.stringify(2.0)
'2'
> JSON.parse('2.1')
2.1
> JSON.parse('2.0')
2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, you're making my point @vmx. The spec is saying that 2.0 should be encoded as 2.0 but JavaScript can only do 2 since we can't distinguish. Then in the reverse the spec says that a 2.0 should be decoded as a "float" but JavaScript still can't distinguish so a 2 and a 2.0 would come out as the same thing, regardless. Maybe one day we'll retain the information that "this was a float in the encoded form" but there's no infrastructure to attach that information today.

So the note basically says "JavaScript can't conform to this part of the spec". The spec says "... floats that do not have a fractional component should be encoded with a decimal point".

@rvagg
Copy link
Member Author

rvagg commented May 11, 2021

As per a brief discussion today:

  • go-ipld-prime just had "bytes" implemented @ Allow emitting & parsing of bytes per dagjson codec spec go-ipld-prime#166 but this was done without the multibase prefix as per the current version of the spec
  • We have previously discussed removal of the multibase prefix but this spec change didn't include that -- see the original post for a discussion on that and why it was removed from scope.
  • The current Rust implementation has a multibase prefix and Forest has some test fixtures using dag-json, and the current JavaScript dag-json has a multibase prefix too.

So either we should fix go-ipld-prime to match spec or we just bite the bullet and remove it and get it done everywhere—I can do this in JavaScript pretty quickly after merging this.

@warpfork @vmx @willscott (and anyone else) 👍 for just remove multibase prefix and 👎 for leave the spec alone darn it!.

@vmx
Copy link
Member

vmx commented May 11, 2021

My reasons for having the multibase prefix:

  • consistency: the CIDs are also Multibases. The rule would be: all base encoded things are Multibases (how could would it be if that would be true across all our specs?)
  • we do self-describing things: to me having a multibase isn't necessarily a signal that any arbitrary Multibase is allowed, it's for being self-described.
  • not performance: if you care about performance, don't use DAG-JSON. An even if you care about allocations, for base encodings you need to over-allocate slightly anyway, so just allocate one byte more for the prefix
  • why change now: generally we are hesitate to change things in specs that are already out there, unless we really have to. We don't know who else has implemented that spec already. So why changing the status quo if it doesn't really make a practical difference (except for one of our own libraries we control that didn't follow the spec).

All that said, I don't have a strong opinion and can live with either way.

@rvagg
Copy link
Member Author

rvagg commented May 11, 2021

at first I couldn't get them making the same bytes and thought there must be a multibase incompatibility (the horror!) but it's []byte('deadbeef') being deceptive cause that's actually a string, not a hex value in spite of the nice hex characters..

also, sorting, ipld-prime don't sort and that engages hostilities in @warpfork and is not fun for compatibility fixtures 🤦 and I kind of don't want to have that argument over and over so will skip it for now.

But basically, it's not a big deal to do it in go-ipld-prime. It's also not a big deal to strip it in @ipld/dag-json, but I'm also a soft -1 to changing the spec and leaving the prefix in there for reasons stated in this thread and restated by @vmx. But, it's not a big enough deal if someone else has particularly strong feelings and justification.

@warpfork
Copy link
Contributor

warpfork commented May 11, 2021

Arguments against inserting multibase in this bytes field, just to have them written:

  • it will become "another one of those bytes" that anyone wanting compactness will question and question and question again -- just like the extra "zero" byte in CIDs in DAG-CBOR which confuses, surprises, and annoys every single person who ever has cause to notice it.
    • I've stopped saving references to every time this has happened about that zero in DAG-CBOR, but it's... many times. People (me included) repeatedly implement DAG-CBOR without it, and then go "what?!" when a fixture points out that it's needed; or people think it's a bug, and try to remove it again later on in development cycles! And this regularly comes from people working "down the hall" from us, not just distant strangers. Even if we nevermind the excess bytes on the wire, the number of human hours this has wasted over time does not sit well with me. I'd like to not do this again.
  • having a multibase indicator would suggest it's acceptable to use other multibase indicators here... which it is not; in turn, it is not desirable to all multiple multibases to be used here because it would provide negative value if DAG-JSON were to be more expressive than the IPLD Data Model in this position.
  • holistically: I think multibases are to be used when you've got a string and it's fairly user-facing, and there are no other disambiguators or indicators in the area that could help you know what that value is. This is not the scenario we have in the middle of dag-json: we already have 1) the multicodec indicator, from the outside, 2) all the information/confidence provided by the success of parsing as dag-json throughout the block, 3) the information/confidence provided by a single-key map with a "/" key, and 4) the information/confidence provided by a single-key map with a "bytes" key, and then 5) the confidence provided by the success of parsing the value therein as base64. This is a huge amount of confidence and a ton of non-accidental structure, and it seems to me heaping on another byte just to "be sure" it's clear what base is expected here would be wild overkill.
    • The (admittedly rather fine) distinction I'm trying to make here is that the "natural form" of this data is bytes -- it's not a string of some multibase, it's bytes -- and dag-json is in the business of specifying how those bytes are encoded in the serial format. I think that we have a goal of using indicators like multibase in contexts where there are not yet indicators; but in the the middle of a codec's serial forms is not such a context.
      • Can you imagine if we put a "zero" prefix on every bytes scalar in DAG-CBOR, too, just to make it really clear that they're bytes?
  • We argued about this before in https:/ipld/team-mgmt/blob/master/meeting-notes/2020/2020-11-05--ipld-specs.md#votes.
    • I distinctly remember trying to hold a vote and do an actual formal count -- and now do regret not forcing that! -- but as the notes say, and we reviewed and merged, in video, we had entirely assenting voices for removing multibase. There were more than a half dozen people there, fully engaged, and having just collaboratively reviewed the entire subject.
    • That meeting had more people and more of their attention dedicated to it than this re-review today has got, so I'd prefer to stick with it.

@rvagg
Copy link
Member Author

rvagg commented May 12, 2021

fine, you appear to have a stronger opinion than either of us (although I'm not sure I'm convinced it's all that strong)

have pushed a new version that removes it and will do likewise in the JS impl

@rvagg
Copy link
Member Author

rvagg commented May 12, 2021

(PTAL & merge if 👍)

rvagg added a commit to ipld/js-dag-json that referenced this pull request May 12, 2021
rvagg added a commit to ipld/js-dag-json that referenced this pull request May 13, 2021
@rvagg rvagg merged commit 9bab36f into master May 19, 2021
@rvagg rvagg deleted the rvagg/dag-json-cleanup branch May 19, 2021 05:30
rvagg added a commit to ipld/js-dag-json that referenced this pull request May 19, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants