Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic Issue in Object Grouping #163

Closed
xogeny opened this issue Feb 26, 2018 · 5 comments
Closed

Semantic Issue in Object Grouping #163

xogeny opened this issue Feb 26, 2018 · 5 comments

Comments

@xogeny
Copy link
Contributor

xogeny commented Feb 26, 2018

I noticed, while reviewing the evaluation code, that there is what I think is an incorrect assumption. The evaluation for objects appears to work as follows:

We loop over each input value. We evaluate the key expression in the context of that input value. We then append that input value to the list of values we will use later for that key (more on that in a second). We also note the expression associated with that key and record that as well so we can evaluate that value expression in the context of each of the inputs we are storing away. But here is the problem...the value expression isn't always the same but we only store the first one we run into. I'm pretty sure this isn't correct and it creates some odd issues. For example, imagine our input (for all cases I'm going to consider) is:

[
  {
    "key": "foo",
    "value": 5
  },
  {
    "key": "bar",
    "value": 10
  }
]

Now if I evaluate this: { key: value }, I get pretty much what you might expect, i.e., { "foo": 5, "bar": 10 }.

So far so good. If I evaluate { key: (value+1) }, I also get pretty much what I would expect: { "foo": 6, "bar": 11 }.

Now how about if I evaluate: { key: value, key: value }. Now we have overlapping values for the key expressions. That's actually ok and it works (and I think it is reasonable to allow overlapping key expressions, BTW). It evaluates to:

{
  "foo": [
    5,
    5
  ],
  "bar": [
    10,
    10
  ]
}

I'm no so sure this is the most intuitive result. Some might expect the second to overwrite the first in this case (i.e., some might expect { "foo": 5, "bar": 10 }), but it doesn't. Instead, it appends the results. But at least we get an answer that we can explain even if it might seem unintuitive.

But now things start to misbehave. Imagine I evaluate this instead: { key: value, key: value+1 }. I get the same answer as for { key: value, key: value }. This is because value is tucked away as the expression associated with "foo" and "bar" because it is the first one encountered, so value+1 is never evaluated. But things can actually get worse. If you switch the order and try to evaluate { key: value+1, key: value } you actually get an error that it cannot evaluate value because it isn't a number.

Now, one thing you can do is change the semantics slightly so you actually evaluate the value expressions immediately in the context of each input value (rather than storing them up). This actually makes sense for the value+1 case and seems reasonably intuitive. Said another way, it may be more intuitive to do the evaluate this way because I suspect it is what people would expect. But, this will actually breaks one of the test cases, i.e., the containing the expression:

Account.Order.Product{$string(ProductID): (Price)[0]}

Because the [0] in there become meaningless because Price always evaluates to a scalar rather than an array. So that approach would change the semantics slightly. Presumably the [0] is there specifically to grab the first value.

The alternative is to append not just each successive input value associated with each key but also each unique value expression as well (instead of just the first one). I suspect you'd probably want this alternative but I wanted to thoroughly explain the problem and collect feedback to be sure.

To be honest, implementing these evaluation semantics is a bit complicated. Furthermore, that error for value+1 is pretty confusing and unintuitive. So before working on implementing anything more (I've implemented a couple different approaches already), I thought it would be good to discuss what is most intuitive.

Any comments?

@xogeny
Copy link
Contributor Author

xogeny commented Feb 27, 2018

Looking into this a bit more, I realize that my alternative suggestion of storing the expression with input value (instead of assuming only the first expression encountered for a given key expression) doesn't work for expressions like:

Account.Order.Product{$string(ProductID): (Price)[0]}

But this brings up another issue with this expression. The ()s in the expression are significant. If you evaluate the following expression:

Account.Order.Product{$string(ProductID): Price[0]}

...you get a different answer. This is quite surprising and unintuitive for the reasons I gave in #160.

@xogeny
Copy link
Contributor Author

xogeny commented Feb 27, 2018

So I have another alternative to propose which is to avoid "accumulating" values sharing a key. To understand the proposal, consider the current semantics. If we have the following input:

[
  {
    "key": "foo",
    "value": 5
  },
  {
    "key": "foo",
    "value": 10
  }
]

Because we have a repeated value for key, if we evaluate { key: value } for this input we get:

{
  "foo": [
    5,
    10
  ]
}

Note how jsonata aggregates the values when a key is repeated. I'm not sure this is actually that useful or intuitive. This is why the test suite has to have this expression:

Account.Order.Product{$string(ProductID): (Price)[0]}

...when it encounters multiple values for the same key. Note that it uses the [0] to return only the first element.

Thinking about it, I think a much more intuitive approach would be for { key: value } to simply return:

{
  "foo": 10
}

In this approach, the values don't accumulate. Instead, the last value overwrites any previous values. This has the advantage that a) it doesn't really impact most cases (I suspect people using this feature don't expect to get multiple values for a given key), b) it is more intuitive and easier to reason about, c) it doesn't require adding a [0] to get just the first (or [-1] for the last value), d) it is easier to implement (at least it was for me) and, e) you can get the current (aggregating) behavior by simply using map syntax and passing the result to a special (and explicitly stated and yet to be added) aggregate function, e.g.,

$.{ key: value} ~> $aggregate

The thinking here is that by itself, $.{ key: value} (*note the . in front of the {) would function exactly as it does now and would return:

[
  {
    "foo": 5
  },
  {
    "foo": 10
  }
]

Then, passing this result into this (hypothetical) $aggregate function would merge these into a single object, i.e.,

{
  "foo": [
    5,
    10
  ]
}

It is worth noting that with the semantics I'm proposing, the following two expressions would yield the same results (at least based on my current testing):

{ key: value }
$.{ key: value } ~> $merge

...while, as already noted, the current (v1.5) semantics could be achieved with:

$.{ key: value } ~> $aggregate

The bottom line is that I think we have to do something because the current semantics really do not handle the case of multiple keys properly. This proposal is an attempt to come up with an alternative that gives the same results for most cases, is easy to implement, is intuitive and easy to reason about and handles both major cases (keep only the last value, keep all values).

@andrew-coleman
Copy link
Member

One of the prime motivations for this syntax and semantics was to be able to do the equivalent of an SQL 'group by / aggregation' expression whereby you can create summary data to address questions such as "give me the total sales of product X per region" or "... per sales manager".

An example I have presented in the past involves taking the daily download figures of JSONata from NPM and summarising them as total downloads per month.

Clearly there are some problems with overlapping keys when there are multiple key/value expressions as you have identified. I will try and spend some time to see if there is a way to resolve these without losing sight of the original goal.

As always, many thanks for your insightful analysis on these issues. Please do keep it coming!

@xogeny
Copy link
Contributor Author

xogeny commented Feb 27, 2018

OK, that is useful in understanding the use case. In that case my ~> $aggregate approach wouldn't work because it would be quite awkward to do the $sum or $average types of reductions.

Hmm...this is a bit tricky. The fundamental issue I see here is that you want to have a one-to-one mapping between a particular key value and an expression. But the problem is that the key can itself be an expression so what you really have is a one-to-one mapping between the key expression and the value expression. Here is a simple (albeit contrived) example that triggers the issue. Consider the following input:

[
  { "type": "a", "kind": "a", "value": 0 },
  { "type": "a", "kind": "b", "value": 1 },
  { "type": "b", "kind": "a", "value": 2 },
  { "type": "b", "kind": "b", "value": 3 }
]

Now if we attempt to evaluate this expression:

{ type: $average(value), kind: $sum(value) }

(i.e., compute the average for all values based on their type and compute the sum based on their kind)

We get:

{
  "a": 0.75,
  "b": 9
}

In this case, the "a"s get averaged and the "b"s get summed. But if we change the order of the inputs to:

[
  { "type": "a", "kind": "a", "value": 0 },
  { "type": "b", "kind": "a", "value": 2 },
  { "type": "a", "kind": "b", "value": 1 },
  { "type": "b", "kind": "b", "value": 3 }
]

The same expression evaluates to:

{
  "a": 0.75,
  "b": 2.25
}

This is because in the previous case, the expression associated with "b" was $sum(value) because that was the first value expression that got associated with a key value of "b". But by changing the order, the first entry to evaluate a key to "b" is the $average(value) expression.

Now I understand what you are trying to do and I suspect it could still be achieved using transforms but at a considerable cost in terms of complexity of the expression. Yet another alternative would be to simply detect cases where a given key value wound up having different value expressions associated with it and flagging that as some kind of semantic error. It is a little distasteful (in part because one can never know what input data will flow into the expression and, as such, it would be hard to predict such errors). On the other hand, in thinking about this for a while I was hard pressed to come up with a real-world use case that would actually trigger such a semantic error. So it may be that it would only come up in contrived circumstances. In fact, I suspect even having key expressions (other than string literals) will be somewhat uncommon in the first place and key expressions in the same object that sometimes overlap will be even rarer.

But I think what is clear is that something needs to be done because the current evaluation semantics don't make any sense for these contrived cases. We either need semantics that make sense in all cases or generate semantic errors when it becomes impossible to understand the users intent (as in the mixed key case).

mtiller added a commit to mtiller/jsonata that referenced this issue Mar 7, 2018
@xogeny
Copy link
Contributor Author

xogeny commented Mar 14, 2018

@andrew-coleman and I met and decided that the best approach would be to create a new class of semantic errors that detect when this has happened.

mtiller added a commit to mtiller/jsonata that referenced this issue Mar 23, 2018
mtiller added a commit to mtiller/jsonata that referenced this issue Mar 23, 2018
This includes fixes for jsonata-js#163 (including detection of the new semantic error).
Note that I've changed the code the evaluates object construction to map
more closely to how v1.5+ did it.

Still having issues with the flattening cases, but I wanted to checkpoint here
since all the previously passing tests are working.
mtiller added a commit to mtiller/jsonata that referenced this issue Apr 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants