Semantic Issue in Object Grouping #163

xogeny · 2018-02-26T22:48:34Z

I noticed, while reviewing the evaluation code, that there is what I think is an incorrect assumption. The evaluation for objects appears to work as follows:

We loop over each input value. We evaluate the key expression in the context of that input value. We then append that input value to the list of values we will use later for that key (more on that in a second). We also note the expression associated with that key and record that as well so we can evaluate that value expression in the context of each of the inputs we are storing away. But here is the problem...the value expression isn't always the same but we only store the first one we run into. I'm pretty sure this isn't correct and it creates some odd issues. For example, imagine our input (for all cases I'm going to consider) is:

[
  {
    "key": "foo",
    "value": 5
  },
  {
    "key": "bar",
    "value": 10
  }
]

Now if I evaluate this: { key: value }, I get pretty much what you might expect, i.e., { "foo": 5, "bar": 10 }.

So far so good. If I evaluate { key: (value+1) }, I also get pretty much what I would expect: { "foo": 6, "bar": 11 }.

Now how about if I evaluate: { key: value, key: value }. Now we have overlapping values for the key expressions. That's actually ok and it works (and I think it is reasonable to allow overlapping key expressions, BTW). It evaluates to:

{
  "foo": [
    5,
    5
  ],
  "bar": [
    10,
    10
  ]
}

I'm no so sure this is the most intuitive result. Some might expect the second to overwrite the first in this case (i.e., some might expect { "foo": 5, "bar": 10 }), but it doesn't. Instead, it appends the results. But at least we get an answer that we can explain even if it might seem unintuitive.

But now things start to misbehave. Imagine I evaluate this instead: { key: value, key: value+1 }. I get the same answer as for { key: value, key: value }. This is because value is tucked away as the expression associated with "foo" and "bar" because it is the first one encountered, so value+1 is never evaluated. But things can actually get worse. If you switch the order and try to evaluate { key: value+1, key: value } you actually get an error that it cannot evaluate value because it isn't a number.

Now, one thing you can do is change the semantics slightly so you actually evaluate the value expressions immediately in the context of each input value (rather than storing them up). This actually makes sense for the value+1 case and seems reasonably intuitive. Said another way, it may be more intuitive to do the evaluate this way because I suspect it is what people would expect. But, this will actually breaks one of the test cases, i.e., the containing the expression:

Account.Order.Product{$string(ProductID): (Price)[0]}

Because the [0] in there become meaningless because Price always evaluates to a scalar rather than an array. So that approach would change the semantics slightly. Presumably the [0] is there specifically to grab the first value.

The alternative is to append not just each successive input value associated with each key but also each unique value expression as well (instead of just the first one). I suspect you'd probably want this alternative but I wanted to thoroughly explain the problem and collect feedback to be sure.

To be honest, implementing these evaluation semantics is a bit complicated. Furthermore, that error for value+1 is pretty confusing and unintuitive. So before working on implementing anything more (I've implemented a couple different approaches already), I thought it would be good to discuss what is most intuitive.

Any comments?

The text was updated successfully, but these errors were encountered:

xogeny · 2018-02-27T00:54:17Z

Looking into this a bit more, I realize that my alternative suggestion of storing the expression with input value (instead of assuming only the first expression encountered for a given key expression) doesn't work for expressions like:

Account.Order.Product{$string(ProductID): (Price)[0]}

But this brings up another issue with this expression. The ()s in the expression are significant. If you evaluate the following expression:

Account.Order.Product{$string(ProductID): Price[0]}

...you get a different answer. This is quite surprising and unintuitive for the reasons I gave in #160.

xogeny · 2018-02-27T01:26:23Z

So I have another alternative to propose which is to avoid "accumulating" values sharing a key. To understand the proposal, consider the current semantics. If we have the following input:

[
  {
    "key": "foo",
    "value": 5
  },
  {
    "key": "foo",
    "value": 10
  }
]

Because we have a repeated value for key, if we evaluate { key: value } for this input we get:

{
  "foo": [
    5,
    10
  ]
}

Note how jsonata aggregates the values when a key is repeated. I'm not sure this is actually that useful or intuitive. This is why the test suite has to have this expression:

Account.Order.Product{$string(ProductID): (Price)[0]}

...when it encounters multiple values for the same key. Note that it uses the [0] to return only the first element.

Thinking about it, I think a much more intuitive approach would be for { key: value } to simply return:

{
  "foo": 10
}

In this approach, the values don't accumulate. Instead, the last value overwrites any previous values. This has the advantage that a) it doesn't really impact most cases (I suspect people using this feature don't expect to get multiple values for a given key), b) it is more intuitive and easier to reason about, c) it doesn't require adding a [0] to get just the first (or [-1] for the last value), d) it is easier to implement (at least it was for me) and, e) you can get the current (aggregating) behavior by simply using map syntax and passing the result to a special (and explicitly stated and yet to be added) aggregate function, e.g.,

$.{ key: value} ~> $aggregate

The thinking here is that by itself, $.{ key: value} (*note the . in front of the {) would function exactly as it does now and would return:

[
  {
    "foo": 5
  },
  {
    "foo": 10
  }
]

Then, passing this result into this (hypothetical) $aggregate function would merge these into a single object, i.e.,

{
  "foo": [
    5,
    10
  ]
}

It is worth noting that with the semantics I'm proposing, the following two expressions would yield the same results (at least based on my current testing):

{ key: value }

$.{ key: value } ~> $merge

...while, as already noted, the current (v1.5) semantics could be achieved with:

$.{ key: value } ~> $aggregate

The bottom line is that I think we have to do something because the current semantics really do not handle the case of multiple keys properly. This proposal is an attempt to come up with an alternative that gives the same results for most cases, is easy to implement, is intuitive and easy to reason about and handles both major cases (keep only the last value, keep all values).

andrew-coleman · 2018-02-27T17:02:15Z

One of the prime motivations for this syntax and semantics was to be able to do the equivalent of an SQL 'group by / aggregation' expression whereby you can create summary data to address questions such as "give me the total sales of product X per region" or "... per sales manager".

An example I have presented in the past involves taking the daily download figures of JSONata from NPM and summarising them as total downloads per month.

Clearly there are some problems with overlapping keys when there are multiple key/value expressions as you have identified. I will try and spend some time to see if there is a way to resolve these without losing sight of the original goal.

As always, many thanks for your insightful analysis on these issues. Please do keep it coming!

xogeny · 2018-02-27T19:27:10Z

OK, that is useful in understanding the use case. In that case my ~> $aggregate approach wouldn't work because it would be quite awkward to do the $sum or $average types of reductions.

Hmm...this is a bit tricky. The fundamental issue I see here is that you want to have a one-to-one mapping between a particular key value and an expression. But the problem is that the key can itself be an expression so what you really have is a one-to-one mapping between the key expression and the value expression. Here is a simple (albeit contrived) example that triggers the issue. Consider the following input:

[
  { "type": "a", "kind": "a", "value": 0 },
  { "type": "a", "kind": "b", "value": 1 },
  { "type": "b", "kind": "a", "value": 2 },
  { "type": "b", "kind": "b", "value": 3 }
]

Now if we attempt to evaluate this expression:

{ type: $average(value), kind: $sum(value) }

(i.e., compute the average for all values based on their type and compute the sum based on their kind)

We get:

{
  "a": 0.75,
  "b": 9
}

In this case, the "a"s get averaged and the "b"s get summed. But if we change the order of the inputs to:

[
  { "type": "a", "kind": "a", "value": 0 },
  { "type": "b", "kind": "a", "value": 2 },
  { "type": "a", "kind": "b", "value": 1 },
  { "type": "b", "kind": "b", "value": 3 }
]

The same expression evaluates to:

{
  "a": 0.75,
  "b": 2.25
}

This is because in the previous case, the expression associated with "b" was $sum(value) because that was the first value expression that got associated with a key value of "b". But by changing the order, the first entry to evaluate a key to "b" is the $average(value) expression.

Now I understand what you are trying to do and I suspect it could still be achieved using transforms but at a considerable cost in terms of complexity of the expression. Yet another alternative would be to simply detect cases where a given key value wound up having different value expressions associated with it and flagging that as some kind of semantic error. It is a little distasteful (in part because one can never know what input data will flow into the expression and, as such, it would be hard to predict such errors). On the other hand, in thinking about this for a while I was hard pressed to come up with a real-world use case that would actually trigger such a semantic error. So it may be that it would only come up in contrived circumstances. In fact, I suspect even having key expressions (other than string literals) will be somewhat uncommon in the first place and key expressions in the same object that sometimes overlap will be even rarer.

But I think what is clear is that something needs to be done because the current evaluation semantics don't make any sense for these contrived cases. We either need semantics that make sense in all cases or generate semantic errors when it becomes impossible to understand the users intent (as in the mixed key case).

xogeny · 2018-03-14T15:01:05Z

@andrew-coleman and I met and decided that the best approach would be to create a new class of semantic errors that detect when this has happened.

This includes fixes for jsonata-js#163 (including detection of the new semantic error). Note that I've changed the code the evaluates object construction to map more closely to how v1.5+ did it. Still having issues with the flattening cases, but I wanted to checkpoint here since all the previously passing tests are working.

…a-js#163) 1034 cases passing, none failing.

mtiller added a commit to mtiller/jsonata that referenced this issue Mar 7, 2018

Two cases skipped because of jsonata-js#163

131bd9c

andrew-coleman mentioned this issue Mar 22, 2018

throw error if multiple group-by expressions evaluate to same key #175

Merged

mtiller added a commit to mtiller/jsonata that referenced this issue Mar 23, 2018

Added semantic error and test cases for jsonata-js#163

fc2386a

mattbaileyuk closed this as completed in #175 Mar 24, 2018

mtiller added a commit to mtiller/jsonata that referenced this issue Apr 13, 2018

All tests passing (except related to issues jsonata-js#160 and jsonat…

d54c34a

…a-js#163) 1034 cases passing, none failing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Semantic Issue in Object Grouping #163

Semantic Issue in Object Grouping #163

xogeny commented Feb 26, 2018

xogeny commented Feb 27, 2018

xogeny commented Feb 27, 2018

andrew-coleman commented Feb 27, 2018

xogeny commented Feb 27, 2018

xogeny commented Mar 14, 2018

Semantic Issue in Object Grouping #163

Semantic Issue in Object Grouping #163

Comments

xogeny commented Feb 26, 2018

xogeny commented Feb 27, 2018

xogeny commented Feb 27, 2018

andrew-coleman commented Feb 27, 2018

xogeny commented Feb 27, 2018

xogeny commented Mar 14, 2018