-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Semantic Issue in Object Grouping #163
Comments
Looking into this a bit more, I realize that my alternative suggestion of storing the expression with input value (instead of assuming only the first expression encountered for a given key expression) doesn't work for expressions like:
But this brings up another issue with this expression. The
...you get a different answer. This is quite surprising and unintuitive for the reasons I gave in #160. |
So I have another alternative to propose which is to avoid "accumulating" values sharing a key. To understand the proposal, consider the current semantics. If we have the following input:
Because we have a repeated value for
Note how
...when it encounters multiple values for the same key. Note that it uses the Thinking about it, I think a much more intuitive approach would be for
In this approach, the values don't accumulate. Instead, the last value overwrites any previous values. This has the advantage that a) it doesn't really impact most cases (I suspect people using this feature don't expect to get multiple values for a given key), b) it is more intuitive and easier to reason about, c) it doesn't require adding a
The thinking here is that by itself,
Then, passing this result into this (hypothetical)
It is worth noting that with the semantics I'm proposing, the following two expressions would yield the same results (at least based on my current testing):
...while, as already noted, the current (
The bottom line is that I think we have to do something because the current semantics really do not handle the case of multiple keys properly. This proposal is an attempt to come up with an alternative that gives the same results for most cases, is easy to implement, is intuitive and easy to reason about and handles both major cases (keep only the last value, keep all values). |
One of the prime motivations for this syntax and semantics was to be able to do the equivalent of an SQL 'group by / aggregation' expression whereby you can create summary data to address questions such as "give me the total sales of product X per region" or "... per sales manager". An example I have presented in the past involves taking the daily download figures of JSONata from NPM and summarising them as total downloads per month. Clearly there are some problems with overlapping keys when there are multiple key/value expressions as you have identified. I will try and spend some time to see if there is a way to resolve these without losing sight of the original goal. As always, many thanks for your insightful analysis on these issues. Please do keep it coming! |
OK, that is useful in understanding the use case. In that case my Hmm...this is a bit tricky. The fundamental issue I see here is that you want to have a one-to-one mapping between a particular key value and an expression. But the problem is that the key can itself be an expression so what you really have is a one-to-one mapping between the key expression and the value expression. Here is a simple (albeit contrived) example that triggers the issue. Consider the following input: [
{ "type": "a", "kind": "a", "value": 0 },
{ "type": "a", "kind": "b", "value": 1 },
{ "type": "b", "kind": "a", "value": 2 },
{ "type": "b", "kind": "b", "value": 3 }
] Now if we attempt to evaluate this expression:
(i.e., compute the average for all values based on their type and compute the sum based on their kind) We get: {
"a": 0.75,
"b": 9
} In this case, the
The same expression evaluates to:
This is because in the previous case, the expression associated with Now I understand what you are trying to do and I suspect it could still be achieved using transforms but at a considerable cost in terms of complexity of the expression. Yet another alternative would be to simply detect cases where a given key value wound up having different value expressions associated with it and flagging that as some kind of semantic error. It is a little distasteful (in part because one can never know what input data will flow into the expression and, as such, it would be hard to predict such errors). On the other hand, in thinking about this for a while I was hard pressed to come up with a real-world use case that would actually trigger such a semantic error. So it may be that it would only come up in contrived circumstances. In fact, I suspect even having key expressions (other than string literals) will be somewhat uncommon in the first place and key expressions in the same object that sometimes overlap will be even rarer. But I think what is clear is that something needs to be done because the current evaluation semantics don't make any sense for these contrived cases. We either need semantics that make sense in all cases or generate semantic errors when it becomes impossible to understand the users intent (as in the mixed key case). |
@andrew-coleman and I met and decided that the best approach would be to create a new class of semantic errors that detect when this has happened. |
This includes fixes for jsonata-js#163 (including detection of the new semantic error). Note that I've changed the code the evaluates object construction to map more closely to how v1.5+ did it. Still having issues with the flattening cases, but I wanted to checkpoint here since all the previously passing tests are working.
…a-js#163) 1034 cases passing, none failing.
I noticed, while reviewing the evaluation code, that there is what I think is an incorrect assumption. The evaluation for objects appears to work as follows:
We loop over each input value. We evaluate the
key
expression in the context of that input value. We then append that input value to the list of values we will use later for that key (more on that in a second). We also note the expression associated with that key and record that as well so we can evaluate that value expression in the context of each of the inputs we are storing away. But here is the problem...the value expression isn't always the same but we only store the first one we run into. I'm pretty sure this isn't correct and it creates some odd issues. For example, imagine our input (for all cases I'm going to consider) is:Now if I evaluate this:
{ key: value }
, I get pretty much what you might expect, i.e.,{ "foo": 5, "bar": 10 }
.So far so good. If I evaluate
{ key: (value+1) }
, I also get pretty much what I would expect:{ "foo": 6, "bar": 11 }
.Now how about if I evaluate:
{ key: value, key: value }
. Now we have overlapping values for the key expressions. That's actually ok and it works (and I think it is reasonable to allow overlapping key expressions, BTW). It evaluates to:I'm no so sure this is the most intuitive result. Some might expect the second to overwrite the first in this case (i.e., some might expect
{ "foo": 5, "bar": 10 }
), but it doesn't. Instead, it appends the results. But at least we get an answer that we can explain even if it might seem unintuitive.But now things start to misbehave. Imagine I evaluate this instead:
{ key: value, key: value+1 }
. I get the same answer as for{ key: value, key: value }
. This is becausevalue
is tucked away as the expression associated with"foo"
and"bar"
because it is the first one encountered, sovalue+1
is never evaluated. But things can actually get worse. If you switch the order and try to evaluate{ key: value+1, key: value }
you actually get an error that it cannot evaluatevalue
because it isn't a number.Now, one thing you can do is change the semantics slightly so you actually evaluate the
value
expressions immediately in the context of each input value (rather than storing them up). This actually makes sense for thevalue+1
case and seems reasonably intuitive. Said another way, it may be more intuitive to do the evaluate this way because I suspect it is what people would expect. But, this will actually breaks one of the test cases, i.e., the containing the expression:Account.Order.Product{$string(ProductID): (Price)[0]}
Because the
[0]
in there become meaningless becausePrice
always evaluates to a scalar rather than an array. So that approach would change the semantics slightly. Presumably the[0]
is there specifically to grab the first value.The alternative is to append not just each successive input value associated with each key but also each unique value expression as well (instead of just the first one). I suspect you'd probably want this alternative but I wanted to thoroughly explain the problem and collect feedback to be sure.
To be honest, implementing these evaluation semantics is a bit complicated. Furthermore, that error for
value+1
is pretty confusing and unintuitive. So before working on implementing anything more (I've implemented a couple different approaches already), I thought it would be good to discuss what is most intuitive.Any comments?
The text was updated successfully, but these errors were encountered: