Clarification on JSONata handling of (nested) singleton arrays #93

Closed
tomklapiscak opened this issue Nov 8, 2017 · 8 comments · Fixed by #118

@tomklapiscak

I am working on a Java implementation of the JSONata engine, and I'm struggling to understand JSONata's treatment of singleton arrays...

From http://docs.jsonata.org/complex.html:

Within a JSONata expression or subexpression, any value (which is not itself an array) and an array containing just that value are deemed to be equivalent.

Firstly, if that is the case, why does the following hold?

[1] = 1 => false (should this not be true?)

So, outside of the '=' operator itself, I suspect the singleton/primitive equivalence is effectively achieved by "unwrapping" singleton arrays as they are referenced by expressions.

I see the following behaviour:

{'a': 1 }.a => 1
{'a': [1] }.a => 1
{'a': [[1]] }.a => 1
{'a': [[[1]]] }.a => [1]
{'a': [[[[1]]]] }.a => [[1]]

So it seems to me that JSONata is applying (at most) two levels of flattening to nested singleton arrays pulled out of an associative array via a path. Why two? Why does it not completely flatten the output array in cases like this?

Next, if I wrap the statements above in an array constructor, I see the following:

[{'a': 1 }.a] => [1]
[{'a': [1] }.a] => [1]
[{'a': [[1]] }.a] => [1]

These results seem reasonable (given that the embedded statement returns 1 in those cases). However:

[{'a': [[[1]]] }.a] => [1]

Should this not be [[1]] (given that the statement inside the array constructor evaluates to [1])?

@andrew-coleman
Member

Hi Tom, I'm really glad you raised this because it has prompted me to revisit one of the core design points of this language in an attempt to put it on a more robust footing. The idea of collating the results of a location path into a flat array is inspired by the semantics of XPath/XQuery. In XQuery, all results (and intermediate values) are contained in a sequence. A sequence is a flat array with the property that combining or attempting to nest sequences always results in a single flattened appended sequence. It also has the property that a singleton sequence is equivalent to the value contained within it and can be used interchangeably within sub-expressions. These two properties have been used in JSONata, but using JSON arrays rather than a new sequence type.

I think that, from a pragmatic point of view, this has been good enough; but your examination as an implementor of this language has exposed a few oddities that need to be ironed out. The issue with all of the expressions you have presented above is that JSONata, in its current implementation, is treating all arrays as if they had this sequence property, regardless of how they were created. On top of that, it is only flattening one level deep in order to allow nested arrays to be created (although you found an example where it was flattening two levels).

This was not really my original intention. I tried to keep the sequence terminology out of the language and only talk about arrays, but I think that was a mistake. So I propose to introduce the concept of a result sequence. It is an array, but it has the properties of flattening and singleton equivalence described above (in common with XPath/XQuery). These result sequences are returned by location paths and any expressions derived from those paths, e.g. Account.Invoice, Account.Invoice.Product.(Price * Quantity). On serializing the result of an expression, these sequences look like ordinary JSON arrays ([...]). The two cases in which an array is returned that is not a sequence are when the expression (or subexpression) is an array constructor (e.g. [1,2,3], [foo, bar]) and when the location path matches an array value in the input data.
E.g. when the input is:

{
  "payload": {
    "foo": 1,
    "bar": [1]
  }
}

payload.foo => 1 and payload.bar => [1] (at the moment they both evaluate to 1).

Making this distinction will make all of your expressions behave as you would expect, e.g.
{'a': [[1]] }.a => [[1]]
while preserving the existing behaviour of querying more typical data (it's interesting to note that only 3 or 4 unit tests out of more than 1000 failed as a result of implementing this change, and they all involved queries like the payload.bar example above).
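
To illustrate what that would look like from the JavaScript API, here is a minimal sketch (assuming the jsonata npm package's synchronous v1.x evaluate call; newer releases return a Promise):

const jsonata = require('jsonata');

const input = {
  payload: {
    foo: 1,
    bar: [1]
  }
};

// A plain value stays a plain value.
jsonata('payload.foo').evaluate(input);  // => 1

// With the proposed result-sequence distinction, an array that exists in the
// input data is returned as-is rather than being unwrapped to its single value.
jsonata('payload.bar').evaluate(input);  // => [1]  (currently 1 on master)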

There is a branch called singleton where I have made this change. It can be loaded in the exerciser using:
J:branch singleton followed by a newline.
As always, please feel free to try this out and comment. Let me know if you find anything that seems inconsistent. This will be a candidate for the v1.5 release.

@tomklapiscak
Author

tomklapiscak commented Nov 21, 2017

Thanks Andrew, that makes sense. I've been playing around with the singleton branch in the exerciser and I'm seeing some strange results. I'll maintain a list of these here as I spot them:

  • Inconsistent results when using inlined context (everything in right hand pane) vs root context (left hand pane).

The following expression, with the path expression applied directly to inlined JSON, seems to be working as I'd expect:
[{"a":[1,2]}, {"a":[3]}].a => [[1,2], [3]] (correct)

However, if I pull out [{"a":[1,2]}, {"a":[3]}] into the left hand pane and run the expression a, I see:
[1,2, [3]] (incorrect)

Another example of this behaviour:
[{"a":[{"b":[1]}, {"b":[2]}]}, {"a":[{"b":[3]}, {"b":[4]}]}].a[0].b => [[1], [3]] (correct)
[{"a":[{"b":[1]}, {"b":[2]}]}, {"a":[{"b":[3]}, {"b":[4]}]}].a.b[0] => [[1], [3]] (incorrect)

But when using root context:
a[0].b => [[1], [3]] (correct)
a.b[0] => [1,2,3,4] (correct)

@andrew-coleman
Member

Thanks Tom. I've just pushed a new commit which hopefully sorts out these inconsistencies.

@cjolif

cjolif commented Dec 4, 2017

@tomklapiscak are you planning to open source that Java port?

@xogeny
Contributor

xogeny commented Jan 11, 2018

I see a potential problem here, but maybe I'm missing something. One of the things I noticed when looking through the evaluation code in jsonata is that it always works on "raw" JavaScript values (rather than using some kind of expression tree AST).

This isn't a real problem, per se. It makes it easy to pass JavaScript values in and get JavaScript values out, and it probably helps a lot with performance because you don't have to box and unbox values into AST nodes. But it is limiting, because it constrains you to a 1:1 mapping to JavaScript values for your semantics. The potential issue I see here is with this notion of a "result sequence".

Because you are operating with pure JavaScript values (and I want to emphasize that I'm not saying that is a bad idea), you won't have the luxury of introducing a new expression tree node to represent them. As such, you'll have to have rules like the ones you've described for keeping track of which arrays had their origins as "result sequences" and which ones were just arrays. My concern is that this could lead to lots of special cases in the evaluation code and that, in turn, could lead to lots of subtle bugs (like the ones @tomklapiscak is describing when switching contexts).
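
To be concrete about what I mean by "keeping track": one hypothetical approach (just a sketch of the idea, not the actual jsonata code) would be to tag such arrays with a non-enumerable flag, so they remain plain JavaScript arrays and still serialize as ordinary JSON:

// Hypothetical sketch: mark "result sequences" with a non-enumerable flag so
// they stay ordinary JavaScript arrays and serialize as plain JSON.
function createSequence(...values) {
  const seq = [...values];
  Object.defineProperty(seq, 'isSequence', { value: true, enumerable: false });
  return seq;
}

function isSequence(value) {
  return Array.isArray(value) && value.isSequence === true;
}

// JSON.stringify(createSequence(1, 2)) === '[1,2]' -- the flag never leaks into
// the serialized output, but the evaluator can still distinguish a path result
// from an array built by the constructor [1, 2].

Even so, every place in the evaluator that creates, copies or combines arrays has to remember to apply or propagate that flag, which is exactly where the special cases creep in.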

But I could be completely off base here.

@shrickus

I'll admit that this whole discussion is intriguing, but it all seemed a bit "academic" to me -- until now...

It seems I'm running into some of these same edge cases, and I'd like to get this group's take on whether what I'm trying to do makes sense, or I'm just not understanding the syntax enough to choose the right incantations...

My use case is a payload of query results from a relational database, which typically has this structure:

{
  "payload": [
    [
      {
        "sensor_data_id": 3,
        "device": "ds18b20",
        "sensor": "temp",
        "value": 19,
        "epoch": 1516121776982
      },
      {
        "sensor_data_id": 6,
        "device": "ds18b20",
        "sensor": "temp",
        "value": 20,
        "epoch": 1516121836985
      },
... additional readings ...
    ],
    [
... additional series ...
    ]
  ]
}

So the payload itself is always an array, but depending on whether the original input is a single query statement or multiple statements, it will contain either one series or more than one. At this point, I cannot find a common syntax that lets me work with both an array containing a single array of results and an array of multiple result arrays.

For example, if I need to get a count of the number of series (which equals the number of query results), I would expect $count to work in both cases -- but instead, I see this:

  • $count(payload) => 3 if there are three series of data points, but
  • $count(payload) => 32 when a single query result array contains 32 points

Effectively, the outer array level has been removed when it only contains a single array!

Likewise, I have to use two different expressions to get an array of device names from the first object in each series (each query result contains data for only one device):

  • [payload.device[0]] => ["ds18b20", "dht11"] when there are two series of data, but
  • [payload[0].device] => ["ds18b20"] returns the correct array of a single device name

So many of my data restructuring tasks have been simplified using JSONata -- but if I have to code different logic for an array of arrays vs. an array containing a single array, then I'll probably stick to pure JavaScript just for my own sanity. Please tell me I'm not understanding something, or making some incorrect assumptions.

Steve

@xogeny
Contributor

xogeny commented Jan 17, 2018

I haven't yet had a chance to dig into the new singleton changes, but I was thinking about how I'd like it to work.

It seems to me that this flattening of arrays and/or promotion of single-element arrays to scalars really only needs to be done in the context of certain operators. The obvious ones are . and [ (predicate). In a sense, the flattening isn't applied automatically to all expressions, only as a pre-processing step for some operators.

So to @shrickus' point, something like $count(payload) is a bit confusing because it seems as though this "promote a single-element array to a scalar" rule is kicking in here. But do we always want this flattening to occur before all functions? Or would it be better to have something in the type signature of a function to indicate whether it expects a flattened or non-flattened array, and handle that during evaluation (depending on the function)?

Also, you get funny cases (and I saw one in the singleton branch) where if you have, for example, an array constructor at the end of a path, you suppress the normalization. I think it would probably be better to think of this normalization as "happening only before certain operations" rather than "not happening after certain operations"; I think you'll end up with fewer edge cases.
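
To make that concrete, here is a rough sketch of "normalization as a pre-processing step owned by certain operators" (the names and behaviour here are hypothetical, not taken from the jsonata source):

// Hypothetical sketch: the '.' (map) operator normalizes its left-hand side
// before iterating, instead of every expression normalizing its own output.
function normalize(value) {
  if (value === undefined) return [];
  if (!Array.isArray(value)) return [value];  // promote a scalar to a one-item sequence
  return value;                               // leave constructed arrays untouched
}

function evaluateMap(lhs, mapFn) {
  // Flattening happens here, as part of the operator, rather than being a
  // property that every value in the system has to carry around.
  const results = [];
  for (const item of normalize(lhs)) {
    const mapped = mapFn(item);
    if (mapped === undefined) continue;
    if (Array.isArray(mapped)) results.push(...mapped);  // append, flattening one level
    else results.push(mapped);
  }
  return results;
}

// evaluateMap({ a: 1 }, (o) => o.a)              => [1]
// evaluateMap([{ a: 1 }, { a: 2 }], (o) => o.a)  => [1, 2]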

I was hoping to spend some time this afternoon looking over the singleton changes to see if I could find some semantics that had fewer edge cases but didn't break any tests/coverage. But @shrickus' comments have me wondering whether we might actually want to change the semantics a little to make them more intuitive.

@andrew-coleman
Member

@shrickus I agree that the $count(payload) behaviour seems a bit odd - I'll dig into that one a bit more.

The reason [payload.device[0]] => ["ds18b20", "dht11"] works is that the [] operator binds tighter than the ., so this expression says 'for each item in the payload array, select the first item in the device array'. What you are after (I think) is the first device in all of the payload.device arrays. You can get this using (payload.device)[0].
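
For anyone following along, here is a small illustration of that binding difference using simplified, made-up data rather than the payload above (a sketch assuming the jsonata npm package's synchronous v1.x evaluate; the expected results follow the documented predicate semantics):

const jsonata = require('jsonata');

const data = { a: [ { b: [1, 2] }, { b: [3, 4] } ] };

// The predicate binds to the last path step, so [0] is applied to each
// b array in turn: the first item of [1, 2] and the first item of [3, 4].
jsonata('a.b[0]').evaluate(data);    // => [1, 3]

// Parentheses group the whole path first, so [0] is applied to the
// combined sequence [1, 2, 3, 4].
jsonata('(a.b)[0]').evaluate(data);  // => 1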
