Add indices field to _matchesPosition to specify where in an array a match comes from #5005

Open
wants to merge 2 commits into
base: main
from


LukasKalbertodt

Pull Request

Related issue

https:/orgs/meilisearch/discussions/550

What does this PR do?

This adds an indices field to the objects returned in _matchesPosition. The new field identifies the array element of the document in which the match was found. Previously this information was not available, so users could not tell where inside an array a match originated.

Example document:

{
  "id": "123",
  "names": ["foo", "bar", "catnip"],
  "noarray": "dog cat fox",
  "nested": [
    ["dog", "cat"],
    ["fox", "bear"]
  ]
}

Searching for cat now returns this:

"_matchesPosition": {
  "names": [
    {
      "start": 0,
      "length": 3,
      "indices": [2]
    }
  ],
  "nested": [
    {
      "start": 0,
      "length": 3,
      "indices": [0, 1]
    }
  ],
  "noarray": [
    {
      "start": 4,
      "length": 3
    }
  ]
}

indices has to be an array because arrays can be nested: in that case, several indices are needed to identify the element the match comes from.
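To illustrate how a client could use the new field, here is a minimal Python sketch that resolves one _matchesPosition entry back to the matched text. The helper name `resolve_match` is hypothetical, and the traversal rule (each array encountered while walking the key consumes the next index) is an assumption about how indices line up with the document, not something the PR states for every document shape. Meilisearch documents start and length as byte offsets, so the sketch slices the UTF-8 encoding rather than the string.

```python
def resolve_match(document, key, match):
    """Return the substring that one `_matchesPosition` entry points at.

    Hypothetical client-side helper, not part of Meilisearch.
    """
    # `indices` is absent for matches in fields without arrays.
    indices = list(match.get("indices", []))
    value = document
    for part in key.split("."):
        # Assumption: every array met on the way consumes the next index.
        while isinstance(value, list):
            value = value[indices.pop(0)]
        value = value[part]
    # The field itself may be a (nested) array of strings.
    while isinstance(value, list):
        value = value[indices.pop(0)]
    # `start` and `length` are byte offsets, so slice the encoded bytes.
    raw = value.encode("utf-8")
    start = match["start"]
    return raw[start:start + match["length"]].decode("utf-8")


document = {
    "id": "123",
    "names": ["foo", "bar", "catnip"],
    "noarray": "dog cat fox",
    "nested": [["dog", "cat"], ["fox", "bear"]],
}
matches_position = {
    "names": [{"start": 0, "length": 3, "indices": [2]}],
    "nested": [{"start": 0, "length": 3, "indices": [0, 1]}],
    "noarray": [{"start": 4, "length": 3}],
}

for key, entries in matches_position.items():
    for entry in entries:
        print(key, resolve_match(document, key, entry))
```

The caller has to know the document structure to apply the indices at the right levels, which is the implicit part of this design.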

Alternative API designs

An alternative design would be to include the indices inside the key of _matchesPosition, e.g. foo.bar[2].baz. This is more intuitive to me and puts all "location inside document" information into one place, but has some disadvantages: the index needs to be parsed out of the key, which is annoying for end users. Also, JSON fields can have keys containing [2], so escaping would be necessary. Alternatively, one could insert a dot before the [2] (e.g. foo.bar.[2].baz), which might make parsing easier.
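As a rough idea of the parsing this design would push onto clients, here is a Python sketch that splits such a key into field names and array indices. The key format is only the hypothetical foo.bar[2].baz syntax from the paragraph above, `parse_key` is an illustrative name, and the sketch deliberately ignores escaping, so a field whose name itself contains [2] would be misparsed.

```python
import re

# Either a bracketed array index like "[2]" or a run of characters that is
# neither a dot nor a bracket (a field name). Dots are skipped as separators.
_TOKEN = re.compile(r"\[(\d+)\]|([^.\[\]]+)")


def parse_key(key):
    """Split a key such as "foo.bar[2].baz" into ["foo", "bar", 2, "baz"].

    Hypothetical helper for the alternative key syntax; no escaping support.
    """
    path = []
    for token in _TOKEN.finditer(key):
        index, name = token.groups()
        path.append(int(index) if index is not None else name)
    return path


print(parse_key("foo.bar[2].baz"))
```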

Another alternative could be to just include the full path inside the object, e.g.:

{
  "start": 4,
  "length": 3,
  "path": ["foo", "bar", 2, "baz"]
}

String elements would denote fields in an object; numbers would denote indices into an array. This is nice as all path information is in one place. With this, unlike with the current design (of this PR), you would also not need to know the document structure to understand at what levels the indices actually apply.
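With a full path, client-side resolution could be a plain loop, as in this Python sketch. The path field is the hypothetical one proposed above and is not implemented by this PR; `follow_path`, `matched_text` and the example document are illustrative names and data. As before, start and length are treated as byte offsets.

```python
def follow_path(document, path):
    """Walk a path of object keys (str) and array indices (int)."""
    value = document
    for step in path:
        # Indexing works the same way for dicts (str) and lists (int).
        value = value[step]
    return value


def matched_text(document, match):
    """Return the matched substring for an entry carrying a full `path`."""
    raw = follow_path(document, match["path"]).encode("utf-8")
    start = match["start"]
    return raw[start:start + match["length"]].decode("utf-8")


example_doc = {"foo": {"bar": ["x", "y", {"baz": "dog cat fox"}]}}
example_match = {"start": 4, "length": 3, "path": ["foo", "bar", 2, "baz"]}

print(matched_text(example_doc, example_match))
```

No knowledge of the document structure is needed here, since the type of each path element already says whether it is a field or an index.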

Of course, the disadvantage is that there is duplication with the keys inside _matchesPosition. One could also convert _matchesPosition to an array, but that's quite the breaking change and it would make some use cases more annoying.

In summary: I personally am fine with all three designs. The implicit "you have to know where the indices go" of the current design is not too bad; the parsing of the foo.bar[2].baz approach also seems ok; adding the nicely typed full path also probably doesn't hurt too much thanks to compression. Let me know what you think! I can change this PR to switch to another approach.

@ManyTheFish
Member

Hello @LukasKalbertodt,
is your PR ready for review? If it's not, could you please convert it to a draft PR?

Thanks!

For matches inside arrays, this field holds the indices of the array
elements that matched. For example, searching for `cat` inside
`{ "a": ["dog", "cat", "fox"] }` would return `indices: [1]`. For nested
arrays, this contains multiple indices, starting with the one for the
top-most array. For matches in fields without arrays, `indices` is not
serialized (does not exist) to save space.
@LukasKalbertodt
Author

@ManyTheFish The PR is ready for review (now that CI should be fixed...). The alternatives in the description are just considerations for you, to help decide which path to choose.
