Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246

Closed
hchargois opened this issue Sep 10, 2020 · 5 comments
Closed
Labels
:Analytics/Aggregations Aggregations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@hchargois
Copy link
Contributor

Background

The Terms, Significant Terms and Rare Terms aggregations support the include and exclude options to filter the buckets via either:

  • a single regexp e.g. "this.*|that.*"
  • a list of exact terms e.g. ["thisTerm", "thatTerm"]

You can give both an include and an exclude at the same time, but they have to be the same type, both regexp-based or set-based. If you mix and match, you get an error (and this is not documented BTW).

        "reason": "[8:34] [terms] failed to parse field [include]",
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Cannot mix a set-based include with a regex-based method"
        }

The problem

The problem we faced is that we need both a regexp-based include and a set-based exclude. We thought about converting the set-based exclude into a regexp-based exclude so that both could be regexps, like so:

["term1", "term2", "term3"] -> "term1|term2|term3"

And obviously both give exactly the same buckets in the aggregation... But the performance is way worse with the regexp.

To give an idea of how worse the performance is, this is the performance of a typical aggregation that we can make on our index (to allow testing and comparing both set and regexp, I removed any include parameter):

  • without exclude: ~300 ms
  • with a set-based exclude containing 100 terms: ~500 ms
  • with a regexp-based exclude containing the same 100 terms: ~10 s

So, while setting a set-based exclude is just a bit slower than no exclude, the regexp-based exclude is 20 times slower than the set-based one. We can't afford that kind of performance unfortunately.

The proposed solution

We would like to lift that limitation about having to use the same type of include and exclude. We want to be able to mix and match both kinds of include with both kinds of exclude. That way, we could use a regexp when we need the flexibility of one, and use a set when we can, to keep performance high.

I've seen the relevant code (server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/IncludeExclude.java) and just by a bit of refactoring in this file, I've been able to make a proof of concept that achieves that goal. I've confirmed that mixing a regexp and set give the performance we expect, which is much faster that two regexps.

So, I'm opening this issue to gather feedback before hopefully getting the green light to implement this properly in a pull request.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 10, 2020
@polyfractal
Copy link
Contributor

Hiya @hchargois, thanks for raising this issue! I'm going to mark this with a discuss label so that the analytics team can chat about it later this week, will get back to you after we've discussed it.

I think the constraint is largely to simplify the code, and for some performance reasons (for most of the specializations, it's easy to check include/exclude at the same time). Splitting include and exclude out into separate checks/specializations should work in theory and probably won't be more expensive, but we might want to work up some benchmarks to verify.

There are also some oddities in there, like iirc num_partitions is only used by the include side.

In any case, it seems like a reasonable request to me... will see what the rest of the team thinks!

@hchargois
Copy link
Contributor Author

Thanks for your reply!

For reference, here is the patch of my proof of concept: hchargois@3d12563

I'm not sure what you mean by "splitting include and exclude out into separate checks/specializations". My idea behind the POC was on the contrary to unify the TermListBacked and AutomatonBacked filters. It was easy enough after realizing that all that's needed to correctly retain the include/exclude precedence is to run the set includes, then the automaton, then the set excludes, in that specific order. All in all, the 2 resulting accept methods are arguably a little bit more complex than the 4 specialized ones but they can handle any combination of include/exclude and overall there's less lines of code and fewer classes.

I didn't try merging the partitions-based filter but I think it shouldn't be too hard either, if desired (I personally don't need it but I guess it may make sense to unify it too).

For the previously supported cases, it does the same computations as before so the performance should be the same. Of course it's just a POC that needs to be refined but I think the general idea is nevertheless correct. And obviously I'm not too attached to this or any particular solution, as long as it solves our performance issue, that would be fine by me.

@polyfractal
Copy link
Contributor

Ah, I see. Yeah approach would probably work fine as well :)

We just discussed this a little earlier today in our team meeting, and the consensus is that it seems like a very reasonable enhancement. If you'd like to polish up the POC and send a PR we'd be happy to review, and we can help work up a Rally benchmark on our end if it seems necessary (it might not though depending on the diff, I agree that if it's largely just moving bits around it's probably not necessary).

Thanks for taking this on! :)

@polyfractal
Copy link
Contributor

I think this can close, now that #63325 has merged! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

4 participants