Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246

hchargois · 2020-09-10T17:30:54Z

Background

The Terms, Significant Terms and Rare Terms aggregations support the include and exclude options to filter the buckets via either:

a single regexp e.g. "this.*|that.*"
a list of exact terms e.g. ["thisTerm", "thatTerm"]

You can give both an include and an exclude at the same time, but they have to be the same type, both regexp-based or set-based. If you mix and match, you get an error (and this is not documented BTW).

        "reason": "[8:34] [terms] failed to parse field [include]",
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Cannot mix a set-based include with a regex-based method"
        }

The problem

The problem we faced is that we need both a regexp-based include and a set-based exclude. We thought about converting the set-based exclude into a regexp-based exclude so that both could be regexps, like so:

["term1", "term2", "term3"] -> "term1|term2|term3"

And obviously both give exactly the same buckets in the aggregation... But the performance is way worse with the regexp.

To give an idea of how worse the performance is, this is the performance of a typical aggregation that we can make on our index (to allow testing and comparing both set and regexp, I removed any include parameter):

without exclude: ~300 ms
with a set-based exclude containing 100 terms: ~500 ms
with a regexp-based exclude containing the same 100 terms: ~10 s

So, while setting a set-based exclude is just a bit slower than no exclude, the regexp-based exclude is 20 times slower than the set-based one. We can't afford that kind of performance unfortunately.

The proposed solution

We would like to lift that limitation about having to use the same type of include and exclude. We want to be able to mix and match both kinds of include with both kinds of exclude. That way, we could use a regexp when we need the flexibility of one, and use a set when we can, to keep performance high.

I've seen the relevant code (server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/IncludeExclude.java) and just by a bit of refactoring in this file, I've been able to make a proof of concept that achieves that goal. I've confirmed that mixing a regexp and set give the performance we expect, which is much faster that two regexps.

So, I'm opening this issue to gather feedback before hopefully getting the green light to implement this properly in a pull request.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2020-09-10T20:27:40Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal · 2020-09-14T21:04:47Z

Hiya @hchargois, thanks for raising this issue! I'm going to mark this with a discuss label so that the analytics team can chat about it later this week, will get back to you after we've discussed it.

I think the constraint is largely to simplify the code, and for some performance reasons (for most of the specializations, it's easy to check include/exclude at the same time). Splitting include and exclude out into separate checks/specializations should work in theory and probably won't be more expensive, but we might want to work up some benchmarks to verify.

There are also some oddities in there, like iirc num_partitions is only used by the include side.

In any case, it seems like a reasonable request to me... will see what the rest of the team thinks!

hchargois · 2020-09-16T18:05:56Z

Thanks for your reply!

For reference, here is the patch of my proof of concept: hchargois@3d12563

I'm not sure what you mean by "splitting include and exclude out into separate checks/specializations". My idea behind the POC was on the contrary to unify the TermListBacked and AutomatonBacked filters. It was easy enough after realizing that all that's needed to correctly retain the include/exclude precedence is to run the set includes, then the automaton, then the set excludes, in that specific order. All in all, the 2 resulting accept methods are arguably a little bit more complex than the 4 specialized ones but they can handle any combination of include/exclude and overall there's less lines of code and fewer classes.

I didn't try merging the partitions-based filter but I think it shouldn't be too hard either, if desired (I personally don't need it but I guess it may make sense to unify it too).

For the previously supported cases, it does the same computations as before so the performance should be the same. Of course it's just a POC that needs to be refined but I think the general idea is nevertheless correct. And obviously I'm not too attached to this or any particular solution, as long as it solves our performance issue, that would be fine by me.

polyfractal · 2020-09-16T21:19:23Z

Ah, I see. Yeah approach would probably work fine as well :)

We just discussed this a little earlier today in our team meeting, and the consensus is that it seems like a very reasonable enhancement. If you'd like to polish up the POC and send a PR we'd be happy to review, and we can help work up a Rally benchmark on our end if it seems necessary (it might not though depending on the diff, I agree that if it's largely just moving bits around it's probably not necessary).

Thanks for taking this on! :)

polyfractal · 2020-11-11T20:50:33Z

I think this can close, now that #63325 has merged! :)

danhermann added the :Analytics/Aggregations Aggregations label Sep 10, 2020

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 10, 2020

$@polyfractal$ polyfractal added the team-discuss label Sep 14, 2020

$@polyfractal$ polyfractal removed the team-discuss label Sep 16, 2020

hchargois mentioned this issue Oct 6, 2020

Allow mixing set-based and regexp-based include and exclude #63325

Merged

$@polyfractal$ polyfractal closed this as completed Nov 11, 2020

stevejgordon mentioned this issue Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246

Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246

hchargois commented Sep 10, 2020

elasticmachine commented Sep 10, 2020

polyfractal commented Sep 14, 2020

hchargois commented Sep 16, 2020

polyfractal commented Sep 16, 2020

polyfractal commented Nov 11, 2020

Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246

Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246

Comments

hchargois commented Sep 10, 2020

Background

The problem

The proposed solution

elasticmachine commented Sep 10, 2020

polyfractal commented Sep 14, 2020

hchargois commented Sep 16, 2020

polyfractal commented Sep 16, 2020

polyfractal commented Nov 11, 2020