[KQL] Add regex support #46855

Closed
Tracked by #166068
Skoetting opened this issue Sep 27, 2019 · 24 comments
Labels: enhancement, Feature:KQL, Feature:Search, Icebox, impact:high, loe:x-large, Team:DataDiscovery

Comments

@Skoetting

Skoetting commented Sep 27, 2019

Describe the feature:

KQL currently supports wildcard queries using the * character to denote "zero or more characters". It does not support ? to denote "one character", nor does it support searching using full regular expressions.

It would be nice if KQL supported searching using regular expressions. Internally, it could leverage Elasticsearch regexp queries or regex inside a query string query.

The syntax could be something like optionalFieldName: /my-regex-pattern/. (We will need to have a migration so that queries already using this syntax are escaped.)

Related: #126532
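
For illustration, the translation proposed above could be sketched roughly as follows. This is only a sketch under stated assumptions: the function name `kql_regex_to_es` and the tiny clause grammar are hypothetical and not part of Kibana. The output uses the Elasticsearch `regexp` query when a field is named and a `query_string` query otherwise.

```python
import re

# Hypothetical surface syntax from the proposal: optionalFieldName: /pattern/
_REGEX_CLAUSE = re.compile(
    r"^\s*(?:(?P<field>[^\s:]+)\s*:\s*)?/(?P<pattern>.*)/\s*$"
)


def kql_regex_to_es(clause: str) -> dict:
    """Translate one `field: /pattern/` clause into Elasticsearch query DSL."""
    m = _REGEX_CLAUSE.match(clause)
    if m is None:
        raise ValueError(f"not a regex clause: {clause!r}")
    field = m.group("field")
    pattern = m.group("pattern")
    if field is not None:
        # A named field maps onto a term-level `regexp` query.
        return {"regexp": {field: {"value": pattern}}}
    # Without a field name, fall back to a `query_string` query, which
    # accepts /regex/ syntax inside its query text.
    return {"query_string": {"query": f"/{pattern}/"}}


with_field = kql_regex_to_es("user.name: /adm.*/")
without_field = kql_regex_to_es("/adm.*/")
```

A real implementation would live in the KQL grammar rather than in a standalone regular expression, and would also need the migration mentioned above for existing queries that already contain this syntax.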

@nreese nreese added enhancement New value added to drive a business result Feature:KQL KQL Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Sep 27, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-app

@joshuasmith0

I am trying to perform a Kibana KQL search on a text field for any value that doesn't end in $

For instance, when parsing Windows Event Logs for successful/unsuccessful logins, I am trying to not show computer accounts (which end with $).

I have looked at several other questions around this same concept (Regex search where a string field ends with $), but that solution isn't working for me because it uses Lucene, not KQL.

I know that KQL supports wildcards so I was assuming it was going to be a query along the lines of:
not accountName: *$

Full regex support would be helpful in finding these documents.
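
Outside KQL, one way to express the same intent is a `bool` query with a `must_not` `wildcard` clause in the Elasticsearch query DSL. The sketch below only builds the request body; the helper name `exclude_suffix` is hypothetical, and it assumes `accountName` is indexed so that whole values can be matched (for example as a keyword field), since term-level queries operate on indexed terms.

```python
def exclude_suffix(field: str, suffix: str) -> dict:
    """Build a bool query that drops documents whose `field` ends with `suffix`.

    Note: `suffix` is inserted as-is, so `*` or `?` inside it would be
    treated as wildcards. This sketch does not escape them.
    """
    return {
        "bool": {
            "must_not": [
                {"wildcard": {field: {"value": f"*{suffix}"}}}
            ]
        }
    }


# Exclude computer accounts, which end with "$".
query = exclude_suffix("accountName", "$")
```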

@timroes timroes added Team:AppArch and removed Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Feb 20, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-app-arch (Team:AppArch)

@jpountz

jpountz commented Jul 22, 2020

To provide additional background, @randomuserid was just explaining to me that the lack of regexp support in KQL means they need to fall back to the Lucene search syntax whenever they need regexps. Here is an example: https:/elastic/detection-rules/blob/main/rules/linux/privilege_escalation_setgid_bit_set_via_chmod.toml. So there's no urgency to support regexps since we can fall back to Lucene, but it would be better if KQL supported them so that we would no longer need to fall back in such cases.

@rw-access rw-access self-assigned this Jul 22, 2020
@rayafratkina
Contributor

Wondering if we should use the new wildcard field in regexps? In that case #60933 is related

@timroes
Contributor

timroes commented Jul 22, 2020

@rayafratkina the regex query should work on every field that supports it in KQL. That said, a user would do well to use a wildcard field if they know they will need regexp queries a lot. I'm not sure we can do anything reasonable to advertise that from KQL, though, since at that point indexing is already done and it's kind of "too late".

@rw-access
Contributor

rw-access commented Jul 22, 2020

@jpountz do wildcard and regexp perform equivalently when possible? For example, does the wildcard foo*bar*baz perform identically to /foo.*?bar.?baz/?

I think the wildcard functionality of KQL is a little flaky from what I've seen and those currently convert to query_string. Wondering if regex would be preferred if you're avoiding issues with double escaping, etc. that come from wildcards

@jpountz

jpountz commented Jul 23, 2020

@rw-access sorry I'm not sure I get the question. Did you mean /foo.*bar.*baz/ as a regexp? (question marks shouldn't be required as * means "0 or more times", not "1 or more times"). But otherwise yeah, wildcard and regexp queries execute exactly the same way internally: Lucene first converts the expression to an automaton, and then runs the query using this automaton. If a wildcard expression and a regexp translate into the same automaton then they'll match exactly the same documents.
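
The equivalence described above can be illustrated with a small sketch. It uses Python's `re` module as a stand-in for Lucene's automaton (an assumption made purely for demonstration; Lucene's regexp syntax differs from Python's), and the helper names are hypothetical. Because the whole value must match, greedy and non-greedy quantifiers accept exactly the same strings.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Convert a `*`/`?` wildcard pattern to an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return "".join(parts)


def matches(regex: str, value: str) -> bool:
    # fullmatch mirrors the "must match the entire term" behaviour.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


samples = ["foobarbaz", "foo-bar-baz", "foobaz", "xfoobarbaz", "foo1bar22baz"]
greedy = wildcard_to_regex("foo*bar*baz")
lazy = "foo.*?bar.*?baz"

greedy_results = [matches(greedy, s) for s in samples]
lazy_results = [matches(lazy, s) for s in samples]
```

Both result lists are identical, which is the point made above: if two expressions describe the same set of strings, they match the same documents.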

@timroes
Contributor

timroes commented Jul 23, 2020

I wonder if we should first have a discussion here about what we want the syntax of regex queries in KQL to look like. For backwards compatibility reasons we cannot use the Lucene way, field : /someregex/, since that would already have been a valid query beforehand and we can't change the meaning of existing queries. Thus I'd suggest we use a custom operator between the field name and the value. So far we have :, : *, >=, >, <=, < as operators.

Given that we only treat the following characters as special characters, which would need to be escaped in a value: \():<>"*{} we could only use one of those after the : if we want to create an operator of the form :<some character>. All of them already have a meaning, and thus we can't use them. That means we need to use either a combination of the form <some character>: or a completely separate operator, e.g. field ~ regex or field ~: regex. I am not sure whether anyone currently has a preference for how the regex operator should look. I don't have a strong preference, but I think the tilde ~ might be a good choice, since it's commonly used to mean "aroundish", which is what regexes are often used for. Please share your thoughts on what would be a good regex operator.

cc @ppisljar @lukasolson

@rw-access
Contributor

rw-access commented Jul 23, 2020

@jpountz ah, I was using .*? to indicate non-greedy. That's generally how I convert wildcard to regex for EQL, but it looks like that's an optimization for PCRE, not Lucene regex.

@timroes wouldn't that syntax be subject to the same problem? Since ~ isn't a reserved character, it would currently be interpreted as part of the field name.

We've had to make many similar decisions for EQL. One of the guiding principles for changes was that we won't reinterpret syntax that's already valid with new semantics, unless it was truly a bug. For breaking changes or limiting the syntax, we decided that we should still accept the syntax in the grammar, so that we can recognize it and raise an error message. That seems to be a good path forward for us.

I think that means ~: might be out. But we could do something like field : match(/foo.*bar.*baz.*/). It could open the door to more functions, and we could do field : wildcard("foo*bar*baz"). It would still be valid within a list of values joined by or or and.

Or we introduce a new predicate instead of : <list of values>. field LIKE "..." or field RLIKE /foo.*bar.*baz/. We wouldn't need a special character since it's not currently valid syntax.

Thoughts? It's not great, but our options are limited. And I think the feature is desired enough — both by internal Elastic teams and our users — that we might have to pick a syntax that's less than ideal.
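
To make the candidate syntaxes above concrete, here is a rough sketch that recognises each form and maps it to an Elasticsearch `regexp` or `wildcard` query. None of these forms exist in KQL today. The function name `parse_predicate` is hypothetical, and mapping `LIKE` to a wildcard query is an assumption, since the comment leaves its semantics open.

```python
import re

# Candidate surface syntaxes from the comment above; all are hypothetical.
_FORMS = [
    (re.compile(r'^(?P<field>\S+)\s*:\s*match\(/(?P<arg>.*)/\)$'), "regexp"),
    (re.compile(r'^(?P<field>\S+)\s*:\s*wildcard\("(?P<arg>.*)"\)$'), "wildcard"),
    (re.compile(r'^(?P<field>\S+)\s+RLIKE\s+/(?P<arg>.*)/$'), "regexp"),
    (re.compile(r'^(?P<field>\S+)\s+LIKE\s+"(?P<arg>.*)"$'), "wildcard"),
]


def parse_predicate(clause: str) -> dict:
    """Map one predicate in any of the candidate forms to query DSL."""
    clause = clause.strip()
    for form, query_type in _FORMS:
        m = form.match(clause)
        if m:
            return {query_type: {m.group("field"): {"value": m.group("arg")}}}
    # Unknown syntax is rejected loudly instead of being reinterpreted.
    raise ValueError(f"unrecognised predicate: {clause!r}")


as_function = parse_predicate("field : match(/foo.*bar.*baz.*/)")
as_keyword = parse_predicate("field RLIKE /foo.*bar.*baz/")
```

Both the function form and the keyword form carry the same information, so the choice between them is mostly about grammar consistency and room for future extensions.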

@markharwood
Contributor

markharwood commented Jul 23, 2020

There are additional concerns about how to expose the important regex options of case insensitivity. This is done in other engines using /..../i syntax (the i meaning insensitive).

Symptoms of a broader issue - KQL is becoming a bottleneck to putting functionality in users' hands.

As long as KQL is the top-level means for users to assemble clauses with Boolean logic we will have issues :

  • we run out of special syntax characters
  • there's no encapsulation - all details are laid bare in string form
  • illegal syntax is easy to introduce
  • users have to escape everything
  • there are no helpful checkboxes etc. for setting options

With the Sculptor object model as a top-level organiser for Boolean logic :

  • complex clauses like regex can have their own dedicated GUI editor with help text and arbitrary options - KQL parser changes are no longer a bottleneck
  • the things we clicked (aka "Filter pills") can be ORed with the things users typed (KQL). There's no good reason why things-you-click should be assumed to always be ANDed and never OR-able.
  • KQL can still exist as a data-entry form but also be converted to more editable objects (muting, NOTing, expand/collapse, setting advanced options like boosts etc).

@timroes
Contributor

timroes commented Jul 23, 2020

@rw-access As far as I understand the grammar at the moment, the field name cannot have (unquoted) spaces, thus we know that the operator is part of the field name. Maybe Lukas will be the better candidate for talking about that. I know we also experimented for some time with making everything some kind of function in KQL, which would be more in line with your match/wildcard suggestion. Here too I need to refer to @lukasolson for some background information.

Regarding flags, even with a custom operator we could still put the regex in /../ and allow flags that way to support it: field ~ /foo/i.

@rw-access
Contributor

True, but the downside is that ~: becomes sensitive to whitespace. field ~: value is parsed differently from field~: value or field~:value. Generally, it's good to be consistent in whitespace handling across the grammar, so I'd be worried about this edge case causing confusion. Same applies to field~/foo/i. I believe that's currently valid. We use the more compact syntax in a lot of rules right now. For example:

query = '''
event.category:(network or network_traffic) and network.transport:tcp and destination.port:8000 and
  source.ip:(10.0.0.0/8 or 172.16.0.0/12 or 192.168.0.0/16) and
  not destination.ip:(10.0.0.0/8 or 127.0.0.0/8 or 172.16.0.0/12 or 192.168.0.0/16 or "::1")
'''

Agreed for flags. I was thinking about adding /i as well. I didn't see it as a flag on the 7.9 regexp page or the regular expression syntax page, so I didn't know if it was complete yet or not. We could have shorthand for the other flags as well (COMPLEMENT, INTERVAL, INTERSECTION, ANYSTRING), as long as we pick good letters that don't also start with i. Or we could leave that open for the future and use the default flags, with only the case-sensitivity toggle.

@markharwood
Contributor

> I didn't see it as a flag on the 7.9 regexp page or the regular expression syntax page, so I didn't know if it was complete yet or not.

We have a PR flip-flopping on what to do - whether to make API concessions that make KQL easier with extended pattern syntax or stick to more formal APIs with named JSON flags.

@markharwood
Contributor

markharwood commented Jul 23, 2020

I have a preference for formal JSON APIs in elasticsearch with dedicated editors as counterparts in the GUI to simplify.

We could create a formal query JSON syntax for automatons (char_sequence, ORs, nots, repeats etc).
It would be validatable and we wouldn't have the silent failures we experience currently when someone uses a bit of what they think is valid regex syntax (/i \w or whatever) and it isn't supported but interpreted instead as a search for those literal characters.
A more formal JSON syntax, like we offer for spans/interval queries won't get an outing in Kibana though because the assembly tool for all criteria is KQL. There is no graphical assembly of query objects that can be editable with dedicated editors. Only strings and a brittle, cryptic syntax that struggles to separate content from controls.
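
As a rough illustration of the validation idea, the sketch below flags constructs that would otherwise fail silently. The rule list is an assumption: it covers the two examples named in this comment (`/i` and `\w`) plus anchors, which Lucene's regexp syntax does not support because a pattern always has to match the whole term. The function name `lint_regexp` is hypothetical, and the rules would need adjusting for the Lucene version in use.

```python
import re

# (detector, warning) pairs. The list is illustrative, not exhaustive.
_SUSPECT = [
    (re.compile(r"\\[wWdDsS]"), "shorthand character class"),
    (re.compile(r"^\^|(?<!\\)\$$"), "anchor"),
    (re.compile(r"^/.*/[a-z]+$"), "trailing flags"),
]


def lint_regexp(pattern: str) -> list:
    """Return one warning per suspect construct found in `pattern`."""
    return [message for detector, message in _SUSPECT if detector.search(pattern)]


clean = lint_regexp("foo.*bar")
flagged = lint_regexp("/foo/i")
```

A check like this only reports likely mistakes. It does not prove that the engine interprets the rest of the expression the way its author intended.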

@rw-access
Contributor

I think that's a fine point to show how you don't think KQL fits your needs or doesn't solve its problem well. But that discussion might be a little easier to have in a separate issue that's better scoped, and we keep the scope of this issue constrained to adding regex support to KQL. I don't mean that at all to shut you down, but just that we keep those discussions separate, since it's already a little hard to keep track of the two.

@timroes
Contributor

timroes commented Jul 23, 2020

++ @rw-access There are already issues to discuss this. Please continue the discussion in #8112 (Graphical query builder) or #14272 (more control over how filters are added to the filter bar), which are the more appropriate places for discussion around the overall concept of the filter bar. Discussion in this thread should be about regex support in KQL so everyone can keep better track of it.

@markharwood
Contributor

markharwood commented Jul 23, 2020

++ Happy to keep discussion elsewhere - just wanted to flag that regex construction is complex and adding this might mark a tipping point in how much complexity we try to shoehorn into KQL. We hit this wall 15 years ago in Lucene's query syntax.

We have a proposal for an elasticsearch API that you might want to incorporate as an aid to regex authors. It could help validate that the expressions people write are actually understood correctly by Lucene's parser.

@ppisljar
Member

I like the suggestion of using a function, field : match(/foo.*bar.*baz.*/i). It seems a bit less error-prone and also leaves more doors open for the future.

@lukasolson
Member

I would prefer to avoid any functional syntax (like match()) since there isn't any other functional syntax in KQL at the moment.

I would definitely prefer to go with something regex users are already used to (like foo: /bar.*/i) but, like @timroes mentioned, this would break backwards compatibility. But I don't think it'd be too hard to add a migration step that escapes leading forward slashes in the "match" clause.

Adding a completely new operator for regex when we already use : for exact match as well as wildcard matches doesn't seem intuitive to me, but it would also make adding autocomplete for regex queries a bit easier.
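
The migration step mentioned above could look roughly like the following. It is a minimal sketch that assumes a backslash is enough to keep a leading `/` literal in KQL. It works on the raw query string, so it ignores quoted strings and parenthesised value lists, which a real migration would have to handle through the parser. The function name is hypothetical.

```python
import re

# Matches a `:` (plus optional whitespace) immediately followed by `/`,
# i.e. a value that begins with a forward slash.
_LEADING_SLASH = re.compile(r"(:\s*)/")


def escape_leading_slashes(kql: str) -> str:
    """Escape a forward slash that begins a value so it stays a literal."""
    return _LEADING_SLASH.sub(r"\1\\/", kql)


migrated = escape_leading_slashes("url.path: /api/v1")
untouched = escape_leading_slashes("source.ip:10.0.0.0/8")
```

Slashes that appear later in a value, such as the CIDR suffix in `10.0.0.0/8`, are left alone, and running the function twice gives the same result as running it once.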

@markharwood
Contributor

A dependency you'll need to track - Lucene PR to add /../i syntax to Lucene's query parser for case insensitive search

@rw-access rw-access removed their assignment Sep 18, 2020
@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Jun 2, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:large Large Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed loe:small Small Level of Effort impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. labels Nov 22, 2021
@lukasolson lukasolson self-assigned this Jan 4, 2022
@lukasolson lukasolson changed the title Add Regex Support to KQL [KQL] Add regex support Apr 1, 2022
@exalate-issue-sync exalate-issue-sync bot added loe:x-large Extra Large Level of Effort and removed loe:large Large Level of Effort labels Apr 6, 2022
@petrklapka petrklapka added Feature:Search Querying infrastructure in Kibana Team:DataDiscovery Discover App Team (Document Explorer, Saved Search, Surrounding documents, Data, DataViews) and removed Team:AppServicesSv labels Nov 23, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@markniemeijer

+100

@kertal
Member

kertal commented Oct 21, 2024

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.

When using ES|QL in Kibana, it's already possible to make use of regular expressions, e.g. by using RLIKE:
https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-rlike-operator
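
As a small illustration of the RLIKE route, the sketch below builds a request body for the ES|QL query API. The index name `logs`, the field `message`, and the helper name `esql_rlike_body` are placeholders chosen for the example, not names from this issue.

```python
def esql_rlike_body(index: str, field: str, pattern: str, limit: int = 10) -> dict:
    """Build a request body for the ES|QL query API using RLIKE."""
    if '"' in pattern or "\\" in pattern:
        # Quotes and backslashes need extra escaping inside an ES|QL string
        # literal; this sketch sidesteps that by rejecting them.
        raise ValueError("pattern needs escaping that this sketch does not handle")
    query = f'FROM {index} | WHERE {field} RLIKE "{pattern}" | LIMIT {limit}'
    return {"query": query}


body = esql_rlike_body("logs", "message", "foo.*bar.*baz")
```

The resulting body can be sent to the ES|QL `_query` endpoint, or the query string can be pasted into an ES|QL editor in Kibana.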

@kertal kertal closed this as not planned Oct 21, 2024