Support for wildcard fields #60933

Closed
jpountz opened this issue Mar 23, 2020 · 18 comments
Labels
Feature:Data Views Data Views code and UI - index patterns before 8.0 Feature:KQL KQL Feature:New Field Type Add support for an Elasticsearch field type in Kibana impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:medium Medium Level of Effort

Comments

@jpountz

jpountz commented Mar 23, 2020

Elasticsearch has a new wildcard field that mostly behaves as a keyword field but runs wildcard queries more efficiently.

Relates to elastic/elasticsearch#53175 and #35481.

@streamich streamich added Team:AppArch Team:Visualizations Visualization editors, elastic-charts and infrastructure triage_needed labels Mar 24, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-app (Team:KibanaApp)

@elasticmachine
Contributor

Pinging @elastic/kibana-app-arch (Team:AppArch)

@timroes
Contributor

timroes commented Mar 24, 2020

Thanks for creating this. In general, when you state something like "mostly behaves the same", it would be helpful if you could list the differences, since they might have a high impact on whether and how we can solve the issue. Answers to the following questions are especially useful:

  • Does it support all queries exactly the same as a keyword field?
  • Does it support all aggregations exactly the same as a keyword field?
  • Are there any specifics around that field in _source or docvalues?

But in general every API/behavioral difference to the keyword field would be very helpful :-)

@markharwood
Contributor

markharwood commented Mar 25, 2020

I think the differences between the wildcard field and the keyword field come down to:

| Feature | keyword | wildcard |
| --- | --- | --- |
| Sort speeds | Fast | Not quite as fast (*caveat 1) |
| Aggregation speeds | Fast | Not quite as fast (*caveat 1) |
| Prefix query speeds (`foo*`) | Fast | Not quite as fast (*caveat 2) |
| Leading wildcard query speeds on high-cardinality fields (`*foo`) | Terrible | Much faster |
| Term query, full value match (`foo`) | Fast | Not quite as fast (*caveat 2) |
| Fuzzy query | Y (if allow expensive queries enabled) | N |
| Regex query | Y (if allow expensive queries enabled) | N |
| Range query | Y (if allow expensive queries enabled) | N |
| Disk costs for mostly unique values | high | lower |
| Disk costs for mostly identical values | low | medium |
| Max character size for a field value | 256 for default JSON string mappings, 32,766 Lucene max | unlimited |

While @jimczi and @jpountz have thought of this as predominantly a keyword field with wildcard optimisations I think the last feature in this table is important. For large machine-generated content such as:

  1. Our own CI build output
  2. Elasticsearch log files with big stack traces

With values >32k we physically can't use keyword fields due to a Lucene limit, but equally we might not want to treat the content as a text field because:

  1. We don't want to complicate indexing by having to consider which characters like ., /, \ etc are word-separators
  2. We don't want to complicate grep-like searches using wildcards by breaking character sequences along indexed word boundaries and assembling them using bool or interval queries.

In these cases, the answer to the usual "keyword or text?" question is "neither" and wildcard might be a suitable alternative. In this context of handling big-machine-generated values it probably is not a good idea to attempt using it for aggregations or sorting. (What protection should we have for that Jim/Adrien?).
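A minimal sketch of that "neither keyword nor text" case, assuming a hypothetical field named `message`. The request bodies use the `wildcard` mapping type and the `wildcard` query; the size check uses the 32,766 Lucene limit quoted in the comparison above, and the helper name `fits_in_keyword` is illustrative only.

```python
import json

# Lucene maximum term length in bytes, as quoted in the comparison above.
LUCENE_MAX_TERM_BYTES = 32766


def fits_in_keyword(value: str) -> bool:
    """Return True if the UTF-8 encoded value is within the Lucene term limit."""
    return len(value.encode("utf-8")) <= LUCENE_MAX_TERM_BYTES


# Index-creation body mapping the hypothetical `message` field as `wildcard`.
MAPPING = {
    "mappings": {
        "properties": {
            "message": {"type": "wildcard"}
        }
    }
}

# Grep-like search: match the character sequence anywhere in the value,
# with no tokenisation involved.
QUERY = {
    "query": {
        "wildcard": {
            "message": {"value": "*NodeNotConnectedException*"}
        }
    }
}

if __name__ == "__main__":
    # A synthetic stack trace of 68,000 characters, well over the limit.
    stack_trace = "at org.example.Frame(File.java:1)\n" * 2000
    print("fits in keyword:", fits_in_keyword(stack_trace))
    print(json.dumps(MAPPING))
    print(json.dumps(QUERY))
```

The bodies are plain dictionaries so they can be sent with whichever client is in use.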

@timroes
Contributor

timroes commented Mar 27, 2020

Thanks for the detailed comparison. This is really helpful. While it looks nearly the same, there is one thing that will make a difference for Kibana:

  • The lack of range queries for wildcard fields can break KQL. We don't expose the range query on keyword fields in the filter UI, but you can write KQL queries using > and < on keyword fields. Since these don't work for wildcard fields, KQL would need some special handling for them.

I'll remove the KibanaApp label from this, since given the list above there is nothing outside App Arch area that would require additions (assuming that we would still mark this as string type in Kibana, and make the difference in KQL based on the esType stored in the index pattern).
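A rough sketch of what such special handling could look like, written in Python purely for illustration. It assumes the index pattern exposes a list of Elasticsearch types per field; `validate_clause` and `RANGE_OPERATORS` are hypothetical names, not Kibana APIs.

```python
from typing import Iterable

# Comparison operators that translate into an Elasticsearch range query.
RANGE_OPERATORS = {">", ">=", "<", "<="}


def supports_range(es_types: Iterable[str]) -> bool:
    """Range operators are rejected if any backing ES type is `wildcard`."""
    return "wildcard" not in set(es_types)


def validate_clause(field: str, operator: str, es_types: Iterable[str]) -> None:
    """Raise a readable error instead of sending an unsupported range query."""
    if operator in RANGE_OPERATORS and not supports_range(es_types):
        raise ValueError(
            f"Operator '{operator}' is not supported on wildcard field '{field}'"
        )


if __name__ == "__main__":
    validate_clause("host.name", ">", ["keyword"])  # accepted
    try:
        validate_clause("message", ">", ["wildcard"])
    except ValueError as err:
        print(err)
```

Failing early in the query bar gives the user a clearer message than an error returned from the search request.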

@timroes timroes added Feature:New Field Type Add support for an Elasticsearch field type in Kibana Feature:KQL KQL and removed Team:Visualizations Visualization editors, elastic-charts and infrastructure triage_needed labels Mar 27, 2020
@markharwood
Contributor

markharwood commented Mar 27, 2020

This is a long comment, so the TL;DR is: I think it's worth Kibana giving wildcard fields some special treatment in log message analytics.

Wildcards in log message analytics

Whenever I'm helping support diagnose Elasticsearch cluster failures, we have to sift through large log files, and I use Elasticsearch and Kibana to do it. The log messages can be big. Here's the range of logged message sizes from a recent typical case:

[Screenshot: Kibana view of logged message sizes]

These fall beyond what would be useful or possible to map as keyword fields so I index as text (and am still finessing what is a good Analyzer setup for this content).
In an ailing cluster there's a lot of message repetition (albeit with near-duplicates not exact duplicates). Effective investigation relies on identifying the different types of message and either removing them from the clutter or plotting on a timeline to see the sequencing and volume of events e.g.
[Screenshot: Kibana timeline of message types]

Identifying the message type involves copying and pasting parts of the log as a query clause which is where the problems come in. Let's take this example of using a mouse to select the part of a message about a particular failing node - NodeNotConnectedException: [54b_data_2]
[Screenshot: mouse selection of part of a log message in Kibana]
However, this selection will not work as a query and is something I struggle with constantly. With a text field the user has to know about the details of the tokenisation policy of where words end and begin to formulate a query. While the selection can be placed in quotes to ensure multi-words are run as a phrase query, particular attention has to be paid to word beginnings and endings. The NodeNotConnectedException part of the selection cuts a token in half because with my Analyzer dots are retained. So the first word needs to be backed up to org.elasticsearch.transport.NodeNotConnectedException. If a similar token-clipping occurs at the selection end we must add a * to the end of the search string. This is painful.
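The token-clipping problem can be reproduced with a toy model. The sketch below assumes a simplified analyzer that lowercases and keeps dots and underscores inside tokens. It is not the analyzer used on the real logs, and the log line is a made-up example in the same shape.

```python
import re

# Made-up log line and the fragment a user might select with the mouse.
MESSAGE = "org.elasticsearch.transport.NodeNotConnectedException: [54b_data_2] Node not connected"
SELECTION = "NodeNotConnectedException: [54b_data_2]"
BACKED_UP = "org.elasticsearch.transport.NodeNotConnectedException: [54b_data_2]"


def toy_tokens(text: str) -> list:
    """Toy analyzer: lowercase, keep dots and underscores inside tokens."""
    return re.findall(r"[\w.]+", text.lower())


def phrase_match(doc: str, phrase: str) -> bool:
    """True if the phrase tokens appear as consecutive whole tokens in doc."""
    d, p = toy_tokens(doc), toy_tokens(phrase)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))


if __name__ == "__main__":
    # The selection clips the first token, so the phrase query misses.
    print("phrase, raw selection:", phrase_match(MESSAGE, SELECTION))
    # Backing up to the start of the token makes the phrase match.
    print("phrase, backed up:", phrase_match(MESSAGE, BACKED_UP))
    # A plain substring test, which is what a *selection* wildcard expresses.
    print("substring:", SELECTION in MESSAGE)
```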

With the wildcard field these sorts of selections could be handled simply - the user selection is wrapped with asterisks and it matches in a predictable way without the searcher or the elastic db admin having to consider tokenisation policies. It does make me wonder how KQL or filter bars may organise these selections (KQL may be clunky if the copy/pasted values contain special chars and filter pills aren't easily ORed).
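Wrapping a pasted selection might look like the sketch below. It assumes the wildcard query treats `*` and `?` as wildcards and a backslash as the escape character, as Lucene's wildcard syntax does; the function names and the `message` field are hypothetical.

```python
def selection_to_wildcard(selection: str) -> str:
    """Escape wildcard metacharacters in the selection, then wrap it in asterisks."""
    escaped = (
        selection.replace("\\", "\\\\")  # escape the escape character first
        .replace("*", "\\*")
        .replace("?", "\\?")
    )
    return f"*{escaped}*"


def build_query(field: str, selection: str) -> dict:
    """Build a wildcard query body for a copy/pasted log fragment."""
    return {
        "query": {
            "wildcard": {field: {"value": selection_to_wildcard(selection)}}
        }
    }


if __name__ == "__main__":
    print(build_query("message", "NodeNotConnectedException: [54b_data_2]"))
```

Escaping the backslash before the other characters avoids double-escaping the backslashes that the later replacements introduce.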

I see little or no use for sorting or aggregations on a log message field like this so I wonder if we should have the option to disable that particular wildcard field behaviour either at the elasticsearch level or the kibana level.

Maybe we need to think of the "wildcard-on-big-log-messages" and "wildcard-on-shorter-keyword-like fields" as two distinct use cases in Kibana/elasticsearch?

@markharwood
Contributor

Related - a regex debugger would be very useful: #66735

@jpountz
Author

jpountz commented Jul 22, 2020

@markharwood can this be closed now that wildcard fields pretend to be keyword fields in the _field_caps API? I'm expecting Kibana support for wildcard to come for free?

@markharwood
Contributor

markharwood commented Jul 22, 2020

> can this be closed now that wildcard fields pretend to be keyword fields

I still have a suspicion large wildcard fields shouldn't be included in Kibana's drop-down lists for sorting or aggs along with the "proper" keyword fields. Admins and users alike will be frustrated by the circuit-breaker exceptions these would cause.

We know wildcard will be useful on large fields and we removed any "ignore_above" limits for them. I just can't see large fields making sense for sorting or aggs. Not sure how Kibana adds protection for that.

@webmat

webmat commented Jul 22, 2020

I was just now discussing how I expect we'll want to use wildcard for fields such as error.stack_trace... So I agree some problems could be lurking if users try to do aggregations on those

@jpountz
Author

jpountz commented Jul 23, 2020

@markharwood I'm seeing this as an orthogonal issue that shouldn't be Kibana's concern, but Elasticsearch's: if a field shouldn't be aggregated via Kibana, then it shouldn't be reported as aggregatable in _field_caps. So I'd suggest closing this issue and raising the question of how Elasticsearch should report large wildcard/keyword values such as stack traces.
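To make the proposal concrete, here is a small sketch of a client that trusts the `aggregatable` flag in a `_field_caps` response. The field names and the sample response are invented for illustration; only the general `fields -> type -> capabilities` shape of the response is assumed.

```python
def aggregatable_fields(field_caps: dict) -> list:
    """Return the fields reported as aggregatable for every type they map to."""
    result = []
    for name, by_type in field_caps.get("fields", {}).items():
        if by_type and all(caps.get("aggregatable", False) for caps in by_type.values()):
            result.append(name)
    return sorted(result)


# Invented sample. A wildcard-mapped field such as error.stack_trace would be
# reported with type "keyword", so it is indistinguishable here.
SAMPLE_FIELD_CAPS = {
    "indices": ["logs-000001"],
    "fields": {
        "host.name": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": True}
        },
        "message": {
            "text": {"type": "text", "searchable": True, "aggregatable": False}
        },
        "error.stack_trace": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": True}
        },
    },
}

if __name__ == "__main__":
    print(aggregatable_fields(SAMPLE_FIELD_CAPS))
```

With this approach, keeping a large field out of the aggregation drop-downs only requires Elasticsearch to report `aggregatable: false` for it.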

@markharwood
Contributor

markharwood commented Jul 23, 2020

> If a field shouldn't be aggregated via Kibana, then it shouldn't be reported as aggregatable in _field_caps

Good point.
I'll open an elasticsearch issue.

I'm not convinced there's nothing left to be thought about in Kibana-land.
For example - if they support a *foo* style query in the KQL bar and assume, like normal whole-term based queries, that can be run across multiple fields then it may result in slow results or timeouts. Wildcard fields will be fast but hitting other fields which are keyword will involve an expensive linear scan. They might want to think about how to manage those inequalities with these expensive queries.
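One way to reason about that inequality is to split the target fields by expected cost before running a leading-wildcard pattern. The sketch assumes the caller knows each field's true mapping type, which `_field_caps` does not reveal for wildcard fields; `split_by_cost` and the field names are hypothetical.

```python
def split_by_cost(pattern: str, field_types: dict) -> tuple:
    """Split fields into (cheap, costly) for the given wildcard pattern.

    Only leading-wildcard patterns are treated as a problem: wildcard-typed
    fields handle them efficiently, other fields need an expensive scan.
    """
    if not pattern.startswith("*"):
        return sorted(field_types), []
    cheap = sorted(f for f, t in field_types.items() if t == "wildcard")
    costly = sorted(f for f, t in field_types.items() if t != "wildcard")
    return cheap, costly


if __name__ == "__main__":
    # Hypothetical true mapping types for three fields.
    fields = {"message": "wildcard", "host.name": "keyword", "user.id": "keyword"}
    cheap, costly = split_by_cost("*foo*", fields)
    print("cheap:", cheap)
    print("costly:", costly)
```

A query bar could then warn about, or skip, the costly fields instead of timing out.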

@jpountz
Author

jpountz commented Jul 23, 2020

As wildcard fields can't be distinguished from keyword fields from Kibana, I think that this one should be a question for Elasticsearch too?

@markharwood
Contributor

> As wildcard fields can't be distinguished from keyword fields from Kibana, I think that this one should be a question for Elasticsearch too?

That sounds like adding a different field-expansion list for wildcard/regex queries than the existing general-purpose one?
Might be some BWC things to consider with any change there.

As for the aggregatable Y/N question, there are two options:

  1. static - @colings86 and I discussed adding a possible wildcard_text type to signal the supported use cases
  2. dynamic - the es admin can disable aggs using a field caps change.

With 2) there are also questions about how Kibana would pick up a change in elasticsearch field_caps if we make that dynamic. Maybe that's just a manual index-pattern refresh in Kibana.
Do we already have an issue for making field_caps dynamic?

@jpountz
Author

jpountz commented Jul 23, 2020

No I don't. For the record, it might also be ok to not do anything and rely on circuit breakers to abort aggs on stack traces.

@markharwood
Contributor

> it might also be ok to not do anything and rely on circuit breakers to abort aggs on stack traces.

I think that was Jim's working assumption - the question is whether users and admins are going to be happy with that.

@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Jun 2, 2021
@mattkime mattkime added the Feature:Data Views Data Views code and UI - index patterns before 8.0 label Oct 13, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:medium Medium Level of Effort and removed loe:small Small Level of Effort labels Nov 19, 2021
@rayafratkina
Contributor

@mattkime @petrklapka is this closed by mistake or actually confirmed to be working?

@mattkime
Contributor

mattkime commented Dec 6, 2021

@rayafratkina Thanks for bringing this to my attention as I should leave some notes -

wildcard fields have been supported as keyword fields since the field caps api started reporting them as such - elastic/elasticsearch#53175

For more refined handling of these fields we'll need a method of identifying them as their true type - #120284


8 participants