Support for `wildcard` fields #60933

jpountz · 2020-03-23T16:37:23Z

Elasticsearch has a new wildcard field that mostly behaves as a keyword field but runs wildcard queries more efficiently.

Relates to elastic/elasticsearch#53175 and #35481.


elasticmachine · 2020-03-24T11:24:52Z

Pinging @elastic/kibana-app (Team:KibanaApp)

elasticmachine · 2020-03-24T11:24:53Z

Pinging @elastic/kibana-app-arch (Team:AppArch)

timroes · 2020-03-24T16:22:55Z

Thanks for creating this. In general, when you state something like "mostly behaves the same", it would be helpful if you could list the differences, since they might have a high impact on whether and how we can solve this issue. Answers to the following questions would be especially useful:

Does it support all queries exactly the same as a keyword field?
Does it support all aggregations exactly the same as a keyword field?
Are there any specifics around that field in _source or docvalues?

But in general every API/behavioral difference to the keyword field would be very helpful :-)

markharwood · 2020-03-25T11:11:29Z

I think the differences between the wildcard field and the keyword field come down to:

| Feature | keyword | wildcard |
| --- | --- | --- |
| Sort by speeds | Fast | Not quite as fast (*caveat 1) |
| Aggregate speeds | Fast | Not quite as fast (*caveat 1) |
| Prefix query speeds (`foo*`) | Fast | Not quite as fast (*caveat 2) |
| Leading wildcard query speeds on high-cardinality fields (`*foo`) | Terrible | Much faster |
| Term query, full value match (`foo`) | Fast | Not quite as fast (*caveat 2) |
| Fuzzy query | Y (if allow expensive queries enabled) | N |
| Regex query | Y (if allow expensive queries enabled) | N |
| Range query | Y (if allow expensive queries enabled) | N |
| Disk costs for mostly unique values | high | lower |
| Disk costs for mostly identical values | low | medium |
| Max character size for a field value | 256 for default JSON string mappings, 32,766 Lucene max | unlimited |

While @jimczi and @jpountz have thought of this as predominantly a keyword field with wildcard optimisations, I think the last feature in this table is important. It matters for large machine-generated content such as:

Our own CI build output
Elasticsearch log files with big stack traces

With values >32k we physically can't use keyword fields due to a Lucene limit but equally we might not want to treat the content as a text field because

We don't want to complicate indexing by having to consider which characters like ., /, \ etc are word-separators
We don't want to complicate grep-like searches using wildcards by breaking character sequences along indexed word boundaries and assembling them using bool or interval queries.

In these cases, the answer to the usual "keyword or text?" question is "neither", and wildcard might be a suitable alternative. In this context of handling big machine-generated values, it is probably not a good idea to attempt using it for aggregations or sorting. (What protection should we have for that, Jim/Adrien?)
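For readers unfamiliar with the field type, here is a minimal sketch of what a `wildcard` mapping and a leading-wildcard query could look like, written as Python dicts that serialise to the JSON request bodies. The field name `message` and the search value are assumptions made for illustration; they do not come from this thread.

```python
import json

# Hypothetical field name, chosen only for this example.
FIELD = "message"

# Index mapping body that declares the field with the `wildcard` type.
mapping = {"mappings": {"properties": {FIELD: {"type": "wildcard"}}}}

# Search body with a leading wildcard, the "*foo" case from the table above.
query = {"query": {"wildcard": {FIELD: {"value": "*foo"}}}}

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

The same `wildcard` query body also works against a keyword field; per the table, the difference is how fast it runs.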

timroes · 2020-03-27T08:53:46Z

Thanks for the detailed comparison. This is really helpful. While it looks nearly the same, there is one thing that will make a difference for Kibana:

The lack of range queries for wildcard fields can break KQL. We don't expose the range query on keyword fields in the filter UI, but you can write KQL queries using > and < on keyword fields. Since those don't work for wildcard fields, KQL would need some special handling for them.

I'll remove the KibanaApp label from this, since given the list above there is nothing outside App Arch area that would require additions (assuming that we would still mark this as string type in Kibana, and make the difference in KQL based on the esType stored in the index pattern).
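To make the KQL concern concrete, the following is a rough sketch of the kind of special handling described above, assuming the field's esType is known to the query builder. The function name, its parameters, and the choice to raise an error are all hypothetical; this is not Kibana's actual KQL implementation.

```python
# Maps KQL comparison operators to Elasticsearch range query keys.
KQL_TO_RANGE_OP = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def kql_range_clause(field, es_types, operator, value):
    """Build a range clause, refusing fields whose esTypes include wildcard.

    `es_types` stands in for the esType list stored in the index pattern
    (an assumption for this sketch).
    """
    if "wildcard" in es_types:
        raise ValueError(
            f"range operator {operator!r} is not supported on wildcard field {field!r}"
        )
    return {"range": {field: {KQL_TO_RANGE_OP[operator]: value}}}


# A keyword field yields a normal range clause.
print(kql_range_clause("host.name", ["keyword"], ">", "m"))
```

Raising an error is only one possible policy; rewriting or silently skipping the clause would be alternatives.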

markharwood · 2020-03-27T10:32:28Z

This is a long comment so the "TL/DR" is I think it's worth Kibana giving wildcard fields some special treatment in log message analytics.

Wildcards in log message analytics

Whenever I'm helping support diagnose Elasticsearch cluster failures, we have to sift through large log files, and I use Elasticsearch and Kibana to do it. The log messages can be big. Here's the range of logged message sizes from a recent typical case:

These fall beyond what would be useful or possible to map as keyword fields so I index as text (and am still finessing what is a good Analyzer setup for this content).
In an ailing cluster there's a lot of message repetition (albeit with near-duplicates not exact duplicates). Effective investigation relies on identifying the different types of message and either removing them from the clutter or plotting on a timeline to see the sequencing and volume of events e.g.

Identifying the message type involves copying and pasting parts of the log as a query clause which is where the problems come in. Let's take this example of using a mouse to select the part of a message about a particular failing node - NodeNotConnectedException: [54b_data_2]

However, this selection will not work as a query and is something I struggle with constantly. With a text field the user has to know about the details of the tokenisation policy of where words end and begin to formulate a query. While the selection can be placed in quotes to ensure multi-words are run as a phrase query, particular attention has to be paid to word beginnings and endings. The NodeNotConnectedException part of the selection cuts a token in half because with my Analyzer dots are retained. So the first word needs to be backed up to org.elasticsearch.transport.NodeNotConnectedException. If a similar token-clipping occurs at the selection end we must add a * to the end of the search string. This is painful.

With the wildcard field these sorts of selections could be handled simply - the user selection is wrapped with asterisks and it matches in a predictable way without the searcher or the elastic db admin having to consider tokenisation policies. It does make me wonder how KQL or filter bars may organise these selections (KQL may be clunky if the copy/pasted values contain special chars and filter pills aren't easily ORed).
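The "wrap the selection with asterisks" idea could be sketched roughly as follows. The helper name and the field name `message` are hypothetical. The escaping step assumes that `*` and `?` are the wildcard operators and that a backslash escapes them, so a pasted selection containing those characters is matched literally.

```python
def selection_to_wildcard_query(field, selection):
    """Turn a copy/pasted log fragment into a contains-style wildcard query."""
    # Escape the backslash itself plus the two wildcard operators.
    escaped = "".join("\\" + ch if ch in "\\*?" else ch for ch in selection)
    return {"query": {"wildcard": {field: {"value": f"*{escaped}*"}}}}


# The selection from the example above, used as-is without worrying about tokens.
body = selection_to_wildcard_query("message", "NodeNotConnectedException: [54b_data_2]")
print(body["query"]["wildcard"]["message"]["value"])
```

No knowledge of the analyzer or of token boundaries is needed to build this query, which is the point being made above.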

I see little or no use for sorting or aggregations on a log message field like this so I wonder if we should have the option to disable that particular wildcard field behaviour either at the elasticsearch level or the kibana level.

Maybe we need to think of the "wildcard-on-big-log-messages" and "wildcard-on-shorter-keyword-like fields" as two distinct use cases in Kibana/elasticsearch?

markharwood · 2020-05-15T14:20:02Z

Related - a regex debugger would be very useful: #66735

jpountz · 2020-07-22T18:03:21Z

@markharwood can this be closed now that wildcard fields pretend to be keyword fields in the _field_caps API? I'm expecting Kibana support for wildcard to come for free?

markharwood · 2020-07-22T18:27:18Z

> can this be closed now that wildcard fields pretend to be keyword fields

I still have a suspicion large wildcard fields shouldn't be included in Kibana's drop-down lists for sorting or aggs along with the "proper" keyword fields. Admins and users alike will be frustrated by the circuit-breaker exceptions these would cause.

We know wildcard will be useful on large fields and we removed any "ignore_above" limits for them. I just can't see large fields making sense for sorting or aggs. Not sure how Kibana adds protection for that.

webmat · 2020-07-22T19:31:41Z

I was just now discussing how I expect we'll want to use wildcard for fields such as error.stack_trace. So I agree some problems could be lurking if users try to do aggregations on those.

jpountz · 2020-07-23T09:41:08Z

@markharwood I'm seeing this as an orthogonal issue that should be Elasticsearch's concern rather than Kibana's. If a field shouldn't be aggregated via Kibana, then it shouldn't be reported as aggregatable in _field_caps. So I'd suggest closing this issue and raising the question of how Elasticsearch should report large wildcard/keyword values such as stack traces.
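As an illustration of that suggestion, here is a small sketch of how a client could pick aggregatable fields purely from a `_field_caps` response. The sample response is made up for this example: the index name, the field names, and the `aggregatable: false` value for the stack trace field are assumptions showing what the proposal would look like, not what Elasticsearch reports today.

```python
def aggregatable_fields(field_caps_response):
    """Return names of fields whose every reported type is aggregatable."""
    result = []
    for name, by_type in field_caps_response.get("fields", {}).items():
        if by_type and all(caps.get("aggregatable", False) for caps in by_type.values()):
            result.append(name)
    return sorted(result)


# Illustrative response only; values are assumptions, not real output.
sample = {
    "indices": ["logs-1"],
    "fields": {
        "host.name": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": True}
        },
        "error.stack_trace": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": False}
        },
    },
}

print(aggregatable_fields(sample))
```

With this approach the client needs no knowledge of the wildcard type at all; it simply trusts the `aggregatable` flag.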

markharwood · 2020-07-23T10:48:00Z

> If a field shouldn't be aggregated via Kibana, then it shouldn't be reported as aggregatable in _field_caps

Good point.
I'll open an elasticsearch issue.

I'm not convinced there's nothing left to be thought about in Kibana-land.
For example - if they support a *foo* style query in the KQL bar and assume, like normal whole-term based queries, that can be run across multiple fields then it may result in slow results or timeouts. Wildcard fields will be fast but hitting other fields which are keyword will involve an expensive linear scan. They might want to think about how to manage those inequalities with these expensive queries.

jpountz · 2020-07-23T11:12:27Z

As wildcard fields can't be distinguished from keyword fields from Kibana, I think that this one should be a question for Elasticsearch too?

markharwood · 2020-07-23T11:28:02Z

> As wildcard fields can't be distinguished from keyword fields from Kibana, I think that this one should be a question for Elasticsearch too?

That sounds like adding a different field-expansion list for wildcard/regex queries than the existing general-purpose one?
Might be some BWC things to consider with any change there.

As for the aggregatable Y/N question, there are two options:

1. static - @colings86 and I discussed adding a possible wildcard_text type to signal the supported use cases
2. dynamic - the es admin can disable aggs using a field caps change.

With option 2 there are also questions about how Kibana might pick up a change in elasticsearch field_caps if we make that dynamic. Maybe that's just a manual index-pattern refresh in Kibana.
Do we already have an issue for making field_caps dynamic?

jpountz · 2020-07-23T11:34:34Z

No I don't. For the record, it might also be ok to not do anything and rely on circuit breakers to abort aggs on stack traces.

markharwood · 2020-07-23T14:08:32Z

> it might also be ok to not do anything and rely on circuit breakers to abort aggs on stack traces.

I think that was Jim's working assumption - the question is whether users and admins are going to be happy with that.

rayafratkina · 2021-12-06T14:27:31Z

@mattkime @petrklapka is this closed by mistake or actually confirmed to be working?

mattkime · 2021-12-06T15:44:00Z

@rayafratkina Thanks for bringing this to my attention as I should leave some notes -

wildcard fields have been supported as keyword fields since the field caps api started reporting them as such - elastic/elasticsearch#53175

For more refined handling of these fields we'll need a method of identifying them as their true type - #120284

streamich added Team:AppArch Team:Visualizations Visualization editors, elastic-charts and infrastructure triage_needed labels Mar 24, 2020

timroes added Feature:New Field Type Add support for an Elasticsearch field type in Kibana Feature:KQL KQL and removed Team:Visualizations Visualization editors, elastic-charts and infrastructure triage_needed labels Mar 27, 2020

neptunian mentioned this issue Mar 27, 2020

[Fleet] Support the new wildcard field in index templates and kibana index patterns #61680

Open

timroes mentioned this issue May 12, 2020

Collapse field types into "families" of field types in _field_caps. elastic/elasticsearch#53175

Closed

rayafratkina mentioned this issue Jul 22, 2020

[KQL] Add regex support #46855

Closed

wylieconlon mentioned this issue Oct 14, 2020

[KQL] Should wildcard queries default to case-insensitive search? #80591

Closed

exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Jun 2, 2021

mattkime added the Feature:Data Views Data Views code and UI - index patterns before 8.0 label Oct 13, 2021

exalate-issue-sync bot added loe:medium Medium Level of Effort and removed loe:small Small Level of Effort labels Nov 19, 2021

exalate-issue-sync bot closed this as completed Dec 5, 2021

exalate-issue-sync bot reopened this Apr 19, 2022

exalate-issue-sync bot closed this as completed Apr 19, 2022
