Making all *.name fields be multi-field #2118

P1llus · 2022-11-29T10:31:36Z

Similar to #2047, it would be nice if certain heavily reused fields were a bit more consistent.

Looking at *.name, which makes up a large part of ECS, the fields are split roughly 50/50 between keyword-only and multi-field mappings. Would it be possible to make them all one or the other?
While trying to make dynamic templates match ECS fields, I found this specific field name to be one of the biggest inconsistencies.
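
For illustration, if every *.name field followed the multi-field pattern, a single dynamic template could cover them all. The request below is a minimal sketch of that idea, not something taken from the ECS templates: the template name, index pattern, `ignore_above` value, and `text` subfield name are all assumptions.

```console
# Hypothetical sketch: the template name, index pattern, ignore_above value,
# and "text" subfield name are assumptions, not taken from ECS.
PUT /_index_template/name-fields-sketch
{
  "index_patterns": ["sketch-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "ecs_name_fields": {
            "path_match": "*.name",
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 1024,
              "fields": {
                "text": { "type": "text" }
              }
            }
          }
        }
      ]
    }
  }
}
```

With a roughly 50/50 split, a rule like this would need a separate exception for every *.name field that does not follow it, which is where the inconsistency hurts.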

dainperkins · 2023-01-09T14:42:51Z

FWIW, I think any normalization like this is worth the effort. I literally just ran into this issue for user.name.

ebeahan · 2023-01-17T21:23:55Z

Agreed there are benefits to better semantic consistency. Some patterns in naming conventions have emerged over time, but not all patterns are consistent across field sets.

Two challenges I see:

1. Each multi-field means additional indexing load and increased storage. I wouldn't expect the impact to be substantial, but it would likely still be noticeable.
2. Whether adding a multi-field indexed as type text will be useful for every case. With the standard analyzer, tokenization won't work as expected for some values*.

*Tokenization examples

A user querying for hosts matching siem.estc.dev would be successful. However, estc.dev wouldn't return any results, since the standard analyzer doesn't split on the dots and keeps siem.estc.dev as a single token.

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "security-linux-1.siem.estc.dev"
}
```

```json
{
  "tokens": [
    {
      "token": "security",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "linux",
      "start_offset": 9,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "1",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "siem.estc.dev",
      "start_offset": 17,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```
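
To make the consequence concrete, here is a sketch of the two searches described above. It assumes a hypothetical index named `hosts-sketch` whose host.name field has a `text` subfield analyzed with the standard analyzer. Neither the index name nor the subfield comes from ECS.

```console
# Hypothetical index and subfield names, for illustration only.

# The query text analyzes to the single token "siem.estc.dev",
# which is also a token of the indexed value, so this can match.
GET /hosts-sketch/_search
{
  "query": {
    "match": { "host.name.text": "siem.estc.dev" }
  }
}

# The query text analyzes to the single token "estc.dev", which was never
# indexed as a token of its own, so this does not match that host.
GET /hosts-sketch/_search
{
  "query": {
    "match": { "host.name.text": "estc.dev" }
  }
}
```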

Another case is a simple machine name, workstation1. A user may query for workstation, but the only token is the entire string, workstation1, so the query finds nothing.

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "workstation1"
}
```

```json
{
  "tokens": [
    {
      "token": "workstation1",
      "start_offset": 0,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
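
For comparison, a prefix query against the keyword value does not depend on tokenization at all. This is a sketch under stated assumptions (a hypothetical index `hosts-sketch` with host.name mapped as keyword), not a recommendation made in this thread.

```console
# Hypothetical index name; host.name is assumed to be mapped as keyword.
# A prefix query compares against the whole stored term, so "workstation"
# matches "workstation1" without any text subfield.
GET /hosts-sketch/_search
{
  "query": {
    "prefix": { "host.name": { "value": "workstation" } }
  }
}
```

Prefix queries on keyword fields are case-sensitive by default and can cost more than exact term lookups on large indices, so this is a trade-off rather than a free replacement for a text subfield.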

P1llus · 2023-01-26T09:06:28Z

I mean we could go the other way around as well @ebeahan? And just have all as keywords?

As long as they are consistent, it's fine. One or two fields breaking the rule is not that big of a deal, but when it's more of a 50/50 distribution it feels a bit bad, and it's harder to predict what types the fields should be.

P1llus added the enhancement and discuss labels on Nov 29, 2022

This was referenced Sep 2, 2024

Resolve discrepancy between text subfield handling for *.name fields in ecs@mappings #2353

Open

Define how semantic convention fields should be mapped #2375

Open
