Making all *.name fields be multi-field #2118

Open
P1llus opened this issue Nov 29, 2022 · 3 comments
Labels
discuss enhancement New feature or request

Comments

@P1llus
Member

P1llus commented Nov 29, 2022

Similar to #2047, it would be nice if certain heavily reused fields were more consistent.

Looking at *.name, which is a large part of ECS, there is roughly a 50/50 split between keyword-only and multi-field mappings. Would it be possible to make them all one or the other?
While trying to make dynamic templates match ECS fields, one of the biggest inconsistencies is this specific field name.
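
For illustration, this is roughly the kind of dynamic template a consistent *.name mapping would allow. It is a minimal sketch, not the actual ECS definition: the template name `ecs_name_fields`, the `text` sub-field name, the `text` type, and the `ignore_above` value are assumptions chosen for the example. Only `path_match`, `match_mapping_type`, and `mapping` are the standard Elasticsearch dynamic template keys.

```python
import json


def name_multifield_template(text_subfield: str = "text") -> dict:
    """Build one dynamic template that maps every *.name string field
    as keyword plus a text multi-field (hypothetical names and values)."""
    return {
        "ecs_name_fields": {  # template name is an assumption
            "path_match": "*.name",  # matches user.name, host.name, ...
            "match_mapping_type": "string",
            "mapping": {
                "type": "keyword",
                "ignore_above": 1024,  # assumed value for the example
                "fields": {text_subfield: {"type": "text"}},
            },
        }
    }


# Mappings body that could be placed in an index template.
mappings = {"dynamic_templates": [name_multifield_template()]}

if __name__ == "__main__":
    print(json.dumps(mappings, indent=2))
```

A single rule like this only works if every *.name field is meant to have the same shape. With the current mix, each keyword-only field would need its own exception.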

@P1llus P1llus added enhancement New feature or request discuss labels Nov 29, 2022
@dainperkins
Contributor

FWIW, I think any normalization like this is worth the effort. Literally just ran into this issue for user.name

@ebeahan
Member

ebeahan commented Jan 17, 2023

Agreed there are benefits to better semantic consistency. Some patterns in naming conventions have emerged over time, but not all patterns are consistent across field sets.

Two challenges I see:

  1. Each multi-field means additional indexing load and increased storage. I wouldn't expect the impact to be substantial, but it would likely still be noticeable.
  2. Whether adding a multi-field indexed as type text will be useful in every case. With the standard analyzer, tokenization won't work as expected for some values*.
*Tokenization examples

A user querying for hosts matching siem.estc.dev would be successful. However, a query for estc.dev wouldn't return any results: the standard analyzer doesn't split on the dots between letters, so siem.estc.dev is indexed as a single token and estc.dev never appears on its own.

POST /_analyze
{
  "analyzer": "standard",
  "text": "security-linux-1.siem.estc.dev"
}

{
  "tokens": [
    {
      "token": "security",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "linux",
      "start_offset": 9,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "1",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "siem.estc.dev",
      "start_offset": 17,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Another case is a simple machine name, workstation1. A user may query for workstation, but the only token is the entire string: workstation1.

POST /_analyze
{
  "analyzer": "standard",
  "text": "workstation1"
}

{
  "tokens": [
    {
      "token": "workstation1",
      "start_offset": 0,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
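
To replay the two cases above without a cluster, here is a rough Python sketch. `approx_standard_tokens` is a hypothetical, heavily simplified stand-in for the standard analyzer: it is ASCII-only and models just one rule, that a dot stays inside a token when it sits between two letters or between two digits. The `_analyze` output above remains the reference. `match_hits` assumes a match query with the default OR operator, which hits when any query token equals an indexed token.

```python
def approx_standard_tokens(text: str) -> list[str]:
    """Simplified approximation of the standard analyzer (an assumption,
    not the real implementation): lowercase, split on non-alphanumerics,
    but keep '.' between two letters or between two digits."""
    text = text.lower()
    tokens: list[str] = []
    current = ""
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch.isalnum():
            current += ch
        elif ch == "." and current and (
            (current[-1].isalpha() and nxt.isalpha())
            or (current[-1].isdigit() and nxt.isdigit())
        ):
            current += ch  # dot is kept inside the token
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


def match_hits(query: str, indexed_value: str) -> bool:
    """True if any query token equals an indexed token (OR semantics)."""
    indexed = set(approx_standard_tokens(indexed_value))
    return any(tok in indexed for tok in approx_standard_tokens(query))


if __name__ == "__main__":
    host = "security-linux-1.siem.estc.dev"
    print(approx_standard_tokens(host))
    print(match_hits("siem.estc.dev", host))  # full domain suffix
    print(match_hits("estc.dev", host))       # partial suffix
    print(match_hits("workstation", "workstation1"))
```

Under these assumptions the sketch reproduces the four tokens shown for the first host name, and both the estc.dev and workstation queries come back without a hit.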

@P1llus
Member Author

P1llus commented Jan 26, 2023

I mean, we could go the other way around as well, @ebeahan, and just make them all keywords?

As long as they are consistent, it's fine. One or two fields breaking the rule is not that big of a deal, but when it's closer to a 50/50 split it feels a bit bad, and it's harder to predict what types the fields should be.


3 participants