
[Feature Request] Enable ECS log detection by default using Custom Log integration #1332

Closed
Tracked by #1454
Mpdreamz opened this issue Jul 19, 2021 · 25 comments
Labels
enhancement New feature or request Stalled Team:Integrations Label for the Integrations team

Comments

@Mpdreamz
Member

The custom log integration currently does not ship with any default ingest pipeline.

In order to improve our ECS logging onboarding experience, we should make sure this integration ships with an ingest pipeline that auto-detects ECS.

@felixbarny created a POC for such a default pipeline. It relies on a new option on the dot_expander processor that makes sure fields are merged correctly.

Long term we can move this custom pipeline to a single ECS processor in Elasticsearch itself.

Click here to see the POC ingest pipeline
PUT _ingest/pipeline/logs-ecs-json
{
  "processors": [
    {
      "rename": {
        "field": "message",
        "target_field": "_ecs_json_message",
        "ignore_missing": true,
        "if": "ctx.message.startsWith('{') && ctx.message.endsWith('}') && ctx.message.contains('\"@timestamp\"') && ctx.message.contains('\"ecs') && ctx.message.contains('version\"')"
      }
    },
    {
      "json": {
        "field": "_ecs_json_message",
        "add_to_root": true,
        "add_to_root_conflict_strategy": "merge",
        "allow_duplicate_keys": true,
        "if": "ctx.containsKey('_ecs_json_message')",
        "on_failure": [
          {
            "rename": {
              "field": "_ecs_json_message",
              "target_field": "message",
              "ignore_missing": true
            }
          },
          {
            "set": {
              "field": "error.message",
              "value": "Error while parsing JSON",
              "override": false
            }
          }
        ]
      }
    },
    {
      "remove": {
        "field": "_ecs_json_message",
        "ignore_missing": true
      }
    },
    {
      "dot_expander": {
        "field": "*",
        "override": true
      }
    },
    {
      "set": {
        "field": "data_stream.dataset",
        "copy_from": "event.dataset",
        "override": false
      }
    },
    {
      "script": {
        "source": "ctx.data_stream.dataset = /[\\/*?\"<>|, #:-]/.matcher(ctx.data_stream.dataset).replaceAll('_')",
        "if": "ctx.data_stream?.dataset != null"
      }
    },
    {
      "script": {
        "source": "ctx.data_stream.namespace = /[\\/*?\"<>|, #:]/.matcher(ctx.data_stream.namespace).replaceAll('_')",
        "if": "ctx.data_stream?.namespace != null"
      }
    },
    {
      "set": {
        "field": "data_stream.type",
        "value": "logs",
        "override": false
      }
    },
    {
      "set": {
        "field": "data_stream.dataset",
        "value": "generic",
        "override": false
      }
    },
    {
      "set": {
        "field": "data_stream.namespace",
        "value": "default",
        "override": false
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "copy_from": "data_stream.dataset",
        "override": true
      }
    },
    {
      "set": {
        "field": "_index",
        "value": "logs-{{{data_stream.dataset}}}-{{{data_stream.namespace}}}"
      }
    }
  ]
}
@Mpdreamz Mpdreamz added enhancement New feature or request Team:Integrations Label for the Integrations team labels Jul 19, 2021
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@ruflin
Member

ruflin commented Jul 21, 2021

The logs integration does not ship with a template either today; it just uses the mapping provided by the logs-*-* template. As the data stream name is user defined, we need to figure out how to load a template for this specific log integration instance and set the pipeline accordingly. This would likely also require work on the Fleet side (@jen-huang ).

@Mpdreamz
Member Author

Mpdreamz commented Aug 4, 2021

As the data stream name is user defined, we need to figure out how to load a template for this specific log integration instance and set the pipeline accordingly.

@ruflin can you expand on this? Other integrations can add pipelines directly in the package. Are those able to do so because the data stream name is known ahead of time?

Do packages create a template for:

{type}-{dataset}-*

Where {namespace} is potentially user defined but will always share the template?

@ruflin
Member

ruflin commented Aug 5, 2021

Exactly, all the other packages have specific templates for {type}-{dataset}-*. But as {dataset} is configured by the user for the logs integration, that is not the case here.

@felixbarny
Member

I gave this a bit more thought and have some ideas on how to move forward:

  • We could create an ECS logging integration that's similar to the Custom logs integration but comes with defaults specific to ECS JSON log files
  • This integration creates a dedicated ECS logging data stream, such as logs-ecs_router-*. This special data stream has a default ingest pipeline associated with it. When logs are sent to that data stream, they'll never actually end up in that data stream. Instead the logs are parsed and routed to other data streams via the default ingest pipeline.
    • If the log is an ECS JSON log, it will be properly parsed using a json processor
    • If the ECS logs contain data_stream.dataset and/or data_stream.namespace field, they'll be routed to that data stream using a set processor that overrides the _index field.
    • If the logs don't contain a specific data stream or if they are not proper ECS JSON logs, they're routed to a default data stream, such as logs-generic-default.
  • When we install this integration by default, other integrations can re-use the ECS JSON parsing pipeline just by overriding the _index field to logs-ecs_router-default when they detect that a log line is ECS JSON compatible. This avoids having to copy/paste the pipeline to other integrations.
  • Due to the dynamic routing, we don't know all the target data streams upfront and therefore can't add mappings to these.
    • We can't expect users to manually configure the mappings for all their log data streams upfront.
    • The logs-*-* data stream only has minimal mappings, missing most of the ECS logging fields. Can we change the mapping for logs-*-* to contain all relevant ECS fields?
    • As an alternative that doesn't require changes to logs-*-*, we could add a mapping in the ECS logs integration for the logs-ecs.* prefix, similar to what APM Server does for the metrics-apm.app.<service.name>-* data stream. Instead of routing to logs-{{{data_stream.dataset}}}-{{{data_stream.namespace}}}, we would route to logs-ecs.{{{data_stream.dataset}}}-{{{data_stream.namespace}}} so that the full ECS log mappings are applied instead of the default logs-*-* mappings.
      The tradeoff is that the data streams would look like logs-ecs.my_app-default instead of logs-my_app-default but that's probably fine.
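The routing idea in the bullets above can be sketched as a default pipeline on the router data stream. This is only an illustration of the mechanism (the pipeline name and the generic/default fallbacks mirror the POC pipeline earlier in this issue; none of these names are final):

```json
PUT _ingest/pipeline/logs-ecs_router
{
  "processors": [
    {
      "set": {
        "field": "data_stream.dataset",
        "value": "generic",
        "override": false
      }
    },
    {
      "set": {
        "field": "data_stream.namespace",
        "value": "default",
        "override": false
      }
    },
    {
      "set": {
        "field": "_index",
        "value": "logs-{{{data_stream.dataset}}}-{{{data_stream.namespace}}}"
      }
    }
  ]
}
```

Because override is false on the first two set processors, documents that already carry their own data_stream.dataset/namespace keep them, and everything else falls back to logs-generic-default; the final set rewrites _index so the document never actually lands in the router data stream.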

@ruflin
Member

ruflin commented Apr 1, 2022

The logs-*-* data stream only has minimal mappings, missing most of the ECS logging fields. Can we change the mapping for logs-*-* to contain all relevant ECS fields?

What mappings are you missing? There is a default for keywords, and the message field is also mapped correctly, plus ip fields: https://github.com/elastic/elasticsearch/blob/feature/apm-integration/x-pack/plugin/core/src/main/resources/data-streams-mappings.json These are all dynamic mappings, but if the fields are keywords anyway, this should not be an issue.

My preference would be to not work with the ecs. prefix if possible because I think of all the data streams as ECS.

What you propose is worth a try. I think it is not ideal that we have to rewrite _index, but as the routing feature does not exist yet, I think it is the only option.

@felixbarny
Member

felixbarny commented Apr 1, 2022

if the fields are keywords anyways, this should not be an issue.

Good point. That will probably do for most cases.
One area where explicit mappings are beneficial is when custom fields would create a mapping that conflicts with ECS: specifically, when a field is supposed to be an object but a custom field with a concrete value is indexed first. For example, if the custom field error: true is indexed, this creates a boolean mapping. If a proper ECS field error.message gets indexed after that, the valid ECS log line will cause an indexing error.
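This kind of conflict can be reproduced with two requests (the index name conflict-demo is made up purely for illustration):

```json
PUT conflict-demo/_doc/1
{
  "error": true
}

PUT conflict-demo/_doc/2
{
  "error": {
    "message": "Connection refused"
  }
}
```

The first document dynamically maps error as boolean; the second, perfectly valid ECS document is then rejected with a mapping error, because error can no longer become an object.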

It seems like there have been similar issues with the host field which is mapped to object in the generic data stream mapping:
https://github.com/elastic/elasticsearch/blob/d834dd5a2ffd4543b65455fe7ee3334fdf4b42f3/x-pack/plugin/core/src/main/resources/data-streams-mappings.json#L56-L58

It might make sense to do that with more/all top-level ECS fields.

However, custom fields could still corrupt the mapping if they map an ECS field to a different type, such as "log.level": 1 vs "log.level": "INFO".

@Mpdreamz suggested that ECS field mappings should be part of Elasticsearch itself. That sounds like a good idea but also like a long term thing.

What you propose is worth a try.

Cool, I'll create a POC for an ECS log file integration.

I think it is not ideal that we have to rewrite _index but as the routing feature does not exist yet, I think it is only option.

I suppose we can always change the ECS pipeline later on when there are new routing capabilities in ES. But for now, it seems overriding _index does what we want and even if there will be more sophisticated event routing in the future, it might still rely on overriding _index.

A while ago, I created this POC. Please have a look:

@felixbarny felixbarny mentioned this issue Apr 1, 2022
@felixbarny
Member

Here's a first POC: #2972

@ruflin
Member

ruflin commented Apr 4, 2022

It might make sense to do that with more/all top-level ECS fields.

If there is a common problem around certain fields like host that you pointed out, +1 on adding these. But in general I would keep it to a minimum and set the bar really high. One thing we can also do for the common cases is to catch them in the ingest pipeline and do some magical renaming. The benefit of this is that in case of a conflict, ingestion would normally stop, but with this it continues.

About ECS mappings as part of Elasticsearch: I think of it more as a config option on a data stream and a very small subset of ECS. There are too many fields in ECS by now.

I missed elastic/elasticsearch#76511 but this sounds interesting. Like a very small subset of the routing pipeline but focused only on the data stream naming scheme and making sure we don't need to think about _index fields etc.

@ruflin
Member

ruflin commented Apr 4, 2022

For the POC #2972 it might be worth doing a quick recording with a demo and send it to a wider group. Seeing it in action will make it much easier to explain.

@felixbarny
Member

felixbarny commented Apr 4, 2022

@ruflin why don't we include all ECS fields in data-streams-mappings.json? I suppose there are some challenges, but also solutions for them:

  • ECS used to have a different versioning scheme but is now aligned with the stack release cycle.
  • Updating the mappings in the Elasticsearch repo based on changes in the ECS repo needs automation.
  • Big mappings aren't free. If an index has a lot of unused mappings, I suppose that comes with overhead on the cluster state and decreases the efficiency as the fields are sparse. To work around that, we could use the path_match feature of dynamic templates to apply a dynamic template (that's effectively static) for all ECS fields. By doing that, fields only get added to the index mapping if there's actually a document using that field. For example, we could define a dynamic template with a path_match log.level that makes sure it's mapped to a keyword field.
Click to expand dynamic template example for `log.level`
DELETE /test?ignore_unavailable
PUT /test
{
  "mappings": {
    "dynamic_templates": [
      {
        "log_level": {
          "path_match": "log.level",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

PUT test/_doc/1?refresh
{
  "log": {
    "level": 1
  }
}

PUT test/_doc/2?refresh
{
  "log": {
    "level": "INFO"
  }
}

GET test/_search
GET test/_mapping

Are there any other reasons that I'm missing?

@Mpdreamz
Member Author

Mpdreamz commented Apr 4, 2022

About ECS mappings as part of Elasticsearch: I think more of it as a config option on a data stream and a very small subset of ECS. There are too many fields by now in ECS.

I was thinking as well that this should be as easy as a data stream setting, e.g. ecs: true.
However, ideally it should be exhaustive, and in a way that we don't pay the cost of the actual mapping until a field is actually indexed. path_match could be such a way, but it'd be neat if Elasticsearch could abstract most of the noise around this even further so that we don't ship templates that are several MBs in size.

Having data streams be fully ready for ECS data would be huge, without the risk of introducing mapping differences between different integrations.

cc @jpountz @qhoxie and @elastic/ecs I'd be curious to hear if this has been discussed in the past already?

@felixbarny
Member

felixbarny commented Apr 4, 2022

I have just discovered a more general issue with the approach. The pipeline for logs-ecs_router.* dynamically creates data streams based on the data_stream.dataset value in the log lines. The problem is that the API key the integration uses doesn't allow creating indices. This results in the following security exception on ingest:

{"type":"security_exception","reason":"action [indices:admin/auto_create] is unauthorized for API key id [8-dt9H8BqGblnY2uSI--] of user [elastic/fleet-server] on indices [logs-foo-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]"}

Is there any way we can add the auto_create or create_index privilege to the API key?

For the POC #2972 it might be worth doing a quick recording with a demo and send it to a wider group. Seeing it in action will make it much easier to explain.

Good idea, but the issue described above makes it difficult to do an end-to-end demo. As the privileges for API keys can't be modified after the fact, I also don't see how I could create a temporary workaround. @ruflin do you have an idea?

@felixbarny
Member

felixbarny commented Apr 4, 2022

I just noticed a feature in the package spec that lets you define elasticsearch privileges on an integration.

I'll try adding

elasticsearch.privileges.indices: [auto_configure, create_doc, monitor]

to the main manifest. Not sure if monitor is required.

Update: seems like this didn't help

@felixbarny
Member

I think the issue is that the API key is scoped to the data stream for the integration.

Because the manifest specifies

type: logs
dataset: ecs_router

the integration can only index into logs-ecs_router-*.

We'd need a way to define a default pipeline for logs-ecs_router-* but allow the integration to dynamically create any logs-* data stream. I suppose there's currently no built-in feature that allows for that, right?

@ruflin
Member

ruflin commented Apr 5, 2022

@joshdover @mtojek Do we already have an escape hatch today that allows an integration to get the "default" permissions like logs-*-*? This would solve the problem temporarily. Any package that has something similar to a "routing pipeline" will face this issue. Maybe an additional permission flag could help here?

The ECS issue discussed above now bubbles up in several places and we need to have a shared discussion around it. I'll file an issue for it in ECS and will reference it here.

@ruflin
Member

ruflin commented Apr 5, 2022

I just filed elastic/ecs#1869 so we have this discussion in public inside the ECS repo.

@jpountz

jpountz commented Apr 5, 2022

@Mpdreamz I think the idea has been floating around a few times but I'm not aware of concrete discussions. Having an option to make Elasticsearch's default dynamic mapping rules ECS-compliant sounds like something that we could do by packaging ECS within Elasticsearch.

On the note of adding ingest pipelines for application logs, I know that @dliappis is monitoring this topic carefully, because application logs are one of the most bursty sources of data that we handle, and ingest pipelines have proved to incur non-negligible overhead.

@felixbarny
Member

@joshdover @mtojek Do we already have an escape hatch today that allows an integration to get the "default" permissions like logs-*-*? This would solve the problem temporarily. Any package that has something similar to a "routing pipeline" will face this issue. Maybe an additional permission flag could help here?

I have filed elastic/package-spec#315 to discuss the permission flag.

@zez3

zez3 commented Apr 9, 2022

There is a default for keywords, and the message field is also mapped correctly, plus ip fields: https://github.com/elastic/elasticsearch/blob/feature/apm-integration/x-pack/plugin/core/src/main/resources/data-streams-mappings.json These are all dynamic mappings, but if the fields are keywords anyway, this should not be an issue.

Funny thing, my dynamic mappings did not contain the ip match part, so all my *.ip fields got mapped as keywords. Perhaps it was not updated (I use the custom UDP integration, version 1.0.1, to parse Cisco WLC logs)? This happens even though I specify the type as ip in the tokenizer of my Beats dissect processors. Anyway, I tried to reindex, and that did NOT convert my *.ip field types to ip. I also tried an update by query with an ingest pipeline. That also did not change my field type mappings.
Is there a workaround here?

update: I had to roll over, and then it worked

Let me know if you need some testing after a solution has been chosen. It seems that I always find myself running the latest release in my prod environment.

@ruflin ruflin self-assigned this Jun 7, 2022
@zez3

zez3 commented Jun 20, 2022

I would love to have all the ECS fields by default in all the custom integrations so that I don't need to explicitly add them in my templates. After a package update, I had to re-add them today :(

Perhaps we should just add them to ecs.yml, but I am not sure if that conflicts with the other discussion #2839

@felixbarny
Member

We had an idea to make ECS mappings native in ES so that integrations don't need to manually add any ECS fields: elastic/elasticsearch#85692

@zez3

zez3 commented Jun 20, 2022

The logs integration does not ship with a template either today; it just uses the mapping provided by the logs-*-* template. As the data stream name is user defined, we need to figure out how to load a template for this specific log integration instance and set the pipeline accordingly. This would likely also require work on the Fleet side (@jen-huang ).

@ruflin
What if I were to set the ECS mappings in the logs-*-* template? Will this be overwritten by the next update?

@ruflin
Member

ruflin commented Jun 20, 2022

@zez3 Updates are rare, but if it is updated, yes. Instead, we should support @custom also on these base templates so you could add it there. You are likely interested in elastic/elasticsearch#85692

@ruflin ruflin removed their assignment Jun 24, 2022
@botelastic

botelastic bot commented Jun 24, 2023

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jun 24, 2023
@botelastic botelastic bot closed this as completed Dec 21, 2023