Rate limit "Cannot index event" log messages #40157

Closed
cmacknz opened this issue Jul 9, 2024 · 8 comments · Fixed by #40448
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@cmacknz
Member

cmacknz commented Jul 9, 2024

// Fatal error and no dead letter index, drop.
client.log.Warnf("Cannot index event (status=%v): dropping event! Look at the event log to view the event and cause.", itemStatus)
client.log.Warnw(fmt.Sprintf("Cannot index event %#v (status=%v): %s, dropping event!", event, itemStatus, itemMessage), logp.TypeKey, logp.EventType)
stats.nonIndexable++
return false

The "Cannot index event" log messages are a useful signal that events are being dropped and that (as of 8.15.0) you should look at the local event log for the reason.

Since this log message does not contain any useful debugging information, and has the potential to be generated for every event that flows through the pipeline, there is no value in logging it for each event.

Instead we should rate limit it so that it only appears once in a fixed interval when events are being dropped. The rate limit is initially proposed to be one message every 10 seconds.

The rate-limited message should include the number of events that were dropped in the current interval. The message can be changed to something like "Failed to index N events in last M seconds. Look at the event log to view the events and cause."

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jul 9, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23
Collaborator

jlind23 commented Jul 12, 2024

@pierrehilbert bumping the priority on this one as it recently had an impact on some users.
cc @lucabelluccini @nimarezainia

@AndersonQ
Member

@cmacknz, one question, should the report summarise how many events per status code? E.g.:

  • 10 events dropped:
    • 5: 402
    • 5: 403
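A per-status breakdown like the one asked about here could be sketched as follows, assuming the drops are tallied in a map keyed by HTTP status code. The `summarize` function and its output format are hypothetical and only illustrate the idea; they are not taken from the Beats code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarize is a hypothetical helper: it formats per-status drop counts as a
// single line, with status codes in ascending order so the output is stable.
func summarize(byStatus map[int]int) string {
	total := 0
	codes := make([]int, 0, len(byStatus))
	for code, n := range byStatus {
		codes = append(codes, code)
		total += n
	}
	sort.Ints(codes) // map iteration order is random, so sort explicitly

	parts := make([]string, 0, len(codes))
	for _, code := range codes {
		parts = append(parts, fmt.Sprintf("%d x status %d", byStatus[code], code))
	}
	return fmt.Sprintf("%d events dropped (%s)", total, strings.Join(parts, ", "))
}

func main() {
	fmt.Println(summarize(map[int]int{402: 5, 403: 5}))
}
```

The extra cost is one map update per dropped event plus a sort when the summary is emitted, which is small as long as the number of distinct status codes stays low.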

@cmacknz
Member Author

cmacknz commented Aug 9, 2024

It's useful if it is easy to do, if it adds significant complexity I wouldn't bother. The status code will be in the event logs.

@nimarezainia
Contributor

after speaking with @pierrehilbert, we would love to celebrate the benefits of the outcomes from this issue. Are we able to quantify the reduction in events/logs sent?

@AndersonQ
Member

> after speaking with @pierrehilbert, we would love to celebrate the benefits of the outcomes from this issue. Are we able to quantify the reduction in events/logs sent?

yes, if we have access to old logs, we can quantify it. I was actually thinking about quantifying it as well and adding it to the PR, but I got busy with other tasks.

let me try to make a quick and rough estimation

@konnextv

konnextv commented Oct 2, 2024

I came across this error today in a Metricbeat log and am wondering where to find the mentioned event log?
From my understanding I am looking at it and this is where I see the warning as well: /var/log/metricbeat/metricbeat-20241002-7.ndjson

@cmacknz
Member Author

cmacknz commented Oct 2, 2024

You need to be on 8.15+; by default the event logs are written next to the regular logs. If you are below 8.15, you can see the cause by turning on debug logging in the regular log files.

See

#=============================== Events Logging ===============================
