Nil reference on agent policy change #34137
This is happening because the Metricbeat configuration transformation does not guard against a nil input (beats/x-pack/metricbeat/cmd/agent.go, line 25 in d8204a2):

module := strings.Split(rawIn.Type, "/")[0]

We would want to guard against this. We would also want to audit all of the Beat configuration transformations for nil safety, not just Metricbeat. Fundamentally, I'm not sure we should even be calling these functions when the input is nil. The real bug may be that we are receiving a nil input in the first place (beats/x-pack/libbeat/management/managerV2.go, lines 492 to 499 in d8204a2).
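A minimal sketch of the kind of guard meant here; the function name, signature, and error handling are assumptions for illustration, not the actual Metricbeat transform:

import (
	"errors"
	"strings"

	"github.com/elastic/elastic-agent-client/v7/pkg/proto"
)

// transformMetricbeatInput is a hypothetical stand-in for the Metricbeat
// config transformation: it refuses a nil or type-less expected config
// instead of panicking on the nil dereference reported in this issue.
func transformMetricbeatInput(rawIn *proto.UnitExpectedConfig) (string, error) {
	if rawIn == nil || rawIn.Type == "" {
		return "", errors.New("expected unit config is nil or has no type")
	}
	module := strings.Split(rawIn.Type, "/")[0]
	return module, nil
}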
What is happening here is that we are getting a unit expected state on the Beat side where both the input and output configs are nil. It does seem to self-correct, though.
It appears to be possible that the agent will send the units with no config attached. Example messages from dumping out the contents of the CheckinExpected messages (a sketch of the kind of dump used appears after these logs):

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.728Z","message":"metricbeat start running.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log.origin":{"file.line":481,"file.name":"instance/beat.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.733Z","message":"CheckinExpectedV2: units:{id:\"http/metrics-monitoring\" type:OUTPUT state:HEALTHY config_state_idx:2 log_level:INFO} units:{id:\"http/metrics-monitoring-metrics-monitoring-agent\" state:HEALTHY config_state_idx:1 log_level:INFO} agent_info:{id:\"d8a450ce-b83e-4d99-8b35-829e48764dbe\" version:\"8.6.0\" snapshot:true}","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"ecs.version":"1.6.0"}

On the agent side, here is the transition from HEALTHY to FAILED:

{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Component state changed http/metrics-monitoring (HEALTHY->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (STOPPING->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"FAILED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed http/metrics-monitoring (STOPPING->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"FAILED","old_state":"STOPPING"},"ecs.version":"1.6.0"}

Here is the transition from FAILED to STARTING again:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Component state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Unit state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}

Here is the agent detecting the process as healthy again:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.836Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.836Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STARTING->HEALTHY): Healthy","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
This often causes a second restart of the Beats after they restart to apply an output change, resulting in Beats that are not running for up to 20s with the current process restart timeout.
I think the root of this issue is that when a process started by the agent exits, the agent isn't resetting the last observed unit states, causing it to send CheckinExpected messages as if the process hadn't just exited and were still in sync with the agent. We could equivalently work around this on the Beat side, but really this should be properly detected as an error.

Edit: when the agent restarts the process it clears the last checkin time and other associated state, so this isn't the problem. The agent appears to be sending the config properly in this case.
Something like this in the elastic-agent client_v2.go implementation seems to protect against it (needs a lot more testing):

@@ -376,15 +379,19 @@ func (c *clientV2) syncUnits(expected *proto.CheckinExpected) {
 		unit := c.findUnit(agentUnit.Id, UnitType(agentUnit.Type))
 		if unit == nil {
 			// new unit
-			unit = newUnit(agentUnit.Id, UnitType(agentUnit.Type), UnitState(agentUnit.State), UnitLogLevel(agentUnit.LogLevel), agentUnit.Config, agentUnit.ConfigStateIdx, c)
-			c.units = append(c.units, unit)
-			c.unitsCh <- UnitChanged{
-				Type: UnitChangedAdded,
-				Unit: unit,
+			// Only create the unit when the agent actually sent a config;
+			// adding a unit with a nil config is never valid today.
+			if agentUnit.Config != nil {
+				unit = newUnit(agentUnit.Id, UnitType(agentUnit.Type), UnitState(agentUnit.State), UnitLogLevel(agentUnit.LogLevel), agentUnit.Config, agentUnit.ConfigStateIdx, c)
+				c.units = append(c.units, unit)
+				c.unitsCh <- UnitChanged{
+					Type: UnitChangedAdded,
+					Unit: unit,
+				}
 			}

It doesn't really make sense to add a unit with no configuration (at least currently), so this is a simple way to hide the panic. I still think the right fix is on the agent side; this was just simple to test and confirm as a possible way to avoid the panic.
I should note that I observe this panic happening every time I change the agent output at all. This is 100% reproducible behaviour.
The agent appears to be behaving properly in this situation. I think the nil config is coming in on the Beat side of the V2 client somehow. Edit: nope, agent bug.
When a previously running process exits, it is possible that the most recent checkin expected message for that process was queued to be sent in the communicator checkin expected channel but never sent. When the process is restarted, this stale checkin expected message would then be sent to the new process as its first checkin. This can lead to invalid or unexpected initial checkin expected messages being sent to component processes: for example, the configuration index can be set to the correct value but the actual configuration set to nil, because the stale message was for a process that was already up to date. Component processes that don't expect this invalid state can then fail at startup; see elastic/beats#34137 for an example.

To fix this, the agent now clears out any pending checkin expected messages before starting new processes or services.
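A rough sketch of the shape of that fix; the type, field, and method names here are illustrative assumptions, not the agent's actual code:

import "github.com/elastic/elastic-agent-client/v7/pkg/proto"

// communicator is a stand-in for the agent-side runtime communicator;
// the real implementation holds more state than this.
type communicator struct {
	// Buffered channel (capacity 1 in this sketch) holding the next
	// expected-state message for the component process.
	checkinExpectedCh chan *proto.CheckinExpected
}

// clearPendingCheckinExpected drains any stale CheckinExpected message
// still queued for a component that just exited, so a freshly started
// process never receives a message meant for its predecessor.
func (c *communicator) clearPendingCheckinExpected() {
	select {
	case <-c.checkinExpectedCh:
		// Discarded a stale queued message.
	default:
		// Nothing was pending; don't block.
	}
}

Called before spawning the replacement process, a drain like this ensures the first CheckinExpected the new process sees is built from the agent's current state.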
Closed with elastic/elastic-agent#2036
When trying to reproduce the reopened elastic/elastic-agent#1926, I encountered an occasional nil reference here:

This resulted in the agent reporting as not healthy with:

1 or more components/units in a failed state