
Nil reference on agent policy change #34137

Closed
michalpristas opened this issue Dec 27, 2022 · 9 comments · Fixed by elastic/elastic-agent#2036
Labels: bug, Team:Elastic-Agent-Control-Plane

Comments

@michalpristas (Contributor)

While trying to reproduce the reopened elastic/elastic-agent#1926, I encountered an occasional nil reference here:

Unit state changed log-default-logfile-system-dfb73c57-14fc-4724-83d7-3b34fabed84a (STARTING->FAILED): [output unit has no config]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1049cb690]
goroutine 2182 [running]:
github.com/elastic/beats/v7/x-pack/metricbeat/cmd.metricbeatCfg(0x0, 0xc001282968?)
github.com/elastic/beats/v7/x-pack/metricbeat/cmd/agent.go:25 +0x50
github.com/elastic/beats/v7/x-pack/libbeat/management.(*TransformRegister).Transform(0x5?, 0x19?, 0x200000002?)
github.com/elastic/beats/v7/x-pack/libbeat/management/generate.go:60 +0x103
github.com/elastic/beats/v7/x-pack/libbeat/management.generateBeatConfig(0xc001697190?, 0xc001282a98?)
github.com/elastic/beats/v7/x-pack/libbeat/management/generate.go:232 +0x26
github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).reloadInputs(0xc000a23440, {0xc00028da20, 0x1, 0x0?})
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:540 +0x155
github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).reload(0xc000a23440, 0xc001282da8?)
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:450 +0x56b
github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).unitListen(0xc000a23440)
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:379 +0x405
created by github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).Start
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:178 +0x116

This resulted in the agent reporting itself as not healthy, with one or more components/units in a failed state.

@cmacknz (Member) commented Dec 28, 2022

This is happening because the Metricbeat Beat configuration transformation does not guard against the rawIn variable being nil:

module := strings.Split(rawIn.Type, "/")[0]

We would want module := strings.Split(rawIn.GetType(), "/")[0] instead.
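For context, protobuf-generated getters check their receiver for nil, which is why the GetType() form avoids the panic. A minimal sketch of the pattern, using an illustrative stand-in type rather than the actual generated agent/Beats type:

// Illustrative stand-in for the generated unit config message; the real
// rawIn value is a protobuf-generated type from the agent client.
type unitExpectedConfig struct {
	Type string
}

// Generated getters check the receiver for nil, so calling GetType on a nil
// pointer returns the zero value instead of dereferencing nil.
func (c *unitExpectedConfig) GetType() string {
	if c == nil {
		return ""
	}
	return c.Type
}

// rawIn.Type panics when rawIn is nil; rawIn.GetType() returns "", so
// strings.Split(rawIn.GetType(), "/")[0] evaluates to "" instead of crashing.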

We would also want to audit all of the beat configuration transformations for nil safety, not just Metricbeat.

Fundamentally, I'm not sure we should even be calling these functions when the input is nil because they can't reasonably do anything with it.

The real bug may be that we are receiving nil input unit configurations. We guard against this for output changes, for example:

if unit == nil {
	// output is being stopped
	err := output.Reload(nil)
	if err != nil {
		return fmt.Errorf("failed to reload output: %w", err)
	}
	cm.lastOutputCfg = nil
	return nil
}
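A nil guard of the same shape could sit in front of the input transformations. The snippet below is only a sketch of the idea with hypothetical names (inputUnits, expectedConfig, logger), not the actual reloadInputs code:

for _, unit := range inputUnits { // hypothetical slice of expected input units
	cfg := unit.expectedConfig // hypothetical field holding the expected unit configuration
	if cfg == nil {
		// The unit arrived without a configuration; surface an error instead
		// of passing nil into the config transformation and panicking.
		logger.Errorf("input unit %s has no config", unit.id)
		continue
	}
	// ... transform cfg into a Beat input configuration and reload it ...
}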

cmacknz transferred this issue from elastic/elastic-agent Dec 28, 2022
@cmacknz (Member) commented Dec 28, 2022

What is happening here is that, on the Beat side, we are getting a unit expected state where both the input and output configs are nil but the state is Healthy. Something is definitely wrong here; I'm not sure which part of the system the problem is in, though.

It does seem to self-correct though.

@cmacknz (Member) commented Dec 29, 2022

It appears to be possible for the agent to send units with nil configurations at startup. I suspect it is sending the log level before any other information is available, or possibly the agent is sending a configuration that would have been correct for the previous process but is not correct for the current one.

Example message from dumping out the contents of the CheckinExpected message on the Beat side when this happens:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.728Z","message":"metricbeat start running.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log.origin":{"file.line":481,"file.name":"instance/beat.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.733Z","message":"CheckinExpectedV2: units:{id:\"http/metrics-monitoring\"  type:OUTPUT  state:HEALTHY  config_state_idx:2  log_level:INFO}  units:{id:\"http/metrics-monitoring-metrics-monitoring-agent\"  state:HEALTHY  config_state_idx:1  log_level:INFO}  agent_info:{id:\"d8a450ce-b83e-4d99-8b35-829e48764dbe\"  version:\"8.6.0\"  snapshot:true}","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"ecs.version":"1.6.0"}

On the agent side, here is the transition from HEALTHY to FAILED:

{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Component state changed http/metrics-monitoring (HEALTHY->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (STOPPING->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"FAILED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed http/metrics-monitoring (STOPPING->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"FAILED","old_state":"STOPPING"},"ecs.version":"1.6.0"}

Here is the transition from FAILED to STARTING again:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Component state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Unit state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}

Here is agent detecting the process as healthy again:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.836Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.836Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STARTING->HEALTHY): Healthy","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}

@cmacknz (Member) commented Dec 29, 2022

This often causes a second restart of the Beats after they restart to apply an output change, leaving the Beats not running for up to 20s with the current process restart timeout.

@cmacknz (Member) commented Dec 29, 2022

I think the root of this issue is that when a process started by the agent exits, the agent itself isn't resetting the last observed unit states, causing it to send CheckinExpected messages as if the process hadn't just exited and were still in sync with the agent.

We could equivalently work around this on the Beat side, but really this should be properly detected as an error.

Edit: when the agent restarts the process, it clears the last checkin time and other associated state. This isn't the problem; the agent appears to be sending the config properly in this case.

https://github.com/elastic/elastic-agent/blob/2de2cbbcff9c4d9e1aab8f565b19150be701c7a6/pkg/component/runtime/command.go#L331-L341

@cmacknz (Member) commented Dec 29, 2022

Something like this in the elastic-agent client_v2.go implementation seems to protect against it (needs a lot more testing):

@@ -376,15 +379,19 @@ func (c *clientV2) syncUnits(expected *proto.CheckinExpected) {
 		unit := c.findUnit(agentUnit.Id, UnitType(agentUnit.Type))
 		if unit == nil {
 			// new unit
-			unit = newUnit(agentUnit.Id, UnitType(agentUnit.Type), UnitState(agentUnit.State), UnitLogLevel(agentUnit.LogLevel), agentUnit.Config, agentUnit.ConfigStateIdx, c)
-			c.units = append(c.units, unit)
-			c.unitsCh <- UnitChanged{
-				Type: UnitChangedAdded,
-				Unit: unit,
+			if agentUnit.Config != nil {
+				unit = newUnit(agentUnit.Id, UnitType(agentUnit.Type), UnitState(agentUnit.State), UnitLogLevel(agentUnit.LogLevel), agentUnit.Config, agentUnit.ConfigStateIdx, c)
+				c.units = append(c.units, unit)
+				c.unitsCh <- UnitChanged{
+					Type: UnitChangedAdded,
+					Unit: unit,
+				}
 			}

It doesn't really make sense to add a unit with no configuration (at least currently), so this is a simple way to hide the panic. I still think the right fix is on the agent side; this was just simple to test and confirm as a possible way to avoid the panic.

@cmacknz (Member) commented Dec 29, 2022

I should note that I observe this panic happening every time I change the agent output at all. This is 100% reproducible behaviour.

@cmacknz (Member) commented Dec 30, 2022

The agent appears to be behaving properly in this situation. I think the nil config is somehow coming in on the Beat side of the V2 client.

Edit: nope, agent bug.

cmacknz self-assigned this Dec 31, 2022
cmacknz added a commit to cmacknz/elastic-agent that referenced this issue Dec 31, 2022
When a previously running process exits, it is possible that the most
recent checkin expected message for that process was queued to be sent
in the communicator checkin expected channel but never sent.

When the process is restarted, this stale previous checkin expected
message would then be sent to the new process as the first checkin.

This can lead to invalid or unexpected initial checkin expected messages
being sent to component processes, for example the configuration index
can be set to the correct value but the actual configuration set to nil
because the stale message was for a process that was already up to date.

Component processes that don't expect this invalid state can then fail
at startup, see elastic/beats#34137 for
example.

To fix this the agent now clears out any pending checkin expected
messages before starting new processes or services.
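As a rough illustration of the fix described above, a pending message can be cleared from a buffered channel with a non-blocking receive before the replacement process is started. This is only a sketch of the idea with an illustrative function and channel type, not the actual elastic-agent communicator code:

// drainPending drops any checkin-expected messages still queued for the
// previous incarnation of a process, so the restarted process does not see a
// stale first checkin. It would be called before spawning the replacement.
func drainPending[T any](ch chan T) {
	for {
		select {
		case <-ch: // discard a stale queued message
		default:
			return // nothing pending
		}
	}
}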
@cmacknz (Member) commented Jan 3, 2023

Closed with elastic/elastic-agent#2036
