
Nil reference on agent policy change #34137

Closed
michalpristas opened this issue Dec 27, 2022 · 9 comments · Fixed by elastic/elastic-agent#2036
Labels: bug, Team:Elastic-Agent-Control-Plane

Comments

@michalpristas (Contributor)

While trying to reproduce the reopened elastic/elastic-agent#1926, I encountered an occasional nil reference here:

Unit state changed log-default-logfile-system-dfb73c57-14fc-4724-83d7-3b34fabed84a (STARTING->FAILED): [output unit has no config]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1049cb690]
goroutine 2182 [running]:
github.com/elastic/beats/v7/x-pack/metricbeat/cmd.metricbeatCfg(0x0, 0xc001282968?)
github.com/elastic/beats/v7/x-pack/metricbeat/cmd/agent.go:25 +0x50
github.com/elastic/beats/v7/x-pack/libbeat/management.(*TransformRegister).Transform(0x5?, 0x19?, 0x200000002?)
github.com/elastic/beats/v7/x-pack/libbeat/management/generate.go:60 +0x103
github.com/elastic/beats/v7/x-pack/libbeat/management.generateBeatConfig(0xc001697190?, 0xc001282a98?)
github.com/elastic/beats/v7/x-pack/libbeat/management/generate.go:232 +0x26
github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).reloadInputs(0xc000a23440, {0xc00028da20, 0x1, 0x0?})
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:540 +0x155
github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).reload(0xc000a23440, 0xc001282da8?)
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:450 +0x56b
github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).unitListen(0xc000a23440)
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:379 +0x405
created by github.com/elastic/beats/v7/x-pack/libbeat/management.(*BeatV2Manager).Start
github.com/elastic/beats/v7/x-pack/libbeat/management/managerV2.go:178 +0x116

This resulted in the agent reporting itself as not healthy, with one or more components/units in a failed state.

@cmacknz (Member) commented Dec 28, 2022

This is happening because the Metricbeat Beat configuration transformation does not guard against the rawIn variable being nil:

module := strings.Split(rawIn.Type, "/")[0]

We would want module := strings.Split(rawIn.GetType(), "/")[0] instead.
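For context, protobuf-generated getters check their receiver for nil, which is why the GetType() form avoids the panic. A minimal sketch of the pattern, using an illustrative stand-in type rather than the actual generated agent/Beats type:

// Illustrative stand-in for the generated unit config message; the real
// rawIn value is a protobuf-generated type from the agent client.
type unitExpectedConfig struct {
	Type string
}

// Generated getters check the receiver for nil, so calling GetType on a nil
// pointer returns the zero value instead of dereferencing nil.
func (c *unitExpectedConfig) GetType() string {
	if c == nil {
		return ""
	}
	return c.Type
}

// rawIn.Type panics when rawIn is nil; rawIn.GetType() returns "", so
// strings.Split(rawIn.GetType(), "/")[0] evaluates to "" instead of crashing.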

We would also want to audit all of the beat configuration transformations for nil safety, not just Metricbeat.

Fundamentally, I'm not sure we should even be calling these functions when the input is nil because they can't reasonably do anything with it.

The real bug may be that we are receiving nil input unit configurations. We guard against this for output changes, for example:

if unit == nil {
	// output is being stopped
	err := output.Reload(nil)
	if err != nil {
		return fmt.Errorf("failed to reload output: %w", err)
	}
	cm.lastOutputCfg = nil
	return nil
}
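A nil guard of the same shape could sit in front of the input transformations. The snippet below is only a sketch of the idea with hypothetical names (inputUnits, expectedConfig, logger), not the actual reloadInputs code:

for _, unit := range inputUnits { // hypothetical slice of expected input units
	cfg := unit.expectedConfig // hypothetical field holding the expected unit configuration
	if cfg == nil {
		// The unit arrived without a configuration; surface an error instead
		// of passing nil into the config transformation and panicking.
		logger.Errorf("input unit %s has no config", unit.id)
		continue
	}
	// ... transform cfg into a Beat input configuration and reload it ...
}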

cmacknz transferred this issue from elastic/elastic-agent Dec 28, 2022
@cmacknz (Member) commented Dec 28, 2022

What is happening here is that, on the Beat side, we are getting a unit expected state where both the input and output configs are nil but the state is Healthy. Something is definitely wrong here; I'm not sure which part of the system the problem is in, though.

It does seem to self-correct though.

@cmacknz (Member) commented Dec 29, 2022

It appears to be possible for the agent to send units with nil configurations at startup. I suspect it is sending the log level before any other information is available, or possibly the agent is sending a configuration that would have been correct for the previous process but is not correct for the current one.

Example message from dumping out the contents of the CheckinExpected message on the Beat side when this happens:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.728Z","message":"metricbeat start running.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log.origin":{"file.line":481,"file.name":"instance/beat.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.733Z","message":"CheckinExpectedV2: units:{id:\"http/metrics-monitoring\"  type:OUTPUT  state:HEALTHY  config_state_idx:2  log_level:INFO}  units:{id:\"http/metrics-monitoring-metrics-monitoring-agent\"  state:HEALTHY  config_state_idx:1  log_level:INFO}  agent_info:{id:\"d8a450ce-b83e-4d99-8b35-829e48764dbe\"  version:\"8.6.0\"  snapshot:true}","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"ecs.version":"1.6.0"}

On the agent side, here is the transition from HEALTHY to FAILED:

{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Component state changed http/metrics-monitoring (HEALTHY->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (STOPPING->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"FAILED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-29T01:35:55.618Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed http/metrics-monitoring (STOPPING->FAILED): Failed: pid '79325' exited with code '0'","component":{"id":"http/metrics-monitoring","state":"FAILED"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"FAILED","old_state":"STOPPING"},"ecs.version":"1.6.0"}

Here is the transition from FAILED to STARTING again:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Component state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-12-29T01:36:05.648Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Unit state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '79482'","component":{"id":"http/metrics-monitoring","state":"STARTING"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"STARTING","old_state":"FAILED"},"ecs.version":"1.6.0"}

Here is agent detecting the process as healthy again:

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.836Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2022-12-29T01:36:05.836Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STARTING->HEALTHY): Healthy","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}

@cmacknz (Member) commented Dec 29, 2022

This often causes a second restart of the Beats after they restart to apply an output change, leaving the Beats not running for up to 20s with the current process restart timeout.

@cmacknz (Member) commented Dec 29, 2022

I think the root of this issue is that when a process started by the agent exits, the agent itself isn't resetting the last observed unit states, causing it to send CheckinExpected messages as if the process hadn't just exited and were still in sync with the agent.

We could equivalently work around this on the Beat side, but really this should be properly detected as an error.

Edit: when the agent restarts the process, it clears the last checkin time and other associated state. This isn't the problem; the agent appears to be sending the config properly in this case.

https://github.com/elastic/elastic-agent/blob/2de2cbbcff9c4d9e1aab8f565b19150be701c7a6/pkg/component/runtime/command.go#L331-L341

@cmacknz (Member) commented Dec 29, 2022

Something like this in the elastic-agent client_v2.go implementation seems to protect against it (needs a lot more testing):

@@ -376,15 +379,19 @@ func (c *clientV2) syncUnits(expected *proto.CheckinExpected) {
 		unit := c.findUnit(agentUnit.Id, UnitType(agentUnit.Type))
 		if unit == nil {
 			// new unit
-			unit = newUnit(agentUnit.Id, UnitType(agentUnit.Type), UnitState(agentUnit.State), UnitLogLevel(agentUnit.LogLevel), agentUnit.Config, agentUnit.ConfigStateIdx, c)
-			c.units = append(c.units, unit)
-			c.unitsCh <- UnitChanged{
-				Type: UnitChangedAdded,
-				Unit: unit,
+			if agentUnit.Config != nil {
+				unit = newUnit(agentUnit.Id, UnitType(agentUnit.Type), UnitState(agentUnit.State), UnitLogLevel(agentUnit.LogLevel), agentUnit.Config, agentUnit.ConfigStateIdx, c)
+				c.units = append(c.units, unit)
+				c.unitsCh <- UnitChanged{
+					Type: UnitChangedAdded,
+					Unit: unit,
+				}
 			}

It doesn't really make sense to add a unit with no configuration (at least currently), so this is a simple way to hide the panic. I still think the right fix is on the agent side; this was just simple to test and confirm as a possible way to avoid the panic.

@cmacknz (Member) commented Dec 29, 2022

I should note that I observe this panic happening every time I change the agent output at all. This is 100% reproducible behaviour.

@cmacknz (Member) commented Dec 30, 2022

The agent appears to be behaving properly in this situation. I think the nil config is somehow coming in on the Beat side of the V2 client.

Edit: nope, agent bug.

cmacknz self-assigned this Dec 31, 2022
cmacknz added a commit to cmacknz/elastic-agent that referenced this issue Dec 31, 2022
When a previously running process exits, it is possible that the most
recent checkin expected message for that process was queued to be sent
in the communicator checkin expected channel but never sent.

When the process is restarted, this stale previous checkin expected
message would then be sent to the new process as the first checkin.

This can lead to invalid or unexpected initial checkin expected messages
being sent to component processes, for example the configuration index
can be set to the correct value but the actual configuration set to nil
because the stale message was for a process that was already up to date.

Component processes that don't expect this invalid state can then fail
at startup, see elastic/beats#34137 for
example.

To fix this the agent now clears out any pending checkin expected
messages before starting new processes or services.
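As a rough illustration of the fix described above, a pending message can be cleared from a buffered channel with a non-blocking receive before the replacement process is started. This is only a sketch of the idea with an illustrative function and channel type, not the actual elastic-agent communicator code:

// drainPending drops any checkin-expected messages still queued for the
// previous incarnation of a process, so the restarted process does not see a
// stale first checkin. It would be called before spawning the replacement.
func drainPending[T any](ch chan T) {
	for {
		select {
		case <-ch: // discard a stale queued message
		default:
			return // nothing pending
		}
	}
}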
@cmacknz (Member) commented Jan 3, 2023

Closed with elastic/elastic-agent#2036
