Agents go to Unhealthy state temporarily on changing logging level under Agent Logs tab. #1912

amolnater-qasource · 2022-12-08T08:54:56Z

Kibana version: 8.6 BC6 Kibana cloud environment

Host OS and Browser version: All, All

Build details:

VERSION: 8.6.0 BC6 Kibana cloud environment
BUILD: 58740
COMMIT: f329a77595950244361736dff7208a810299fd69

Preconditions:

An 8.6 BC6 Kibana cloud environment should be available.
Windows, Mac and Linux agents should be installed.

Steps to reproduce:

Navigate to Fleet tab.
Update logging level for agents from agent logs.
Observe that agents move to the Unhealthy state temporarily (for approximately 10 minutes).

Logs:
[Windows]elastic-agent-diagnostics-2022-12-08T08-38-54Z-00.zip
[Linux]elastic-agent-diagnostics-2022-12-08T08-32-12Z-00.zip
[MAC]elastic-agent-diagnostics-2022-12-08T08-32-38Z-00.zip

Screenshot:

Expected Result:
Agents should remain healthy on changing logging level under Agent Logs tab.

amolnater-qasource · 2022-12-08T08:55:07Z

@manishgupta-qasource Please review.

manishgupta-qasource · 2022-12-08T09:25:32Z

Secondary review for this ticket is Done

cmacknz · 2022-12-08T20:59:28Z

I'm not entirely sure what is causing this, but I see "Error while stopping harvester group: task failures\n\terror while adding new reader to the bookkeeper harvester is already running for file" in both the Linux and Mac log files.

{"log.level":"error","@timestamp":"2022-12-08T08:28:43.761Z","message":"Error while stopping harvester group: task failures\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"prospector":"file_prospector","log.logger":"input.filestream","log.origin":{"file.line":294,"file.name":"filestream/prospector.go"},"id":"filestream-monitoring-agent","service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

I can also see this in the Windows logs right after the log level change:

{"log.level":"warn","@timestamp":"2022-12-08T08:27:13.855Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://37ac1814a0eb4fc2882b10eafd9e145b.fleet.us-central1.gcp.foundit.no:443/ errored: Post \"https://37ac1814a0eb4fc2882b10eafd9e145b.fleet.us-central1.gcp.foundit.no:443/api/fleet/agents/3f82c333-9c03-40c6-a099-a25fd8aec301/checkin?\": context canceled\n\n"},"request_duration_ns":0,"failed_checkins":1,"retry_after_ns":68201468007,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-08T08:27:13.855Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":207},"message":"checkin retry loop was stopped","ecs.version":"1.6.0"}
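The Windows warning above carries a retry_after_ns field. As a rough aid for reading such lines, the sketch below parses one JSON log line and converts that field to seconds. The sample is a trimmed copy of the warning (only the fields used are kept), and the helper name checkin_retry_seconds is made up for this illustration. It only reads the value; it does not show that the retry delay is what causes the Unhealthy state.

```python
import json

# Trimmed copy of the Windows warning above; only the fields used here are kept.
SAMPLE = (
    '{"log.level":"warn","@timestamp":"2022-12-08T08:27:13.855Z",'
    '"message":"Possible transient error during checkin with fleet-server, retrying",'
    '"failed_checkins":1,"retry_after_ns":68201468007}'
)


def checkin_retry_seconds(line):
    """Return the retry delay in seconds, or None if the line has no retry_after_ns."""
    record = json.loads(line)
    delay_ns = record.get("retry_after_ns")
    if delay_ns is None:
        return None
    return delay_ns / 1_000_000_000  # nanoseconds to seconds


delay = checkin_retry_seconds(SAMPLE)
print(f"next checkin retry in about {delay:.1f} s")  # prints: next checkin retry in about 68.2 s
```

For this line the scheduled retry is roughly 68 seconds away, which may be worth comparing against how long the agent is shown as Unhealthy in Fleet.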

cmacknz · 2022-12-08T21:00:03Z

Likely we should retest this after #1896

cmacknz · 2022-12-14T19:01:59Z

This should be resolved in the next 8.6 snapshot build or BC.

dikshachauhan-qasource · 2022-12-15T11:39:22Z

Hi @cmacknz

We have revalidated this issue on the 8.6 BC7 Kibana staging and Prod environments and found it still reproducible.

Screenshot:

Build details:
BUILD: 58773
COMMIT: 511e7c66ad8b290feb0af3ea262ab3fb08cc87be
Artifact: https://staging.elastic.co/8.6.0-8cf9e954/summary-8.6.0.html

Please let us know if more details are required.

Thanks.

jlind23 · 2022-12-15T13:45:06Z

@dikshachauhan-qasource the fix was merged to the 8.6 branch after the latest BC was built. Thus we need to wait for another BC.

dikshachauhan-qasource · 2022-12-16T05:55:44Z

Hi @jlind23

Thanks for the update. We will retest this on the next BC.

cmacknz · 2022-12-22T20:45:19Z

@dikshachauhan-qasource @amolnater-qasource Please retest this with the latest snapshot being built today, along with #1959

amolnater-qasource · 2022-12-26T10:38:25Z

Hi @cmacknz
We have revalidated this issue on the latest 8.6 SNAPSHOT Kibana cloud-staging environment and found it still reproducible.

Observations:

Agents go to the Unhealthy state temporarily on changing the logging level under the Agent Logs tab.

Build details:
BUILD: 58830
COMMIT: 6a5d6d96a534be75fc58acda8f89f2610309d7ff
Artifact: https://snapshots.elastic.co/8.6.0-f6d7d537/downloads/beats/elastic-agent/elastic-agent-8.6.0-SNAPSHOT-windows-x86_64.zip

Screenshots:

Logs:
elastic-agent-diagnostics-2022-12-26T07-53-38Z-00.zip

Please let us know if anything else is required from our end.
Thanks

michalpristas · 2022-12-28T12:09:51Z

The Unhealthy state is due to Beat restarts.
There is a "beat is restarting because output changed" message in the logs.
Please retest after #2003.

ghost · 2022-12-29T11:56:58Z

Hi @cmacknz,

We have re-validated this issue on the latest 8.6.0 BC9 Kibana Cloud environment and observed the following:

Observations:

Agents go to the Unhealthy state temporarily on installation.
Agents go to the Unhealthy state temporarily on adding integrations.
Agents go to the Unhealthy state temporarily on agent restart.

Screenshots:

Build details:

Version: 8.6.0 BC9
Build: 58832
Commit: 93183bddac40f8a7ee8e566d1651f9f3b586a520

Agents Logs:

Windows:
- elastic-agent-diagnostics-2022-12-29T11-48-58Z-00.zip
Linux:
- elastic-agent-diagnostics-2022-12-29T11-52-11Z-00.zip
Mac:
- elastic-agent-diagnostics-2022-12-29T11-45-42Z-00.zip

Please let us know if we are missing anything.

Thanks!

michalpristas · 2022-12-30T10:36:20Z

The mentioned PR is not part of the BC; please revalidate with a SNAPSHOT build.

jlind23 · 2023-01-03T09:11:46Z

@amolnater-qasource @dikshachauhan-qasource any update to provide here?

amolnater-qasource · 2023-01-03T12:11:26Z

Hi @jlind23
We have revalidated this issue on the latest 8.6 SNAPSHOT Kibana cloud environment and observed the following:

The agent still becomes Unhealthy, inconsistently, on changing the log level to debug.
On restarting the agent service or machine manually, agents return to Healthy and debug logs are generated.

OS:
Windows, Linux and MAC

Build details:
BUILD: 58836
COMMIT: c735e9fc6fdf0221cc3134b5fe110e9ae4a0effb
Artifact links: https://snapshots.elastic.co/8.6.0-39d183e3/downloads/beats/elastic-agent/elastic-agent-8.6.0-SNAPSHOT-linux-x86_64.tar.gz
https://snapshots.elastic.co/8.6.0-39d183e3/downloads/beats/elastic-agent/elastic-agent-8.6.0-SNAPSHOT-windows-x86_64.zip

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-03.17-06-33.mp4

On Restarting:

Logs Before agent restart:
[Before Restart]elastic-agent-diagnostics-2023-01-03T11-33-58Z-00.zip

Logs After agent restart:
[After Restart]elastic-agent-diagnostics-2023-01-03T12-09-00Z-00.zip

Please let us know if anything else is required from our end.
Thanks

michalpristas · 2023-01-03T12:54:22Z

We're probably running into an issue related to restarts that @cmacknz was working on; the fix is not part of the latest SNAPSHOT yet.

The issue is here: #2036
We can see "output unit has no config" before the status changed to FAILED. This message is emitted during output reload when the expected config is nil.

cmacknz · 2023-01-03T13:49:38Z

Yes, that is a symptom of elastic/beats#34137, which will be fixed in the next BC.

jlind23 · 2023-01-04T14:42:07Z

Closing this as elastic/beats#34137 was merged

cmacknz · 2023-01-04T15:51:43Z

I think we are unnecessarily restarting the Beats when only the log level has changed for the output unit.

cmacknz · 2023-01-04T15:52:40Z

{"log.level":"info","@timestamp":"2023-01-03T11:22:01.766Z","log.origin":{"file.name":"handlers/handler_action_settings.go","file.line":68},"message":"Settings action done, setting agent log level to debug","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.774Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":729},"message":"Updating running component model","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed log-default-logfile-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed log-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed system/metrics-default-system/metrics-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed system/metrics-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (HEALTHY->CONFIGURING): Configuring","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed filestream-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed winlog-default-winlog-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed winlog-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.785Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->CONFIGURING): Configuring","component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.785Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed http/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.890Z","message":"beat is restarting because output changed","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":503,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
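To make the pattern in these lines easier to spot in a full diagnostics log, here is a minimal sketch that tallies unit state transitions and lists the components that logged "beat is restarting because output changed". It assumes one JSON object per line, as in the excerpt above. The sample lines are trimmed copies of three of those entries, and the function name summarize is made up for this illustration.

```python
import json
from collections import Counter

# Trimmed copies of three of the lines above; only the fields used here are kept.
SAMPLE_LINES = [
    '{"@timestamp":"2023-01-03T11:22:01.783Z","message":"Unit state changed log-default (HEALTHY->CONFIGURING): Configuring","unit":{"id":"log-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"}}',
    '{"@timestamp":"2023-01-03T11:22:01.785Z","message":"Unit state changed http/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","unit":{"id":"http/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"}}',
    '{"@timestamp":"2023-01-03T11:22:01.890Z","message":"beat is restarting because output changed","component":{"binary":"metricbeat","id":"http/metrics-monitoring"}}',
]

RESTART_MESSAGE = "beat is restarting because output changed"


def summarize(lines):
    """Count unit state transitions and collect components that report a restart."""
    transitions = Counter()
    restarts = []
    for line in lines:
        record = json.loads(line)
        unit = record.get("unit")
        if unit and "old_state" in unit:
            # Key on (unit type, previous state, new state).
            transitions[(unit["type"], unit["old_state"], unit["state"])] += 1
        if record.get("message") == RESTART_MESSAGE:
            restarts.append(record.get("component", {}).get("id"))
    return transitions, restarts


transitions, restarts = summarize(SAMPLE_LINES)
for (unit_type, old, new), count in sorted(transitions.items()):
    print(f"{count} {unit_type} unit(s): {old} -> {new}")
print("components restarting:", restarts)
```

Run against the real log file (passing an open file object instead of SAMPLE_LINES), this should show whether output units are reconfigured and Beats restarted immediately after the "Settings action done" entry.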

cmacknz · 2023-01-04T16:27:24Z

elastic/beats#34178 (not a blocker, just an optimization).
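The content of the linked issue is not reproduced in this thread, so the following only illustrates the idea from the earlier comment: skip the restart when the log level is the only thing that changed for an output unit. The ExpectedUnit type, its fields, and needs_restart are hypothetical names for this sketch and are not the Beats or Elastic Agent API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedUnit:
    """Hypothetical snapshot of what the agent expects from an output unit."""
    log_level: str
    config: tuple  # hashable stand-in for the output configuration


def needs_restart(old, new):
    """Restart only when the output configuration itself changed."""
    return old.config != new.config


# Hosts below are placeholders, not real endpoints.
before = ExpectedUnit("info", (("type", "elasticsearch"), ("hosts", "https://example.invalid:443")))
log_only = ExpectedUnit("debug", before.config)
new_output = ExpectedUnit("info", (("type", "elasticsearch"), ("hosts", "https://other.invalid:443")))

print(needs_restart(before, log_only))    # False: only the log level differs
print(needs_restart(before, new_output))  # True: the output config differs
```

Keeping the log level outside the compared configuration is the design choice that matters here: a level change can then be applied in place, while a real output change still triggers the restart.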

amolnater-qasource · 2023-01-05T10:16:22Z

Hi @cmacknz
We have revalidated this issue on the latest 8.6 BC10 Kibana cloud environment and found it fixed now.

Observations:

Agents remain Healthy on changing log level to debug.

Build details:
BUILD: 58852
COMMIT: d3a625ef4a6e611a5b3233a1ce5cbe8ef429eb47
Artifacts: https://staging.elastic.co/8.6.0-b6c773f9/summary-8.6.0.html#elastic-agent

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-05.15-14-05.mp4

Hence we are marking this issue as QA:Validated.
Thanks

amolnater-qasource added the bug, Team:Elastic-Agent-Control-Plane, and impact:medium labels Dec 8, 2022

cmacknz assigned michalpristas Dec 22, 2022

amolnater-qasource mentioned this issue Dec 26, 2022

No debug level logs are generated on changing log level to debug. #2012

Closed

jlind23 closed this as completed Jan 4, 2023

amolnater-qasource added the QA:Validated label Jan 5, 2023
