Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agents goes to Unhealthy state temporarily on changing logging level under Agent Logs tab. #1912

Closed
amolnater-qasource opened this issue Dec 8, 2022 · 22 comments
Assignees
Labels
bug Something isn't working impact:medium QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amolnater-qasource
Copy link

Kibana version: 8.6 BC6 kibana cloud environment

Host OS and Browser version: All, All

Build details:

VERSION: 8.6.0 BC6 Kibana cloud environment
BUILD: 58740
COMMIT: f329a77595950244361736dff7208a810299fd69

Preconditions:

  1. 8.6 BC6 kibana cloud environment should be available.
  2. Windows, Mac and linux agents should be installed.

Steps to reproduce:

  1. Navigate to Fleet tab.
  2. Update logging level for agents from agent logs.
  3. Observe agents move to Unhealthy state temporarily for (approximately 10 minutes).

Logs:
[Windows]elastic-agent-diagnostics-2022-12-08T08-38-54Z-00.zip
[Linux]elastic-agent-diagnostics-2022-12-08T08-32-12Z-00.zip
[MAC]elastic-agent-diagnostics-2022-12-08T08-32-38Z-00.zip

Screenshot:
6
7

Expected Result:
Agents should remain healthy on changing logging level under Agent Logs tab.

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:medium labels Dec 8, 2022
@amolnater-qasource
Copy link
Author

@manishgupta-qasource Please review.

@manishgupta-qasource
Copy link

Secondary review for this ticket is Done

@cmacknz
Copy link
Member

cmacknz commented Dec 8, 2022

I'm not entirely sure what is causing this, but I see Error while stopping harvester group: task failures\n\terror while adding new reader to the bookkeeper harvester is already running for file in both the Linux and Mac log files.

{"log.level":"error","@timestamp":"2022-12-08T08:28:43.761Z","message":"Error while stopping harvester group: task failures\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"prospector":"file_prospector","log.logger":"input.filestream","log.origin":{"file.line":294,"file.name":"filestream/prospector.go"},"id":"filestream-monitoring-agent","service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

I can also see this is the Windows logs right after the log level change:

{"log.level":"warn","@timestamp":"2022-12-08T08:27:13.855Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://37ac1814a0eb4fc2882b10eafd9e145b.fleet.us-central1.gcp.foundit.no:443/ errored: Post \"https://37ac1814a0eb4fc2882b10eafd9e145b.fleet.us-central1.gcp.foundit.no:443/api/fleet/agents/3f82c333-9c03-40c6-a099-a25fd8aec301/checkin?\": context canceled\n\n"},"request_duration_ns":0,"failed_checkins":1,"retry_after_ns":68201468007,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-08T08:27:13.855Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":207},"message":"checkin retry loop was stopped","ecs.version":"1.6.0"}

@cmacknz
Copy link
Member

cmacknz commented Dec 8, 2022

Likely we should retest this after #1896

@cmacknz
Copy link
Member

cmacknz commented Dec 14, 2022

This should be resolved in the next 8.6 snapshot build or BC.

@dikshachauhan-qasource
Copy link

Hi @cmacknz

We have revalidated this issue on 8.6 BC7 Kibana staging and Prod environment and found this issue still reproducible.

Screenshot:
image

Build details:
BUILD: 58773
COMMIT: 511e7c66ad8b290feb0af3ea262ab3fb08cc87be
Artifact: https://staging.elastic.co/8.6.0-8cf9e954/summary-8.6.0.html

Please let us know if more details are required.

Thanks.

@jlind23
Copy link
Contributor

jlind23 commented Dec 15, 2022

@dikshachauhan-qasource the fix was merged to the 8.6 branch after the latest BC was built. Thus we need to wait for another BC.

@dikshachauhan-qasource
Copy link

Hi @jlind23

Thanks for the update. We will retest this on next BC.

@cmacknz
Copy link
Member

cmacknz commented Dec 22, 2022

@dikshachauhan-qasource @amolnater-qasource Please retest this with the latest snapshot being built today, along with #1959

@amolnater-qasource
Copy link
Author

Hi @cmacknz
We have revalidated this issue on latest 8.6 SNAPSHOT kibana cloud-staging environment and found it still reproducible.

Observations:

  • Agents goes to Unhealthy state temporarily on changing logging level under Agent Logs tab.

Build details:
BUILD: 58830
COMMIT: 6a5d6d96a534be75fc58acda8f89f2610309d7ff
Artifact: https://snapshots.elastic.co/8.6.0-f6d7d537/downloads/beats/elastic-agent/elastic-agent-8.6.0-SNAPSHOT-windows-x86_64.zip

Screenshots:
1

Logs:
elastic-agent-diagnostics-2022-12-26T07-53-38Z-00.zip

Please let us know if anything else is required from our end.
Thanks

@michalpristas
Copy link
Contributor

unhealthy state is due to beat restarts
there's beat is restarting because output changed in the logs
please retest after #2003

@ghost
Copy link

ghost commented Dec 29, 2022

Hi @cmacknz,

We have re-validated this issue on the latest 8.6.0 BC9 Kibana Cloud environment and found the below observations:

Observations:

  • Agents goes to Unhealthy state temporarily on installation.
  • Agents goes to Unhealthy state temporarily on adding the integrations.
  • Agents goes to Unhealthy state temporarily on agent restart.

Screenshots:

image

Build details:

Version: 8.6.0 BC9
Build: 58832	
Commit: 93183bddac40f8a7ee8e566d1651f9f3b586a520

Agents Logs:

Please let us know if we are missing anything.

Thanks!

@michalpristas
Copy link
Contributor

the mentioned PR is not part of BC, please revalidate with SNAPSHOT

@jlind23
Copy link
Contributor

jlind23 commented Jan 3, 2023

@amolnater-qasource @dikshachauhan-qasource any update to provide here?

@amolnater-qasource
Copy link
Author

Hi @jlind23
We have revalidated this issue on latest 8.6 SNAPSHOT kibana cloud environment and had below observations:

  • Inconsistently, still agent gets unhealthy on changing log level to debug.
  • On restarting agent service/machine manually, agents gets back Healthy and debug log are generated.

OS:
Windows, Linux and MAC

Build details:
BUILD: 58836
COMMIT: c735e9fc6fdf0221cc3134b5fe110e9ae4a0effb
Artifact links: https://snapshots.elastic.co/8.6.0-39d183e3/downloads/beats/elastic-agent/elastic-agent-8.6.0-SNAPSHOT-linux-x86_64.tar.gz
https://snapshots.elastic.co/8.6.0-39d183e3/downloads/beats/elastic-agent/elastic-agent-8.6.0-SNAPSHOT-windows-x86_64.zip

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-03.17-06-33.mp4

On Restarting:
8
9

Logs Before agent restart:
[Before Restart]elastic-agent-diagnostics-2023-01-03T11-33-58Z-00.zip

Logs After agent restart:
[After Restart]elastic-agent-diagnostics-2023-01-03T12-09-00Z-00.zip

Please let us know if anything else is required from our end.
Thanks

@michalpristas
Copy link
Contributor

we're probably running into an issue related to restarts @cmacknz was working on and is not part of latest SNAPSHOT yet.

the issue is here: #2036
we can see output unit has no config before status changed to FAILED this message is happening during output reload when expected config is nil.

@cmacknz
Copy link
Member

cmacknz commented Jan 3, 2023

Yes, that is a symptom of elastic/beats#34137 which will be fixed in the next BC

@jlind23
Copy link
Contributor

jlind23 commented Jan 4, 2023

Closing this as elastic/beats#34137 was merged

@jlind23 jlind23 closed this as completed Jan 4, 2023
@cmacknz
Copy link
Member

cmacknz commented Jan 4, 2023

I think we are unnecessarily restarting the Beats when only the log level has changed for the output unit.

@cmacknz
Copy link
Member

cmacknz commented Jan 4, 2023

{"log.level":"info","@timestamp":"2023-01-03T11:22:01.766Z","log.origin":{"file.name":"handlers/handler_action_settings.go","file.line":68},"message":"Settings action done, setting agent log level to debug","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.774Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":729},"message":"Updating running component model","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed log-default-logfile-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed log-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed system/metrics-default-system/metrics-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed system/metrics-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (HEALTHY->CONFIGURING): Configuring","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed filestream-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed winlog-default-winlog-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed winlog-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.785Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->CONFIGURING): Configuring","component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.785Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed http/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.890Z","message":"beat is restarting because output changed","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":503,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

@cmacknz
Copy link
Member

cmacknz commented Jan 4, 2023

elastic/beats#34178 (not a blocker, just an optimization).

@amolnater-qasource
Copy link
Author

Hi @cmacknz
We have revalidated this issue on latest 8.6 BC10 Kibana cloud environment and found it fixed now.

Observations:

  • Agents remain Healthy on changing log level to debug.

Build details:
BUILD: 58852
COMMIT: d3a625ef4a6e611a5b3233a1ce5cbe8ef429eb47
Artifacts: https://staging.elastic.co/8.6.0-b6c773f9/summary-8.6.0.html#elastic-agent

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-05.15-14-05.mp4

Hence we are marking this issue as QA:Validated.
Thanks

@amolnater-qasource amolnater-qasource added the QA:Validated Validated by the QA Team label Jan 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working impact:medium QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

6 participants