"failed to rotate backups" error after running agent enroll command. #24173

Closed
amolnater-qasource opened this issue Feb 23, 2021 · 15 comments · Fixed by #24466
Labels
bug Team:Elastic-Agent Label for the Agent team v7.13.0

Comments

@amolnater-qasource

Kibana version: 7.12.0 Snapshot Kibana Cloud environment

Host OS and Browser version: Windows 10, All

Preconditions:

  1. 7.12.0 Snapshot Kibana cloud environment should be available.

Build Details:

Artifact link: https://snapshots.elastic.co/7.12.0-999aaf18/downloads/beats/elastic-agent/elastic-agent-7.12.0-SNAPSHOT-windows-x86_64.zip
Build: 38956
Commit: 90fc153d85334ec153204ac9d3702b26a38099e4

Steps to reproduce:

  1. Log in to the Kibana Cloud environment.
  2. Download and extract elastic-agent-7.12.0-SNAPSHOT-windows-x86_64.zip in Program Files.
  3. Navigate to Fleet>Agents>Add agent.
  4. Copy the enroll command for Windows and run it in PowerShell (admin).
  5. Observe the "failed to rotate backups" error.
  6. Observe the "Successfully enrolled the Elastic Agent." confirmation.

Expected Result:
Agent should be installed without any errors.


Note:
No impact on the Agent's operation was observed.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 23, 2021
@amolnater-qasource
Author

@manishgupta-qasource Please review.

@amolnater-qasource amolnater-qasource added the Team:Elastic-Agent Label for the Agent team label Feb 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 23, 2021
@manishgupta-qasource

Reviewed & assigned to @EricDavisX

@manishgupta-qasource manishgupta-qasource added impact:high Short-term priority; add to current release, or definitely next. bug labels Feb 23, 2021
@EricDavisX
Contributor

If nothing else in Agent usage initially seems broken, and if the Agent team can confirm that log roll-over functionality is OK, then we can consider the priority lower than 'high'. If we are not rolling over logs, it is indeed a desired fix for 7.12. @michalpristas do you have ideas?

@EricDavisX
Contributor

@ph fyi, thanks.

@EricDavisX EricDavisX removed the impact:high Short-term priority; add to current release, or definitely next. label Feb 23, 2021
@EricDavisX
Contributor

There were recent changes regarding enroll order and startup; this is likely related. After inspecting, we think logs are rolling over OK (after this startup error), so it is lower priority. We can scope it for later if we have urgent projects / bugs right now.

@michalpristas
Contributor

Identified where the issue is coming from; it is related to the Enroll change.

At installation time, we start the installation process (P1) from TempDir.

P1 unpacks and copies the agent to Program Files/Elastic/Agent (further referred to as InstallDir).
P1 installs the service and starts it, spinning up P2.
P2 is the actual agent process; it creates a log file.

Once P2 is up and running, P1 continues with enrolling the agent.
To enroll, P1 execs the enrollment process P3 from InstallDir.
P3 tries to rotate the logs because one already exists from P2, causing the Write to fail (we write info about the enrollment).
P3 exits.
P2 is restarted to reload the fleet config (it was running standalone up to this point) and continues normally.

I would say the impact is not that high, but it needs to be addressed.

We need to reduce the number of agent processes here so they won't collide.
A possible workaround/solution would be to call enroll via exec if the service is not running and via gRPC if it is.

Another approach would be to use a stdErr logger instead; we can redirect stdOut/Err from P3 to P1 so it is visible. I don't think we need to log it to a file. Enroll should always be called either as a direct cmd or by execing from install.

I think I would like std-redirects more, but I would like to hear your opinion as well. cc @blakerouse @ph @ruflin

@EricDavisX
Contributor

Thanks for the research, that is very detailed and helpful! @ph and @michalpristas I would be in favor of pushing it to 7.13 dev if it doesn't really impact Agent execution, to keep changes in 7.12 down to those really warranted. Open to your decision here.

@ph
Contributor

ph commented Feb 25, 2021

Agree on the proposition @EricDavisX, I have added it to the 7.13 iteration.

@blakerouse
Contributor

@michalpristas I think the enroll command should use the stdErr logger only; I do not think that enroll needs to write a log file. That would ensure that the running Elastic Agent daemon is the only one that works with the log file.

@EricDavisX
Contributor

When the 7.12 backport PR is merged, we can test it out there. FYI, it did not make BC4.

@dikshachauhan-qasource

Hi @EricDavisX

Thanks for the confirmation. We will validate this in the upcoming BCs as a follow-up to the reported issue.

Thanks
QAS

@EricDavisX
Contributor

@dikshachauhan-qasource can you install the 7.13 snapshot stack (self-managed or in cloud-staging) and test this out there? If Agent installs OK and logs seem OK, that is enough for now. I don't think we have to wait to see logs rotated to feel good that the change is working; the lack of an error message in the logs will be enough. I'm going to chat with the team about pushing the 7.12 PR in; it has a failing 'Go' PR check.

@dikshachauhan-qasource

Hi @EricDavisX,

Thanks for the update.

We have validated this on the 7.13 snapshot build and did not find it reproducible there, so it is working fine on the 7.13 snapshot.

Observations:

  • No "failed to rotate backups" error message was displayed on the endpoint while running the install command.


Build details:

BUILD 39526
COMMIT 6d23a9826606d065c5cbb467bb7bbe68f12da37b
Artifact link used: https://snapshots.elastic.co/7.13.0-7706124c/downloads/beats/elastic-agent/elastic-agent-7.13.0-SNAPSHOT-windows-x86_64.zip

Thanks
QAS

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on the 7.12.0 BC-5 self-managed Kibana and found it fixed.


Build details:

Build: 39309
Commit: b7f9a41f486a2910ef22a1274ec734219c35ca3e
Artifact link used: https://staging.elastic.co/7.12.0-583fca05/downloads/beats/elastic-agent/elastic-agent-7.12.0-windows-x86_64.zip

Thanks
QAS
