
[Fleet] Fleet takes 10 minutes from install of Agent & policy change to show Endpoint state 'Running' #80930

Closed
EricDavisX opened this issue Oct 16, 2020 · 9 comments · Fixed by #81376
Assignees
Labels
bug Fixes for quality problems that affect the customer experience regression Team:Fleet Team label for Observability Data Collection Fleet team v7.10.0



EricDavisX commented Oct 16, 2020

We did some performance tuning and increased the general polling time to 5 minutes; I believe it is related to #75552.

Then we intended at least a partial fix here: #78493

I'm not sure that, between the two, we have the user experience we want. I have recorded a movie and captured the relevant times in screenshots to tell the story.

I feel this is hurting the initial user experience (and demo experience), and I think it may be causing automated tests that were written prior to the 5-minute change to fail.

Kibana version:
7.11 snapshot with 7.11 Observability CI agent deployed to cloud

Browser version:
Chrome on Mac

Describe the bug:
When I install the Agent, it is fairly quick to show some progress in the Activity Details, but when I change the policy it takes a full 5 minutes for the log to show that the Agent has received the policy change, and then 5 more minutes to show Endpoint as Running.

Steps to reproduce:

  1. Deploy the Agent with the default policy and the new 7.11 'install' command. I used a CentOS 8 host.
  2. Wait for it to come online: a few seconds to see it enrolling, 10+ more to see it 'online', though with no data in the 'Activity log'. Wait about a minute to see Beats online.
  3. Change the policy to one with 'Endpoint' and wait 5 minutes, then maybe wait more.

Expected behavior:

  1. In prior versions it was faster, perhaps up to 60 seconds total for Endpoint to start up and show as Running. I'd like it to be as quick as that if possible.

My original thought (if this is at all feasible) was to have the check-in interval be dynamic after a policy change comes in: perhaps for the next 5 minutes it can check every minute, and after that go back to the 5-minute poll. Or the temporary 1-minute poll could run for a shorter window if it isn't needed. I'm not 100% sure what is happening between Endpoint and Agent here; I just know it feels long.
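For what it's worth, the dynamic check-in idea could look something like this rough TypeScript sketch (`nextCheckinDelay` and the interval constants are made-up names for illustration, not the actual Agent code):

```typescript
// Hypothetical sketch of a dynamic check-in interval, NOT the actual Agent code.
// After a policy change, poll every minute for a 5-minute window, then fall
// back to the regular 5-minute interval.

const REGULAR_INTERVAL_MS = 5 * 60 * 1000; // normal 5-minute poll
const FAST_INTERVAL_MS = 60 * 1000;        // temporary 1-minute poll
const FAST_WINDOW_MS = 5 * 60 * 1000;      // how long to stay in fast mode

function nextCheckinDelay(lastPolicyChangeAt: number, now: number): number {
  // Inside the fast window after a policy change, check in every minute.
  if (now - lastPolicyChangeAt < FAST_WINDOW_MS) {
    return FAST_INTERVAL_MS;
  }
  // Otherwise use the regular 5-minute poll.
  return REGULAR_INTERVAL_MS;
}
```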

The movie is boring but tells a good story. For the sake of posting to the issue, I have captured the moments where something happens and is shown to the user in the Activity Log.

sequence:
1. 0mins-21seconds-enrolling
2. 0mins-29seconds-online
3. 0mins-53seconds-running-and-start-to-change-policy
4. 1min-0seconds-policy-reassigned-toast
5. 5mins-33-seconds-no-endpoint-activity
6. 10mins-13-seconds-no-endpoint-activity-yet
7. 5mins-44-seconds-a-policy-change-shows-up
8. 10mins-47-seconds-in-3-security-check-ins-show-up-together

I'll post the movie to our internal google drive:
https://drive.google.com/file/d/1SaATukD3Y6UkThZTfrNGiAtJZiazaeDH

@EricDavisX EricDavisX added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 16, 2020
@elasticmachine

Pinging @elastic/ingest-management (Team:Ingest Management)

@EricDavisX

@ph @nchaulet @ferullo @gogochan - Dan, do you think Endpoint could play any part in this, or is it all on the Agent side? I don't know if anyone can help break down the timing of what is sent, and when, by the Agent or Endpoint. If this is all just 'showing it in the logs', I still feel it doesn't show as well as we'd like.

@ph ph added the bug Fixes for quality problems that affect the customer experience label Oct 19, 2020

ph commented Oct 19, 2020

I think this might be more on the Kibana side; the changes on the Agent and Endpoint should be almost instant. I've assigned @nchaulet to it and marked it as a bug.


ferullo commented Oct 21, 2020

> deploy Agent with default policy and new 7.11 'install' command. I used a centos 8 host

Did you mean 7.11 or 7.10?

I also doubt this is an Agent or Endpoint issue. Watching what happens on the host and/or sharing logs from Agent and Endpoint would confirm/dispute if it is a Kibana or host-side issue.

@ph ph added v7.10.0 and removed v7.11.0 labels Oct 21, 2020

ph commented Oct 21, 2020

I've moved it back to 7.10; I also think this is a Kibana issue. @nchaulet, any progress on this?

@nchaulet

Yes, I was just investigating it. There is a bug in Kibana for sure; it looks like the INTERNAL_POLICY_REASSIGN action is not working every time. Working on fixing it.


nchaulet commented Oct 21, 2020

After more investigation, there are two problems here:

First problem
There is a bug in Kibana in the policy reassignment; this should be fixed by #81376.
This will save ~5 minutes before the logs appear.

Second problem
It looks like the STATE POLICY and STATE RUNNING logs for Endpoint happen asynchronously from the config acknowledgment; this results in these events being sent during the next check-in, with a 5-minute delay.
This will be fixed in 7.11 as we rewrite how we send agent logs/agent status, so should we invest more time here to fix that, or is the 5-minute delay acceptable? @EricDavisX @ferullo
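A rough illustration of the second problem (hypothetical names, not the actual Agent/Endpoint code): events emitted asynchronously after the config acknowledgment sit in an outbound queue and are only flushed on the next periodic check-in, which can be up to 5 minutes later.

```typescript
// Hypothetical illustration, NOT the actual Agent/Endpoint code.
// Events produced after the config ack wait in a queue until the next
// periodic check-in flushes them.

type AgentEvent = { type: string };

class CheckinQueue {
  private pending: AgentEvent[] = [];

  // Events emitted between check-ins accumulate here.
  enqueue(event: AgentEvent): void {
    this.pending.push(event);
  }

  // A check-in drains everything queued so far; anything emitted after
  // this point waits for the next cycle (up to 5 minutes later).
  checkin(): AgentEvent[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}

// STATE POLICY / STATE RUNNING are emitted after the check-in that acks
// the config, so they are only reported on the following check-in.
```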

[Screenshot: Screen Shot 2020-10-21 at 2.50.21 PM]

@EricDavisX

There is a workaround: install the Agent with the Endpoint integration in the policy up front, instead of switching to it. This reduces the wait time to see the logs to circa 2 minutes, which is tolerable, if not fully optimized.

It is harder without seeing it in action, but knowing that we have a confirmed bug with a PR open which will reduce the cited scenario by a significant margin, I would be OK to get that PR (#81376) merged for 7.10 and call it sufficient. I submit we can accept this until the refactor in the 7.11 cycle.

Dan, I was using 7.11 as my test ground for some reason (I can't remember why); knowing that the code I wanted to assess was there, it was an OK testing ground. I'm pleased we can get this improved for 7.10 - nice work.


ghost commented Dec 1, 2020

Bug Conversion:

Created 1 test case for this ticket:
https://elastic.testrail.io/index.php?/cases/view/35172
