
[Fleet] Fleet takes 10 minutes from install of Agent & policy change to show Endpoint state 'Running' #80930

Closed
EricDavisX opened this issue Oct 16, 2020 · 9 comments · Fixed by #81376
Assignees
Labels
bug Fixes for quality problems that affect the customer experience regression Team:Fleet Team label for Observability Data Collection Fleet team v7.10.0



EricDavisX commented Oct 16, 2020

We did some performance tuning and increased the general polling time to 5 minutes; I believe it is related to #75552.

Then we intended at least a partial fix here: #78493

I'm not sure that, between the two, we have the user experience we want. I have recorded a movie and captured the relevant times in screenshots to tell the story.

I feel this is hurting the initial user experience (and demo experience), and I think it may be causing automated tests that were written prior to the 5-minute change to fail.

Kibana version:
7.11 snapshot with 7.11 Observability CI agent deployed to cloud

Browser version:
Chrome on Mac

Describe the bug:
When I install the Agent, it is fairly quick to show some progress in the Activity Details, but when I change the policy it takes a full 5 minutes for the log to show that the Agent has received the policy change, and then 5 more minutes to show Endpoint as Running.

Steps to reproduce:

  1. Deploy the Agent with the default policy and the new 7.11 'install' command. I used a CentOS 8 host.
  2. Wait for it to come online: a few seconds to see it enrolling, 10+ more to see it 'online', though with no data in the 'Activity log'. Wait about a minute to see Beats online.
  3. Change the policy to one with 'Endpoint' and wait 5 minutes, then maybe wait more.

Expected behavior:

  1. In prior versions it was faster, perhaps up to 60 seconds total for Endpoint to start up and show as Running. I'd like it to be as quick as that if possible.

My original thought (if this is at all feasible) was to have the check-in interval be dynamic after a policy change comes in: perhaps for the next 5 minutes it can check every minute, and after that go back to the 5-minute poll. Or the temporary 1-minute poll could run for a shorter window if it isn't needed. I'm not 100% sure what is happening between Endpoint and Agent here; I just know it feels long.
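For what it's worth, the dynamic check-in idea could look something like this rough TypeScript sketch (`nextCheckinDelay` and the interval constants are made-up names for illustration, not the actual Agent code):

```typescript
// Hypothetical sketch of a dynamic check-in interval, NOT the actual Agent code.
// After a policy change, poll every minute for a 5-minute window, then fall
// back to the regular 5-minute interval.

const REGULAR_INTERVAL_MS = 5 * 60 * 1000; // normal 5-minute poll
const FAST_INTERVAL_MS = 60 * 1000;        // temporary 1-minute poll
const FAST_WINDOW_MS = 5 * 60 * 1000;      // how long to stay in fast mode

function nextCheckinDelay(lastPolicyChangeAt: number, now: number): number {
  // Inside the fast window after a policy change, check in every minute.
  if (now - lastPolicyChangeAt < FAST_WINDOW_MS) {
    return FAST_INTERVAL_MS;
  }
  // Otherwise use the regular 5-minute poll.
  return REGULAR_INTERVAL_MS;
}
```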

The movie is boring but tells a good story. For the sake of posting to the issue, I have captured the moments where something happens and is shown to the user in the Activity Log.

sequence:
1. 0mins-21seconds-enrolling
2. 0mins-29seconds-online
3. 0mins-53seconds-running-and-start-to-change-policy
4. 1min-0seconds-policy-reassigned-toast
5. 5mins-33-seconds-no-endpoint-activity
6. 10mins-13-seconds-no-endpoint-activity-yet
7. 5mins-44-seconds-a-policy-change-shows-up
8. 10mins-47-seconds-in-3-security-check-ins-show-up-together

I'll post the movie to our internal google drive:
https://drive.google.com/file/d/1SaATukD3Y6UkThZTfrNGiAtJZiazaeDH

@EricDavisX EricDavisX added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 16, 2020
@elasticmachine

Pinging @elastic/ingest-management (Team:Ingest Management)

@EricDavisX

@ph @nchaulet @ferullo @gogochan - Dan, do you think Endpoint could play any part in this, or is it all on the Agent side? I don't know if anyone can help break down the timing of what is sent, and when, by the Agent or Endpoint. If this is all just 'showing it in the logs', I still feel it doesn't show as well as we'd like.

@ph ph added the bug Fixes for quality problems that affect the customer experience label Oct 19, 2020

ph commented Oct 19, 2020

I think this might be more on the Kibana side; the changes on the Agent and Endpoint should be almost instant. I've assigned @nchaulet to it and marked it as a bug.


ferullo commented Oct 21, 2020

> deploy Agent with default policy and new 7.11 'install' command. I used a centos 8 host

Did you mean 7.11 or 7.10?

I also doubt this is an Agent or Endpoint issue. Watching what happens on the host and/or sharing logs from Agent and Endpoint would confirm/dispute if it is a Kibana or host-side issue.

@ph ph added v7.10.0 and removed v7.11.0 labels Oct 21, 2020

ph commented Oct 21, 2020

I've moved it back to 7.10; I also think this is a Kibana issue. @nchaulet, any progress on this?

@nchaulet

Yes, I was just investigating it. There is a bug in Kibana for sure; it looks like the INTERNAL_POLICY_REASSIGN action is not working every time. Working on fixing it.


nchaulet commented Oct 21, 2020

After more investigation, there are two problems here:

First problem
There is a bug in Kibana in the policy reassignment; this should be fixed by #81376.
This will save ~5 minutes before the logs appear.

Second problem
It looks like the STATE POLICY and STATE RUNNING logs for Endpoint happen asynchronously from the config acknowledgment; this results in these events being sent during the next check-in, with a 5-minute delay.
This will be fixed in 7.11 as we rewrite how we send agent logs/agent status, so should we invest more time here to fix that, or is the 5-minute delay acceptable? @EricDavisX @ferullo
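A rough illustration of the second problem (hypothetical names, not the actual Agent/Endpoint code): events emitted asynchronously after the config acknowledgment sit in an outbound queue and are only flushed on the next periodic check-in, which can be up to 5 minutes later.

```typescript
// Hypothetical illustration, NOT the actual Agent/Endpoint code.
// Events produced after the config ack wait in a queue until the next
// periodic check-in flushes them.

type AgentEvent = { type: string };

class CheckinQueue {
  private pending: AgentEvent[] = [];

  // Events emitted between check-ins accumulate here.
  enqueue(event: AgentEvent): void {
    this.pending.push(event);
  }

  // A check-in drains everything queued so far; anything emitted after
  // this point waits for the next cycle (up to 5 minutes later).
  checkin(): AgentEvent[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}

// STATE POLICY / STATE RUNNING are emitted after the check-in that acks
// the config, so they are only reported on the following check-in.
```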

[Screenshot: Screen Shot 2020-10-21 at 2.50.21 PM]

@EricDavisX

There is a workaround: install the Agent with the Endpoint integration in the policy up front, instead of switching to it. This reduces the wait time to see the logs to circa 2 minutes, which is tolerable, if not fully optimized.

It is harder without seeing it in action, but knowing that we have a confirmed bug with a PR open which will reduce the cited scenario by a significant margin, I would be OK to get that PR (#81376) merged for 7.10 and call it sufficient. I submit we can accept this until the refactor in the 7.11 cycle.

Dan, I was using 7.11 as my test ground for some reason (I can't remember why); knowing that the code I wanted to assess was there, it was an OK testing ground. I'm pleased we can get this improved for 7.10 - nice work.


ghost commented Dec 1, 2020

Bug Conversion:

Created 1 test case for this ticket:
https://elastic.testrail.io/index.php?/cases/view/35172
