Replace policy and action limiters with a checkin limiter #3255

Merged
merged 10 commits into from
Feb 9, 2024

@michel-laterman michel-laterman commented Feb 5, 2024

What is the problem this PR solves?

Scale tests for multiple policy changes are failing. A contributing factor is the policy limiter, which increases the time it takes for policies to be dispatched (and for the policy mutex lock to be held).

How does this PR solve the problem?

Replace the separate policy and action limiters with a unified limiter in the checkinT struct. The limiter is used if a response action (which includes policy change actions generated by the policy monitor) is detected in the checkin response and gzip is enabled.

This means that the policyMonitor will dispatch pending policies much faster and release the lock sooner, so that a policy can be updated or new subscriptions processed. Checkin responses are still rate limited, so we can reuse our gzip pool.

Note that the action_limit settings will be used and the policy_limit settings are ignored.
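
As a rough illustration of the behaviour described above, here is a minimal sketch of a limiter that is consulted only when a checkin response carries actions and gzip is enabled. It is not the code in this PR: `waiter`, `countingLimiter`, `checkinResponse`, `throttleIfNeeded` and `demo` are hypothetical names used for illustration, and the stand-in limiter only counts calls. The `Wait(ctx) error` method matches the one on `*rate.Limiter` from golang.org/x/time/rate, so a real limiter could presumably be passed in its place.

```go
package main

import (
	"context"
	"fmt"
)

// waiter is the single method this sketch needs from a limiter.
// *rate.Limiter from golang.org/x/time/rate has a Wait method with this signature.
type waiter interface {
	Wait(ctx context.Context) error
}

// countingLimiter is a hypothetical stand-in that records how often it is consulted.
type countingLimiter struct{ calls int }

func (c *countingLimiter) Wait(ctx context.Context) error {
	c.calls++
	return ctx.Err()
}

// checkinResponse is a simplified, hypothetical response shape.
type checkinResponse struct {
	Actions []string
}

// throttleIfNeeded consults the limiter only when the response carries
// actions and gzip is enabled; every other response skips the limiter.
func throttleIfNeeded(ctx context.Context, lim waiter, resp checkinResponse, gzipEnabled bool) error {
	if len(resp.Actions) == 0 || !gzipEnabled {
		return nil
	}
	return lim.Wait(ctx)
}

// demo runs three checkins and returns how many times the limiter was consulted.
func demo() int {
	lim := &countingLimiter{}
	ctx := context.Background()
	withAction := checkinResponse{Actions: []string{"POLICY_CHANGE"}}

	_ = throttleIfNeeded(ctx, lim, checkinResponse{}, true) // no actions: skipped
	_ = throttleIfNeeded(ctx, lim, withAction, false)       // gzip disabled: skipped
	_ = throttleIfNeeded(ctx, lim, withAction, true)        // actions and gzip: limited
	return lim.calls
}

func main() {
	fmt.Println("limiter consulted:", demo()) // prints "limiter consulted: 1"
}
```

Keeping the gate next to the response writer, rather than in the policy monitor, is what lets dispatch finish and release the lock without waiting on the limiter.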

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation pr here
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Replace the separate policy and action limiters with a unified limiter
in the checkinT struct that is used if a response action (which includes
policy change actions that are generated by the policy monitor) is
detected in the checkin response, and gzip is enabled.
@michel-laterman michel-laterman added bug Something isn't working Team:Fleet Label for the Fleet team labels Feb 5, 2024
@michel-laterman
Contributor Author

buildkite run perf-tests

@michel-laterman
Contributor Author

Serverless perf tests failed because ongoing AZ issues forced containers to run out of memory (OOM).
ECS perf test run is here: https://buildkite.com/elastic/observability-perf/builds/2367#018d7f6d-39a4-4480-90f7-09c8e788ab95 and it looks to have succeeded.

@michel-laterman michel-laterman added enhancement New feature or request and removed bug Something isn't working labels Feb 6, 2024
@michel-laterman
Contributor Author

latest ecs perf-test: https://buildkite.com/elastic/observability-perf/builds/2370#018d8079-17f9-4085-afd5-29bc28fe0433
looks like it's succeeding

} else if cfg.Limits.ActionLimit.Interval == 0 && cfg.Limits.PolicyThrottle == 0 {
rt = rate.Inf
}
zerolog.Ctx(context.TODO()).Debug().Any("event_rate", rt).Int("burst", cfg.Limits.ActionLimit.Burst).Msg("checkin response gzip limiter")
Contributor

nit: shouldn't there be a ctx passed instead of context.TODO()?

Contributor Author

Ideally yes, but we would need to make sure that the context is tied to the function instead of checkinT's lifecycle.
We also use context.TODO in a few other places similar to this.

Contributor Author

There's an existing issue to address this: #3087

Contributor

@juliaElastic juliaElastic left a comment

Code LGTM

@@ -99,6 +108,7 @@ func NewCheckinT(
gcp: gcp,
ad: ad,
tr: tr,
limit: rate.NewLimiter(rt, cfg.Limits.ActionLimit.Burst),
Contributor

@juliaElastic juliaElastic Feb 8, 2024

There is no ActionLimit.Max setting; does that mean only the Burst is being limited?

Contributor Author

Interval and burst configure the rate limiter: interval is the time it takes for one token to be added to the rate limit pool, and burst is the maximum pool size.
The max attribute was used to limit the total number of connections allowed on an endpoint (here we would use the checkin endpoint setting).
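
To make the interval/burst semantics concrete, here is a minimal token bucket sketch driven by explicit timestamps. It is only an illustration, not the golang.org/x/time/rate implementation that `rate.NewLimiter` provides; `bucket`, `newBucket`, `allow` and `demo` are hypothetical names, and the pool is assumed to start full.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a minimal token bucket. Credit is tracked as a duration:
// one token costs `interval`, and the pool holds at most burst*interval.
type bucket struct {
	interval time.Duration // time it takes to add one token
	max      time.Duration // burst * interval, the maximum pool size
	credit   time.Duration // currently available credit
	last     time.Time     // time of the previous call
}

// newBucket returns a bucket whose pool starts full (an assumption of this sketch).
func newBucket(interval time.Duration, burst int, now time.Time) *bucket {
	max := time.Duration(burst) * interval
	return &bucket{interval: interval, max: max, credit: max, last: now}
}

// allow adds credit for the elapsed time, caps it at the pool size,
// and then takes one token if one is available.
func (b *bucket) allow(now time.Time) bool {
	b.credit += now.Sub(b.last)
	b.last = now
	if b.credit > b.max {
		b.credit = b.max
	}
	if b.credit < b.interval {
		return false
	}
	b.credit -= b.interval
	return true
}

// demo uses interval=10ms and burst=2 and records each decision.
func demo() []bool {
	t0 := time.Unix(0, 0)
	b := newBucket(10*time.Millisecond, 2, t0)
	times := []time.Time{
		t0, t0, t0, // burst of 2 is spent, third call is rejected
		t0.Add(10 * time.Millisecond), // one interval later: one new token
		t0.Add(time.Second), t0.Add(time.Second), t0.Add(time.Second), // refill is capped at burst
	}
	var out []bool
	for _, ts := range times {
		out = append(out, b.allow(ts))
	}
	return out
}

func main() {
	fmt.Println(demo()) // prints "[true true false true true true false]"
}
```

Even after a full second of idle time, only two requests pass back to back, because the pool never grows beyond burst.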

@michel-laterman
Contributor Author

buildkite test this

@michel-laterman
Contributor Author

buildkite run perf-tests

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
88.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@michel-laterman michel-laterman merged commit c67e65d into elastic:main Feb 9, 2024
8 checks passed
@michel-laterman michel-laterman deleted the unify-limiters branch February 9, 2024 13:51
michel-laterman added a commit that referenced this pull request Feb 12, 2024
Labels
enhancement New feature or request Team:Fleet Label for the Fleet team