
Agent and beats store too much ReplicaSet data in K8s #5623

Closed · 6 tasks done · swiatekm opened this issue Sep 30, 2024 · 8 comments

Labels: Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

swiatekm commented Sep 30, 2024

When Deployment metadata is enabled, either in the agent K8s provider or in beats processors, agent and beats keep a local cache of ReplicaSet data. The only parts of this cache they actually need are the ReplicaSet name and owner references, which are enough to connect a Pod to its Deployment.
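To make that concrete, here is a simplified sketch of the lookup (not the actual provider code; the function name and the "namespace/name" cache key are invented for illustration):

```go
package k8smeta

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentNameForPod resolves the Deployment owning a Pod using only the
// two pieces of ReplicaSet data the cache has to retain: the ReplicaSet name
// (to look it up) and its owner references (to find the Deployment).
// replicaSets is keyed by "namespace/name".
func deploymentNameForPod(pod *corev1.Pod, replicaSets map[string]*appsv1.ReplicaSet) string {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "ReplicaSet" {
			continue
		}
		rs, ok := replicaSets[pod.Namespace+"/"+ref.Name]
		if !ok {
			return ""
		}
		for _, rsRef := range rs.OwnerReferences {
			if rsRef.Kind == "Deployment" {
				return rsRef.Name
			}
		}
	}
	return ""
}
```

Everything else on the cached ReplicaSet (spec, status, labels, annotations, managed fields) is dead weight for this purpose.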

In small to medium clusters this doesn't make that big a difference. In large clusters, however, you can have quite a few ReplicaSets - up to 10 for each Deployment, by default. This is compounded by the fact that we keep multiple copies of this data:

  • in the Kubernetes provider in elastic-agent
  • in the add_kubernetes_metadata processor in beats
  • in the metricbeat kubernetes module

I strongly suspect this is the primary root cause of the issue reported by our SRE Team in #4729, where elastic-agent is approaching 5 Gi of memory usage in a cluster with ~75k ReplicaSets.

This issue was split off from #4729 to avoid confusing it with an unrelated issue where agent uses too much memory on Pod data.

Data

Production

Below is a heap profile of agent running in the aforementioned cluster, provided by @henrikno in #4729 (comment):

[Image: heap profile of elastic-agent in the affected production cluster]

If you also look at the linked output of ps in the container, you can see that elastic-agent and the metricbeat instance collecting k8s metrics use far more memory than all the other processes.

Test cluster

I created a test environment in a local kind cluster on 3 Nodes, where I manually created 6500 ReplicaSets. I then tested the same elastic-agent workload, using the default standalone manifests, with deployment metadata additionally enabled in the kubernetes provider. I also set GOGC to 25 to make it easier to see the difference in actual memory usage. Finally, I built an elastic-agent image with all the fixes applied locally. The following shows the difference in memory usage, as measured by the system.process.memory.size metric:

[Image: memory usage with and without the fixes, as measured by system.process.memory.size]

Fix

Since the fix will require changes in at least 3 different components across two repositories, most of it should happen in https://github.com/elastic/elastic-agent-autodiscover. It'll require three major changes, tracked in this issue's task list.

Subsequently, we'll need to use these new components in both agent and beats. A PoC of how this may look in agent, with all the autodiscovery customizations, can be found in #5580.
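As a rough illustration of the direction (a sketch only, not the actual elastic-agent-autodiscover change; the function name and wiring are invented, though SetTransform itself is real client-go API), informers expose a transform hook that can strip each ReplicaSet down to the few fields we need before it is cached:

```go
package k8smeta

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// stripReplicaSet trims each ReplicaSet down to the fields needed to map a
// Pod to its Deployment before the object is stored by the informer.
// Spec, Status, labels and annotations are all dropped.
func stripReplicaSet(obj interface{}) (interface{}, error) {
	rs, ok := obj.(*appsv1.ReplicaSet)
	if !ok {
		// Tombstones and other wrapper types pass through unchanged.
		return obj, nil
	}
	return &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:            rs.Name,
			Namespace:       rs.Namespace,
			OwnerReferences: rs.OwnerReferences,
			ResourceVersion: rs.ResourceVersion,
		},
	}, nil
}

// The transform is registered on the informer before it starts, e.g.:
//
//	informer := factory.Apps().V1().ReplicaSets().Informer()
//	_ = informer.SetTransform(stripReplicaSet)
var _ cache.TransformFunc = stripReplicaSet
```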

elasticmachine commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

MichaelKatsoulis commented:

@swiatekm You wrote here

I created a test environment in a local kind cluster on 3 Nodes, where I manually created 6500 ReplicaSets.

So you had at least 6500 Pods running in a 3-node cluster? 110 Pods per node is the default Kubernetes limit.

swiatekm commented Oct 8, 2024

@swiatekm You wrote here

I created a test environment in a local kind cluster on 3 Nodes, where I manually created 6500 ReplicaSets.

So you had at least 6500 Pods running in a 3-node cluster? 110 Pods per node is the default Kubernetes limit.

No, I had 6500 ReplicaSets, all of which were scaled down to 0. I created them by first creating 1000 Deployments with 0 replicas each, and then changing the container image in each of them a bunch of times, spawning a new ReplicaSet each time.
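For anyone wanting to reproduce something similar, a rough client-go sketch of that load generation might look like the following (names, counts, image tags and the kubeconfig handling are illustrative, not the exact script I used; conflict retries on update are omitted):

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	zero := int32(0)

	for i := 0; i < 1000; i++ {
		name := fmt.Sprintf("load-test-%d", i)
		labels := map[string]string{"app": name}
		dep := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
			Spec: appsv1.DeploymentSpec{
				Replicas: &zero, // scaled to 0, so no Pods are ever scheduled
				Selector: &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{Name: "app", Image: "nginx:1.25.0"}},
					},
				},
			},
		}
		if _, err := client.AppsV1().Deployments("default").Create(ctx, dep, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		// Every image change rolls out a new ReplicaSet (old ones are kept up
		// to revisionHistoryLimit), so a handful of revisions per Deployment
		// adds up to thousands of ReplicaSets cluster-wide.
		for rev := 1; rev <= 6; rev++ {
			d, err := client.AppsV1().Deployments("default").Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				panic(err)
			}
			d.Spec.Template.Spec.Containers[0].Image = fmt.Sprintf("nginx:1.25.%d", rev)
			if _, err := client.AppsV1().Deployments("default").Update(ctx, d, metav1.UpdateOptions{}); err != nil {
				panic(err)
			}
		}
	}
}
```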

pkoutsovasilis commented Oct 8, 2024

@swiatekm You wrote here

I created a test environment in a local kind cluster on 3 Nodes, where I manually created 6500 ReplicaSets.

So you had at least 6500 Pods running in a 3-node cluster? 110 Pods per node is the default Kubernetes limit.

From my personal experience, what you say is true @MichaelKatsoulis, there is such a limit. But that only means the Pods would sit in the Pending state; the corresponding events would still flow through the informer, because the objects are created in the API server. So the workload can be considered equivalent to 6500 ReplicaSets reaching elastic-agent through the informer, right?

MichaelKatsoulis commented:

Yes, you are right. The Pods will either be Pending or not exist at all if the replicas are set to 0, but the Deployments and ReplicaSets will be created and the watcher will collect them.

In the past, the enrichment of a Pod with the Deployment name was done differently: for each Pod, a direct API call fetched the ReplicaSet owning that Pod, read the Deployment name from it, and appended it to the Pod.

We then figured that direct API calls were not the most efficient approach and switched to a ReplicaSet watcher, so that all ReplicaSet data is already in memory; when a Pod appears, it gets the Deployment name from the cached ReplicaSet data.

The problem, as you also mention, shows up in clusters with that many Deployments/ReplicaSets. Still, the ReplicaSet watcher approach is better than direct API calls: in normal scenarios the Deployments do create Pods, and thousands of Pods would lead to thousands of requests. The watcher/informer mechanism is clearly preferable.
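Roughly, the old per-Pod lookup had this shape (a sketch, not the actual beats code; the function name is made up, though the Get call against AppsV1().ReplicaSets() is real client-go API):

```go
package k8smeta

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deploymentNameViaAPI illustrates the older approach: every new Pod triggers
// a live GET of its owning ReplicaSet just to read that ReplicaSet's owner
// references. With thousands of Pods this becomes thousands of API requests,
// which is what the informer-based cache replaced.
func deploymentNameViaAPI(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (string, error) {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "ReplicaSet" {
			continue
		}
		rs, err := client.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
		if err != nil {
			return "", err
		}
		for _, rsRef := range rs.OwnerReferences {
			if rsRef.Kind == "Deployment" {
				return rsRef.Name, nil
			}
		}
	}
	return "", nil
}
```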

We should keep in mind that the PRs you have created affect the kubernetes provider and the add_kubernetes_metadata processor.
The kubernetes provider is used for log collection and for enriching Pods and containers with metadata.
The add_kubernetes_metadata processor is not used at all when the Kubernetes integration is installed: it is started by default and cannot be switched off, but the data it collects is later dropped because that metadata is already there.

For the kubernetes metrics collection, there is one watcher created per resource type. So if the state_replicaset datastream is enabled (it is by default), it will start a ReplicaSet watcher that collects everything. This is intentional, as it adds all ReplicaSet metadata to the events, unlike the ReplicaSet watcher in the kubernetes provider and processor, which is started just for the Deployment name.

So in the end, there could be as many as 3 replicaSet watchers that get started:

  1. In each elastic-agent due to the provider (if deployment: true is set in the add_resource_metadata config)
  2. In each elastic-agent due to the add_kubernetes_metadata processor (deployment: false is the default now, so in practice it will not start a ReplicaSet watcher)
  3. In leader elastic-agent due to the state_replicaset datastream

Additionally, the approach of selecting what to keep from the watcher's data could also be used for CronJobs. In environments with many CronJobs, the generated Pods are enriched with the CronJob name by starting a Job watcher, which exists just to obtain that name. We have also set cronjob: false by default in the add_resource_metadata config to mitigate memory issues.
We can probably reuse parts of the code from the PR in the elastic-agent-autodiscover lib to tackle this as well.

swiatekm commented Oct 8, 2024

Thanks for the explanation @MichaelKatsoulis, what you wrote is also consistent with what I've learned digging through the beats codebase over the past two weeks. And I agree that we can use the same approach for Jobs. I'm doing it for ReplicaSets because this is an issue that currently affects us internally, and I didn't want to complicate things by making another change at the same time.

For the kubernetes metrics collection, there is one watcher created per resource type. So if the state_replicaset datastream is enabled (it is by default), it will start a ReplicaSet watcher that collects everything. This is intentional, as it adds all ReplicaSet metadata to the events, unlike the ReplicaSet watcher in the kubernetes provider and processor, which is started just for the Deployment name.

I think it's fine to limit this watcher to metadata only as well. It can't use the default transform function I added, since it also needs labels and annotations, but it doesn't need data from the ReplicaSet spec; that is already present in the metric samples collected from kube-state-metrics.
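Something along these lines could work (just a sketch of the idea, not the actual change; the function name is invented): keep the ObjectMeta fields the metrics enrichment needs while still dropping the bulky spec and status.

```go
package k8smeta

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stripReplicaSetKeepMeta is the metrics-collection variant: unlike the
// provider/processor transform it preserves labels and annotations (needed
// for event enrichment) but still drops Spec and Status, which are already
// represented in the samples scraped from kube-state-metrics.
func stripReplicaSetKeepMeta(obj interface{}) (interface{}, error) {
	rs, ok := obj.(*appsv1.ReplicaSet)
	if !ok {
		return obj, nil
	}
	return &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:            rs.Name,
			Namespace:       rs.Namespace,
			Labels:          rs.Labels,
			Annotations:     rs.Annotations,
			OwnerReferences: rs.OwnerReferences,
			ResourceVersion: rs.ResourceVersion,
		},
	}, nil
}
```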

MichaelKatsoulis commented:

FYI, I opened #5788 for the Jobs watcher.

swiatekm commented:

This is now fixed in both agent and beats, in every maintained 8.x branch. Closing.
