Move detailed per CPU kernel stats under a new flag METRICS_V2_KERNEL_COUNTERS_PER_CPU #2028

Closed
incertum opened this issue Aug 28, 2024 · 6 comments · Fixed by #2031
Labels
kind/feature New feature or request

@incertum
Contributor

The new detailed per-CPU kernel counters are great. At the same time, they can add a lot of metrics, especially on beefy servers with 128+ CPUs. I therefore propose moving them under a new flag, e.g. METRICS_V2_KERNEL_COUNTERS_PER_CPU, and exposing a sub-config key in Falco so that users can opt in to or out of these new counters.

CC @Andreagit97 WDYT?

If we all agree we should prioritize this for 0.18.0 so that we have a coherent UX.
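
To illustrate the proposal above, here is a minimal sketch of how a separate category flag could gate the per-CPU counters. It is not the libs implementation: the `MetricsFlags` class, the bit values, the `collect_kernel_metrics` helper, and the metric names are all assumptions made up for this example. Only the flag name METRICS_V2_KERNEL_COUNTERS_PER_CPU comes from the issue.

```python
from enum import IntFlag


class MetricsFlags(IntFlag):
    """Illustrative bitmask. Bit values and the aggregate flag are assumptions."""
    KERNEL_COUNTERS = 1 << 0
    # Stands in for the proposed METRICS_V2_KERNEL_COUNTERS_PER_CPU flag.
    KERNEL_COUNTERS_PER_CPU = 1 << 1


def collect_kernel_metrics(flags, per_cpu_drops):
    """Return a dict of metric name -> value for the enabled categories."""
    metrics = {}
    if flags & MetricsFlags.KERNEL_COUNTERS:
        # Aggregate counter: one field regardless of CPU count.
        metrics["n_drops"] = sum(per_cpu_drops)
    if flags & MetricsFlags.KERNEL_COUNTERS_PER_CPU:
        # Detailed counters: one field per CPU, so output grows with the machine.
        for cpu, drops in enumerate(per_cpu_drops):
            metrics[f"n_drops_cpu_{cpu}"] = drops
    return metrics


# Hypothetical drop counts on a 128-CPU host.
per_cpu_drops = [0] * 128
per_cpu_drops[3] = 7

aggregate_only = MetricsFlags.KERNEL_COUNTERS
opted_in = aggregate_only | MetricsFlags.KERNEL_COUNTERS_PER_CPU

print(len(collect_kernel_metrics(aggregate_only, per_cpu_drops)))  # 1
print(len(collect_kernel_metrics(opted_in, per_cpu_drops)))        # 129
```

With the per-CPU bit unset, the snapshot stays at a single field; opting in adds one field per CPU on top of it.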

@incertum incertum added the kind/feature New feature or request label Aug 28, 2024
@incertum
Contributor Author

/milestone 0.18.0

@poiana poiana added this to the 0.18.0 milestone Aug 28, 2024
@Andreagit97
Member

Yes, it makes sense, I will take care of it! Thank you for the suggestion

@incertum
Contributor Author

Thank you Andrea!

A few more thoughts:

  • So far we have always enabled all metrics categories in Falco, while metrics themselves are disabled by default. For this metric category, I was thinking we should disable it by default. What are your thoughts?
  • We could also consider exposing statistical metrics in the future, such as the kurtosis or skewness of these counters, whenever a snapshot is taken in libsinsp. In some use cases, you might prefer to opt into receiving only these statistical metrics instead of all the raw counter fields. Other use cases may still require the raw counters. Happy to help with this if we all agree it's useful. Mentioning it now in case it is useful to shape the design.
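
As a rough illustration of the second bullet, the sketch below reduces one counter's per-CPU values to a handful of summary statistics. It uses plain population moments; the `summarize` helper, its field names, and the sample values are assumptions for this example and do not reflect anything that exists in libsinsp.

```python
import math


def summarize(values):
    """Population mean, stddev, skewness and excess kurtosis of one counter across CPUs."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Every CPU reports the same value: the shape statistics are undefined,
        # so this sketch reports 0.0 by convention.
        return {"mean": mean, "stddev": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    return {
        "mean": mean,
        "stddev": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Hypothetical snapshot: 127 idle CPUs and one CPU that dropped 128 events.
snapshot = [0] * 127 + [128]
stats = summarize(snapshot)
print(stats)
```

A strongly positive skewness here signals that a few CPUs account for most of the drops, which is the kind of imbalance the raw per-CPU fields would otherwise be used to spot.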

@Andreagit97
Member

> So far we always enabled all metrics categories in Falco, while metrics themselves are disabled by default. For this metric category was thinking we should disable it by default. What are your thoughts?

Yep I agree

> We could also consider exposing statistical metrics in the future, such as the kurtosis or skewness of these counters, whenever a snapshot is taken in libsinsp. In some use cases, you might prefer to opt into receiving only these statistical metrics instead of all the raw counter fields. Other use cases may still require the raw counters. Happy to help with this if we all agree it's useful. Mentioning it now in case it is useful to shape the design.

Yeah, it seems like a great idea, but I'm not sure what the right place to obtain them is. I'm definitely not an expert in metrics representation, but isn't it possible to obtain this kind of data directly in Prometheus or something similar, starting from the metrics we expose today?

@incertum
Contributor Author

>> We could also consider exposing statistical metrics in the future, such as the kurtosis or skewness of these counters, whenever a snapshot is taken in libsinsp. In some use cases, you might prefer to opt into receiving only these statistical metrics instead of all the raw counter fields. Other use cases may still require the raw counters. Happy to help with this if we all agree it's useful. Mentioning it now in case it is useful to shape the design.

> Yeah, it seems a great idea but I'm not sure what is the right place to obtain them. Definitely not an expert in metrics representation but is not possible to obtain these kinds of data directly in Prometheus or something similar starting from the metrics we expose today?

Off the top of my head, one pro would be saving the space needed to send many raw metric fields on machines with many CPUs.
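
The space argument can be made concrete with a small back-of-the-envelope sketch. The `exported_fields` helper and the numbers fed to it (two counters per CPU, four summary statistics per counter) are assumptions chosen only for illustration, not figures from libs or Falco.

```python
def exported_fields(n_cpus, counters_per_cpu, stats_per_counter=None):
    """Number of metric fields sent per snapshot.

    With stats_per_counter=None, every raw per-CPU counter is exported.
    Otherwise only that many summary statistics are exported per counter.
    """
    if stats_per_counter is None:
        return n_cpus * counters_per_cpu
    return stats_per_counter * counters_per_cpu


# Hypothetical 128-CPU host with two per-CPU counters.
raw = exported_fields(n_cpus=128, counters_per_cpu=2)
summary = exported_fields(n_cpus=128, counters_per_cpu=2, stats_per_counter=4)
print(raw, summary)  # 256 8
```

The raw field count scales with the number of CPUs, while the summary count depends only on how many counters and statistics are kept.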

@Andreagit97
Member

That's a good point. I see these per-CPU metrics more as a debug option than something to run in production by default. If we end up using them, we are probably in a critical situation with performance issues, so the overhead they introduce could be acceptable since we are trying to debug. So yes, statistical metrics are certainly useful, but I'm not 100% sure this is the right case to apply them. WDYT?

3 participants