Skip to content
This repository has been archived by the owner on Aug 19, 2022. It is now read-only.

feat: Out of the box metrics for resource manager #54

Merged
merged 21 commits into from
Jul 1, 2022
Merged

Conversation

MarcoPolo
Copy link
Contributor

@MarcoPolo MarcoPolo commented Jun 21, 2022

This adds metrics for resource manager that allows an end user to see what's happening with their go-libp2p application.

This hooks into the existing tracing setup to emit metrics and aggregate them when appropriate.

Possible improvements:

  1. Record limits

Example graphs:

Note the histogram is not rendered correctly in this snapshot. See the screenshot below for how it should look.
https://snapshots-origin.grafana.net/dashboard/snapshot/nDCjmTCnHpmEpXgtcWO7ThEDS7B5DLLn?orgId=2

How the histogram should look like:
Screen Shot 2022-06-20 at 5 46 49 PM

I'll see if I can spin up a public ipfs dashboard to demo.

@MarcoPolo MarcoPolo linked an issue Jun 21, 2022 that may be closed by this pull request
"exemplar": true,
"expr": "histogram_quantile(0.50, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket)) - 0.1",
"interval": "",
"legendFormat": "p50 {{dir}} connections per peer",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo should be streams

obs/stats.go Outdated
Comment on lines 57 to 49
fibLikeDistribution = []float64{
1.1, 2.1, 3.1, 5.1, 8.1, 13.1, 21.1, 34.1, 55.1, 100.1, 200.1,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why this distribution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fibonaccci with +0.1 because of a quirk in opencensus that does bound non-inclusive bound checking: https:/census-instrumentation/opencensus-go/blob/v0.23.0/stats/view/aggregation_data.go#L195.

But the fib sequence itself is totally arbitrary. Happy to change this to something else if you have a better idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No better ideas. The go-to distributions are linear and exponential in the Prometheus ecosystem:

https:/prometheus/client_golang/blob/4ad265f1b4ee8d5c239b3f36081529bbe3a838db/prometheus/histogram.go#L75-L145

@marten-seemann
Copy link
Contributor

Looking at the dashboard, the p90 values are below the p50 values. That doesn't seem correct, or am I missing something?

@MarcoPolo
Copy link
Contributor Author

@MarcoPolo
Copy link
Contributor Author

Looking at the dashboard, the p90 values are below the p50 values. That doesn't seem correct, or am I missing something?

Resolved out of band. The aggregated number of peers was misinterpreted. I changed the color to make it more obvious this isn't p50

Copy link
Collaborator

@vyzo vyzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure this is optional, enabled with a constructor option?
We shouldn't be forcing down the throat of developers any metrics solution.

Note that with regards to UX, this can be easily aggregated in a go-libp2p constructor metrics option, which initializes various components (including the rcmgr) with metrics.

@MarcoPolo
Copy link
Contributor Author

Can we make sure this is optional, enabled with a constructor option?
We shouldn't be forcing down the throat of developers any metrics solution.

Yep. There's little value to these if they aren't hooked up to an exporter like prometheus.

Note that with regards to UX, this can be easily aggregated in a go-libp2p constructor metrics option, which initializes various components (including the rcmgr) with metrics.

Good idea!

obs/stats.go Outdated
var log = logging.Logger("rcmgrObs")

var (
systemOutboundConns = stats.Int64("system/outbound/conn", "Number of outbound Connections", stats.UnitDimensionless)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

@MarcoPolo
Copy link
Contributor Author

I tried getting exemplars through, but while you can attach an exemplar via opencensus the opencensus prometheus exporter doesn't support emitting it.

@BigLep BigLep mentioned this pull request Jun 28, 2022
41 tasks
Copy link
Contributor

@marten-seemann marten-seemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably unify the scope name parsing at some point (doesn't need to be this PR).
#52 added a lot of similar logic for generating trace events.

trace.go Outdated
func WithTraceReporter(reporter TraceReporter) Option {
return func(r *resourceManager) error {
if r.trace == nil {
r.trace = &trace{reporters: []TraceReporter{reporter}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
r.trace = &trace{reporters: []TraceReporter{reporter}}
r.trace = &trace{}

Otherwise you'll add this reporter twice.


StreamView = &view.View{Measure: streams, Aggregation: view.Sum(), TagKeys: []tag.Key{directionTag, scopeTag, serviceTag, protocolTag}}
PeerStreamsView = &view.View{Measure: peerStreams, Aggregation: view.Distribution(oneTenThenExpDistribution...), TagKeys: []tag.Key{directionTag}}
PeerStreamNegativeView = &view.View{Measure: peerStreamsNegative, Aggregation: view.Distribution(oneTenThenExpDistribution...), TagKeys: []tag.Key{directionTag}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize we've talked about this before, but I'm still confused by it. Seems weird that it takes this convoluted way to get a bucketed view. Is this standard practice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. From https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.

Our metrics don't have a notion of removing a previous observation. The best we can do is add a new negative observation that subtracts the previous observation, then add a new observation.

So we need to make two observations:

  1. We want to record the old number of streams this peer had (this is negative).
  2. We want to record the current number of streams this peer has.

In order to make a negative observation with histograms we need to record it in a separate measurement, then combine it later.

That's my understanding, but it's possible I'm misunderstanding something. @mxinden does the above seem correct to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our metrics don't have a notion of removing a previous observation.

I am not familiar enough with OpenCensus, thus I am not sure what their best practice is.

From https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.

I don't think this is referring to undoing a previous observation, but only to record both negative and positive observations. That is not to say that your trick here won't work. Mind sharing the queries you are using here? Are you actually using the histogram_quantile function? Note that histogram_quantile calculates the quantiles over time, thus I am not sure the above undo logic is compatible with it.

To come back to the original goal, do I understand correctly that you would like to answer the following question:

How many currently connected peers have 1, 10, 100, 1000, 10000 (arbitrary distribution) streams?

In case you only want to visualize the above with a bar chart, but not calculate quantiles, how about exposing a Prometheus Gauge for each of these buckets. You can inc and dec a Prometheus Gauge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that histogram_quantile calculates the quantiles over time

To be clear https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile uses an instant-vector so it calculates the quantile at each point in time. As opposed to something like rate which uses a range vector that can look at multiple timestamps for each point. (I'm sure you know this, just adding it here for completeness).

Mind sharing the queries you are using here? Are you actually using the histogram_quantile function?

histogram_quantile(0.90, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket))

This is similar to the commonly used example: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])). Except instead of using rate to get the rate of increase (by taking the current value and comparing it with a previous value) and calculating quantiles from that I'm using these two histograms to create a histogram of the current state and calculating quantiles with that.

Let me rephrase:

  1. We agree this example is correct: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])). Since it's from the docs.
  2. This example calculates quantiles on a new histogram. This new histogram is generated by doing http_request_duration_seconds_bucket - http_request_duration_seconds_bucket[previous_time] with the rate function (https:/prometheus/prometheus/blob/main/promql/functions.go#L74)
  3. What we're doing is keeping track of our last observation (the negative bucket), and generating a new histogram with our current state of the world with rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_bucket@previous_observation (aka negative bucket). Which I'm arguing is the same as 2, except we are explicit in what our previous observation was.
    a. Aside, we could use rate here as well and only use the one bucket, but then we'd be only seeing the differences of when a peer moves to a different bucket which is not as useful as seeing the current distribution of streams per peer. There might be a way with promql to do what I want, but I haven't figured it out (essentially what's the difference between the current value and the last different value).

Or another way to think about this is:

  1. Assume rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket generates a valid histogram, let's call it H.
  2. We should be able to caluclate the quantiles from this histogram H with the histogram_quantile function. (Which is actually quite simple: https:/prometheus/prometheus/blob/main/promql/quantile.go#L101.)

So the question becomes, is our assumption of 1 here correct? I'm arguing it is because it is the histogram of our current state of streams per peer. It also looks exactly like a histogram metric promql would expect (cumulative le distributions).

I know that's quite a lot, and I appreciate your input here! Thank you for thinking about this and checking my work :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that histogram_quantile calculates the quantiles over time

To be clear https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile uses an instant-vector so it calculates the quantile at each point in time. As opposed to something like rate which uses a range vector that can look at multiple timestamps for each point. (I'm sure you know this, just adding it here for completeness).

Mind sharing the queries you are using here? Are you actually using the histogram_quantile function?

histogram_quantile(0.90, (rcmgr_trace_metrics_peer_streams_bucket - rcmgr_trace_metrics_peer_streams_negative_bucket))

This is similar to the commonly used example: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])). Except instead of using rate to get the rate of increase (by taking the current value and comparing it with a previous value) and calculating quantiles from that I'm using these two histograms to create a histogram of the current state and calculating quantiles with that.

This is really clever! To be honest, I didn't think about not using rate with a histogram_quantile. Thanks for the explainer! 🙇

3. Aside, we could use rate here as well and only use the one bucket, but then we'd be only seeing the differences of when a peer moves to a different bucket which is not as useful as seeing the current distribution of streams per peer.

👍

3. There might be a way with promql to do what I want, but I haven't figured it out (essentially what's the difference between the current value and the last different value).

Isn't this impossible at the data level as one would need to know which decrease in one bucket matches a corresponding increase in another bucket and which increase is a new peer, thus does not have a decrease in another bucket?

@mxinden does the above seem correct to you?

The above seems correct to me!

In case you can influence the Prometheus metric # HELP text, I think it is worth a mention of the concept you are using here.

obs/stats.go Outdated Show resolved Hide resolved
rcmgr.go Outdated Show resolved Hide resolved
Copy link
Contributor

@marten-seemann marten-seemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is still unhappy about this PR though.

@MarcoPolo
Copy link
Contributor Author

(rebased off master to fix conflict)

@MarcoPolo MarcoPolo merged commit b8369c6 into master Jul 1, 2022
@MarcoPolo MarcoPolo deleted the marco/50-obs branch July 1, 2022 19:33
@ajnavarro ajnavarro mentioned this pull request Aug 24, 2022
72 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Usable out of the box metrics
4 participants