
Metrics semantic convention: "up" metric #1078

Closed
jmacd opened this issue Oct 8, 2020 · 18 comments
Labels: area:data-model, area:semantic-conventions, enhancement, priority:p2, release:allowed-for-ga, spec:metrics

Comments

@jmacd
Contributor

jmacd commented Oct 8, 2020

What are you trying to achieve?

up is a standard metric in Prometheus systems to indicate that a particular combination of job and instance was observed to be healthy. This comes from an active role taken by Prometheus servers in collecting metrics, but OpenTelemetry OTLP exporters can synthesize the same information on export to indicate that they are, in fact, up.

This issue proposes we introduce and specify this metric. Prometheus specifies this as a 0- or 1-valued metric labeled with the job and instance labels. In OpenTelemetry the natural expression of this would be a label-free metric named "up", again 0- or 1-valued, reported along with the monitored Resource.

Additional context.

This metric would also be synthesized in receivers for other metrics protocols. For example, the OTel collector's Prometheus or OpenMetrics receiver would be changed to generate this metric when it scrapes a target.
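For illustration, a minimal sketch of how an SDK-side process could expose such a label-free, 0-or-1 liveness metric. This uses the present-day OTel Go metrics API, which postdates this discussion, and the meter and instrument names are placeholders only:

```go
package liveness

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// registerUp registers an asynchronous gauge that always observes 1.
// If the process is able to export at all, the backend sees up == 1;
// the job/instance identity comes from the Resource, not from labels
// on the data points themselves.
func registerUp() error {
	meter := otel.Meter("liveness")
	_, err := meter.Int64ObservableGauge("up",
		metric.WithDescription("1 if the process is alive and exporting"),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(1)
			return nil
		}),
	)
	return err
}
```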

@jmacd jmacd added area:sdk Related to the SDK spec:metrics Related to the specification/metrics directory labels Oct 8, 2020
@arminru arminru added area:semantic-conventions Related to semantic conventions enhancement New feature or request labels Oct 9, 2020
@andrewhsu andrewhsu added priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 9, 2020
@jgals

jgals commented Oct 9, 2020

I'd like to work on this one with @cwildman. You can assign it to me @andrewhsu.

@jmacd
Contributor Author

jmacd commented Nov 12, 2020

Wondering if we should mark this release:allowed-for-ga?

This is an important step toward a push-based metrics pipeline producing data equivalent to what a pull-based metrics pipeline would produce, which is a prerequisite for existing Prometheus users (or a barrier to migrating to OTel SDKs), but it is not a requirement (maybe?).

cwildman added a commit to cwildman/opentelemetry-specification that referenced this issue Nov 12, 2020
@jmacd jmacd added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Dec 3, 2020
@andrewhsu andrewhsu added priority:p2 Medium priority level and removed priority:p1 Highest priority level labels Dec 8, 2020
@jmacd
Contributor Author

jmacd commented Jan 28, 2021

Objective

Define how to transform an OTel-Collector Prometheus receiver (pull) stream into OTLP such that it transforms into exactly the Prometheus Remote Write (PRW) output that Prometheus or the Grafana agent would write. State how the OTel-Collector receiver should generate the up metric and staleness markers so as to be semantically equivalent to Prometheus or the Grafana agent in this regard.

AND

Identify a way to transform an OTel SDK's exporter (push) stream into OTLP such that it transforms into semantically correct PRW output, using the same OTLP-to-PRW translation described above. Define a push-based method for reporting liveness that can be semantically translated into up by a collector when received through the OTLP receiver and exported through a PRW exporter.

Background: up and staleness markers in the pull data model

The up metric is synthesized in the Prometheus server after completing a scrape: 1 for success, 0 for failure. The interpretation of a 0 is that the scrape failed, and when that happens, staleness markers are entered into all the timeseries that target has produced. The PRW consumer will see NaN values in the timeseries, and the interpretation is that the process is either (a) not alive, (b) has a disconnected network, (c) has a faulty SDK, or (d) something else entirely.

When up recovers we may be able to determine whether the process reset compared with other causes, but this knowledge has to be gained indirectly (e.g., from another metric or label value). We cannot directly determine that the process crashed, just that it couldn't be monitored.

An up value of 1 means that the process was alive and that its metrics SDK and the intervening network were functioning.

Proposal: pushing an up metric

The up metric in the pull model says something about liveness AND the ability to deliver metrics. To accomplish the same interpretation in a push model, this proposal suggests starting with a liveness metric (e.g., alive), an OTLP Non-Monotonic Cumulative Sum data point set to a constant 1. NMCS points have two timestamps, a start and an end. The start time should be the process start time (when it first became alive); the end time should be the time when the report was generated.

An OTel collector will receive this report through an OTLP receiver. When these points pass through to the PRW exporter, the name "alive" will be replaced by up. The timestamp on the emitted up metric point should be the end of the window: the last moment the process was known to be alive and successfully reporting metrics.

Proposal: writing staleness markers from PRW

Since a dead task can't possibly push staleness markers about itself, it will be the OTel-Collector PRW exporter's responsibility to export staleness markers for streams that have gone stale. We can identify staleness in several ways:

Push case: OTLP resources indicate a unique identifier for each process. When the service.instance.id changes within a timeseries, a staleness marker should be entered into the stream between the last reported point and the point from the new process.

Pull case: The Prometheus up metric, written in the Prometheus receiver, should signal to the PRW writer that all streams from the same target (i.e., job and instance are identical) should receive a staleness marker.
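To make the intended exporter behavior concrete, here is a rough sketch of the two translations described above. The types are illustrative stand-ins, not the collector's actual pdata or remote-write structures:

```go
package prwsketch

import "time"

// Sample is an illustrative stand-in for a single point in a timeseries.
type Sample struct {
	Name       string
	Value      float64
	Start, End time.Time
}

// aliveToUp rewrites a pushed "alive" point as the "up" sample a PRW
// consumer expects, keeping only the end-of-window timestamp (the last
// moment the process was known to be alive and reporting).
func aliveToUp(s Sample) Sample {
	if s.Name != "alive" {
		return s
	}
	return Sample{Name: "up", Value: s.Value, End: s.End}
}

// needsStalenessMarker covers the push case above: if the
// service.instance.id recorded for a stream differs from the one on a newly
// arriving point, a staleness marker should be written between the last
// point of the old process and the first point of the new one.
func needsStalenessMarker(lastInstanceID, newInstanceID string) bool {
	return lastInstanceID != "" && lastInstanceID != newInstanceID
}
```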

@jmacd jmacd added area:data-model For issues related to data model and removed area:sdk Related to the SDK labels Feb 4, 2021
@fabxc

fabxc commented Feb 5, 2021

The PRW consumer will see NaN values in the timeseries, and the interpretation is that the process is either (a) not alive, (b) has a disconnected network, (c) has a faulty SDK, or (d) something else entirely.

NaNs are actually valid series values, too. Since there are many possible NaN representations, Prometheus defines two specific ones: "value NaNs" and "staleness NaNs" (see pkg/value).

So a proper consumer of Prometheus data has to distinguish between these two NaN variants. In Prometheus itself that happens at read time in the query engine. In OT it would need to happen in the write path.
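For reference, a minimal sketch of the check such a write-path consumer would need. The bit patterns mirror the NormalNaN and StaleNaN constants in Prometheus's pkg/value package:

```go
package nansketch

import "math"

const (
	normalNaN uint64 = 0x7ff8000000000001 // an ordinary, valid NaN sample value
	staleNaN  uint64 = 0x7ff0000000000002 // the staleness marker
)

// isStaleNaN reports whether a sample value is a staleness marker rather
// than a legitimate NaN value; a plain math.IsNaN check cannot tell the
// two apart.
func isStaleNaN(v float64) bool {
	return math.Float64bits(v) == staleNaN
}
```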

The up metric in the pull model says something about liveness AND the ability to deliver metrics.

But it also says something about many other things in between being configured and running correctly.
IIUC, either up is not there or it is set to 1. That means all the states that up == 0 usually covers in Prometheus are not covered by this proposal.

To properly cover all cases, the OT collector would need to run some form of target discovery like in Prometheus and pre-initialize up = 0 for all these potential targets. Then flip to up = 1 on receiving a pushed alive metric and then apply heuristics to flip it back to up = 0 if no new alive marker has been received in a while.
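For concreteness, a rough sketch of what that state tracking might look like (hypothetical types, not an actual collector component):

```go
package upsketch

import (
	"sync"
	"time"
)

// upTracker pre-initializes discovered targets at up = 0, flips them to 1
// when an alive push arrives, and decays them back to 0 when no alive point
// has been seen for longer than the configured timeout.
type upTracker struct {
	mu        sync.Mutex
	timeout   time.Duration
	lastAlive map[string]time.Time // keyed by target identity (e.g. job/instance)
}

// Discover registers a target found by service discovery; until an alive
// point arrives it reports up = 0.
func (t *upTracker) Discover(target string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.lastAlive[target]; !ok {
		t.lastAlive[target] = time.Time{}
	}
}

// ObserveAlive records a pushed alive point for the target.
func (t *upTracker) ObserveAlive(target string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastAlive[target] = at
}

// Up reports 1 if an alive point has been received recently enough, 0 otherwise.
func (t *upTracker) Up(target string, now time.Time) float64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, ok := t.lastAlive[target]
	if !ok || last.IsZero() || now.Sub(last) > t.timeout {
		return 0
	}
	return 1
}
```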

Whether the effort is worth it (to implement, but more importantly for the user to configure additional target discovery for a single metric) is questionable. But without it I'm not sure it's a win for the user to have a simulated up metric with notably different semantics.
One metric name should mean one thing. So simply exposing the suggested alive metric under its own name seems perfectly fine to me. A lot of the same use cases as up can be solved, and the user can clearly interpret up and alive respectively.

Push case: When the service.instance.id changes within a timeseries, a staleness marker should be entered into the stream between the last reported point and the point from the new process.

Is service.instance.id a unique identifier generated at application startup? Is it a label or some kind of non-identifying annotation?

In general the proposal means that the OT collector needs to track state for all series that have been passing through? That seems to give up on one of the major benefits of push-based clients. To reliably implement tracking (mostly it's garbage collection) it seems like some form of target discovery in the OT collector would be necessary.

Pull case: The Prometheus up metric, written in the Prometheus receiver, should signal to the PRW writer that all streams from the same target (i.e., job and instance are identical) should receive a staleness marker.

Note that staleness markers are also set in Prometheus if a series disappears across two successful scrapes. So in principle one has to keep state and diff the series of one scrape against the last. (The Prometheus scrape library does this out-of-the-box.)
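A sketch of that per-scrape diff, in the same illustrative spirit as the snippets above:

```go
package diffsketch

// seriesGoneStale returns the series identities that were present in the
// previous successful scrape of a target but are missing from the current
// one; each of these should receive a staleness marker.
func seriesGoneStale(previous, current map[string]struct{}) []string {
	var stale []string
	for id := range previous {
		if _, ok := current[id]; !ok {
			stale = append(stale, id)
		}
	}
	return stale
}
```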

@jmacd
Contributor Author

jmacd commented Feb 9, 2021

@fabxc Thank you! This is very helpful. It begins to look like the up metric should be thought of as a JOIN operation between an output from service discovery and the process itself. Suppose we used the alive metric as proposed above for the process to report on itself, while service discovery pushes a metric such as present that, when joined with alive, yields the intended meaning of up? If both alive and present are 1, then the process is up. All other combinations of present and alive should yield an up that is stale (and/or other distinct forms of NaN) or 0. I realize we can never exactly reproduce the behavior of a Prometheus server without being a Prometheus server, but our goal is not that; it's just to define the semantics.

@jmacd
Contributor Author

jmacd commented Feb 9, 2021

The problem at hand, also stated in open-telemetry/wg-prometheus#8, may be connected with this thread about late-binding resources #1298 (comment).

Suppose we replace the "Service Discovery" component in an OTel-Collector Prometheus receiver with an independent service discovery metrics receiver that simply produces the present metric, as described above.

In the terms used in #1298, the present metric's Obligatory resource attributes would be matched exactly to an OTel SDK pushing metrics, and it would include all Optional resource attributes. The act of joining alive and present metrics to compute up would yield, as a side effect, a set of Optional resource attributes that are logically applicable to the associated spans and metrics: the same attributes known as the __meta_ labels that are input to Prometheus relabeling.

@jsuereth
Contributor

jsuereth commented Feb 9, 2021

@jmacd Let me see if I can rephrase this design in josh-bullet-points:

  • We have a service discovery component
    • This component is an independent receiver that just finds Prometheus endpoints
    • This component outputs a "present" metric for each endpoint it discovers
    • This component has an obligatory set of "resource labels" it attaches to the endpoint/present metric
  • We have the prometheus receiver
    • This receiver is (mostly) unchanged from the existing receiver
    • This receiver starts pushing the alive metric from the proposal
  • We have the PRW exporter (or some processor) that joins these
    • We buffer alive/up metrics from the same resource
    • Labels are joined between alive/present
    • Up is synthesized as (kind of) alive && present

Is that the summary of the proposal?

Questions:

  1. Are you proposing a more general-purpose Service Discovery component that does more than Prometheus? Like a new general interface that can be wired in with "add-ons" to the collector?
  2. Do you expect the join to ONLY happen for PRW? IIUC, we would need both alive + present to flow through OTLP independently through any push-based collection distribution until we reach a "final" PRW.

@jmacd
Contributor Author

jmacd commented Feb 10, 2021

@jsuereth Yes! I'm going to try to restate an example without referring to "receivers" other than OTLP. Everything pushes OTLP in this example; the collector receives only OTLP from external producers.

Let's say the service discovery producer writes present metrics, equal to 1, with a resource that combines identifying and descriptive attributes from service discovery (using the terms "identifying" and "descriptive" in the sense of this parallel discussion):

# Identifying attributes
job: J
instance: I

# Descriptive attributes: signals an OpenMetrics endpoint
__scheme__: http
__address__: 1.2.3.4
__metrics_path__: /metrics

# Descriptive attributes: for the user or relabeler
__meta_kubernetes_node_name: vvv
__meta_kubernetes_node_label_L: xxx
__meta_kubernetes_node_labelpresent_L: true
__meta_kubernetes_node_annotation_L: yyy
__meta_kubernetes_node_annotationpresent_L: true
__meta_kubernetes_node_address_addrtype: zzz

An OpenMetrics "pusher" could subscribe to (a shard of) the present metric from service discovery, to know which targets it should scrape. It would scrape its targets and push OTLP containing the target's data with an additional alive metric equal to 1 for each of the successful targets. It would push alive equal to 0 data when a scrape fails.

In a successful example, the OpenMetrics component pushes alive equal to 1 with a resource containing additional attributes known only to the client library:

# Identifying attributes: applied by the scraper
job: J
instance: I

# Descriptive attributes: from the client library
os_name: linux
os_version: 3.2.1
telemetry_sdk_name: prometheus-golang
telemetry_sdk_version: 1.2.3

Both the service discovery and OpenMetrics components can be modeled as standalone producers of OTLP. Let's suppose that both of these metrics enter an OTel collector.

A collector stage can be defined that joins the present metric from service discovery with all OTLP metrics passing through. This means that every metric is extended with attributes from service discovery, not only alive. The output metrics combine resource attributes from the client library with resource attributes from service-discovery.

# Identifying attributes: subject of a natural join
job: J
instance: I

# Descriptive attributes: from the client library
os_name: linux
os_version: 3.2.1
telemetry_sdk_name: prometheus-golang
telemetry_sdk_version: 1.2.3

# Descriptive attributes: input to the relabeler
__meta_kubernetes_node_name: uuu
__meta_kubernetes_node_label_L: xxx
__meta_kubernetes_node_labelpresent_L: true
__meta_kubernetes_node_annotation_L: yyy
__meta_kubernetes_node_annotationpresent_L: true
__meta_kubernetes_node_address_addrtype: zzz

# Descriptive attributes: these are typically dropped after joining
__scheme__: http
__address__: 1.2.3.4
__metrics_path__: /metrics

The up metric can be calculated as simply present && alive in a PRW exporter. The up metric could also be computed in the collector and written as OTLP, since it is otherwise something the user will have to compute themselves. This definition appears to give semantic meaning to up for both push and pull metrics, provided we are able to define the necessary join operation in our data model. I believe that should be the topic of a separate issue.
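A minimal sketch of that join stage (illustrative types only, not a real collector processor), assuming job and instance are the identifying attributes being joined on:

```go
package joinsketch

// Resource is an illustrative stand-in for a set of resource attributes.
type Resource map[string]string

// identity is the join key: the identifying attributes shared by the
// service-discovery producer and the client library / scraper.
type identity struct{ job, instance string }

// keyOf derives the join key from a resource.
func keyOf(r Resource) identity { return identity{r["job"], r["instance"]} }

// computeUp performs a natural join on (job, instance): up is 1 only when a
// present point exists and the matching alive value is 1; anything else
// (missing, stale, or alive == 0) yields up = 0.
func computeUp(present map[identity]Resource, alive map[identity]float64) map[identity]float64 {
	up := make(map[identity]float64)
	for id := range present {
		if alive[id] == 1 {
			up[id] = 1
		} else {
			up[id] = 0
		}
	}
	return up
}

// mergeAttributes shows the side effect described above: the joined stream
// carries descriptive attributes from both service discovery and the
// client library (client attributes win on conflict here, arbitrarily).
func mergeAttributes(sd, client Resource) Resource {
	out := Resource{}
	for k, v := range sd {
		out[k] = v
	}
	for k, v := range client {
		out[k] = v
	}
	return out
}
```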

@jmacd
Contributor Author

jmacd commented Feb 10, 2021

Adding to the explanation above, I imagine the following block diagram:

[block diagram: OTelAliveAndPresent]

@jsuereth
Contributor

Ok, let me restate once again to verify my understanding of responsibilities:

Service Discovery

  • is responsible for looking up identifying attributes of a Resource.
  • Is responsible for identifying possible metric sources and generating a present metric for each

SDK entrypoints

  • Push metrics with a subset of identifying labels, but enough for a 100% accurate join w/ Resource
  • In a prometheus-world, will push an alive metric

Collector

  • Must join all identifying attributes and ensure ALL metrics get these. This requires some kind of stateful awareness and stickiness in the metric pipeline.

Implications / Questions

  • The same collector needs to both do service discovery and accept the metric input.
    • I think this works for pull-based metrics.
    • I'm not sure on push-based metrics. I think you called this "last-mile" before, but it effectively ties us to sticky-collectors per-workload in the last-mile.
  • Service Discovery->Resource mapping/understanding is a separate concern from collection of metrics
    • I do like this separation of concerns, and I can see how it could help with, e.g. a statsd integration
    • Do you think we can write a Service Discovery component for push-based metrics? It would need to know about incoming service metrics AND be able to fill out resource labels.

Is there a good way for us to move some of this discussion into a "proposal"/"design" document to write up sections, implications, and considerations? I'm really having trouble with GH issues and tracking all the things I want to ask or add :)

@fabxc

fabxc commented Feb 16, 2021

As far as semantics go it seems we are on the same page. But I still don't fully understand who would produce the alive metric.

It sounds like Prometheus scrapes would also start emitting an alive metric instead of directly producing the up metric?
That would again prevent alive and up from explicitly distinguishing between semantically different concepts.

I like the idea of a "service-discovery-exporter" (to put it in Prometheus terms) for the present metric. Is there a particularly strong motivation to handle the join during the collection pipeline? It seems that it is fairly complex to configure correctly and adds a lot of operational complexity, as Josh mentioned.

Assuming a Prometheus backend (just because that's what I'm most familiar with), I'd see no issue with pull-based metrics being accompanied by up metrics, and push-based metrics by an alive metric. present would universally exist for all of them.
A user could decide to create a recording rule to produce up from alive/present independently of collection. Doing it at collection time doesn't seem to have any benefits, except maybe for backends that don't have joins.
I could also see users just handling each explicitly, i.e., for alerts just have mostly equivalent alerts for up and alive metrics. It's not really much overhead.

@rakyll
Contributor

rakyll commented Feb 16, 2021

Is there a reason alive/present should be part of the metrics spec? It seems like it's a generally useful concept for aliveness, service discovery, and reporting resource attributes. Reporting alive/present as a metric is one of the use cases. For example, in the PRW exporter's case, it could be turned into an "up" metric, and PRW can also rely on this signal to identify staleness. But generally speaking, this topic is bigger than metrics cases, and maybe we need a "service discovery and aliveness spec" to tackle the problem.

Having said that, part of what alive/present can provide is already captured by service discovery, app directories, and health checks. Is it fair to reposition OpenTelemetry to capture these use cases at this point? What's the cost of rethinking this problem and making it part of the data model in the long term? For the sake of simplicity and orthogonality, is it possible to start with producing "up" at the PRW exporter and think about a more comprehensive solution in the long term?

@weyert

weyert commented Feb 17, 2021

I am not well versed in the metrics world, but I had to read this discussion three times before understanding the alive/present and up metrics. Personally, they all seem to mean the same thing to me: they indicate that a resource is available at a specific time.

@jsuereth
Contributor

We had a lot of discussion on this topic across a few SiGs. I'd like to call out a few points and what I think is the consensus.

  • Do we think the existing metric data model can be used to encode "up" metrics in Prometheus?
    • Yes, we think there is a mechanism by which we can ensure "up" sent through OpenTelemetry pipeline is semantically equivalent.
    • @jmacd has a proposal to encode "alive" and "present" metrics that would make sure OTEL API-instrumentation can also look more like Prometheus, in addition to how we'd handle Prometheus scrape => PRW behavior.
    • This issue is no longer a DataModel concern, and belongs as part of other workstreams.
  • Should OTEL solve "service discovery"?
    • There are compelling ideas in here that we should investigate. Particularly a means to improve on statsd and make it more OTEL friendly.
    • To Jaana's point: OTEL has not invested in this at all yet, should we do so?
    • This is a great topic for exploration, but unless we drum up enough "power" to back it, we should defer this design until other higher priorities are taken care of. This is a great topic for a 2.0 :)
  • Should OTEL create "semantic conventions" around "up" metrics and staleness tracking?
    • There's a use-case around "did I collect metrics @ point x", related to "transactionality" or "atomicity" goals in Prometheus, that likely needs guidance on push/pull "harmonization"
    • Callout: I think a document describing the use case of up metrics, staleness, and the user story around value would be useful here, as I don't think everyone sees the purpose right now.

Should we move this discussion to a "semantic convention" discussion given the above?

@jmacd
Contributor Author

jmacd commented Apr 14, 2021

About the terminology proposed above, "present" may not be the greatest term to describe which services are available. Other terms potentially:

"available": I believe that Google's Monarch uses this term.
"roster": The english language definition very closely matches our technical meaning.

@jmacd
Contributor Author

jmacd commented Apr 14, 2021

FYI Lightstep prototyped the push-based metric described here in this collector branch:

open-telemetry/opentelemetry-collector@main...lightstep:saladbar

We will continue this effort and share here. (CC: @paivagustavo)

@hdost

hdost commented Apr 15, 2021

About the terminology proposed above, "present" may not be the greatest term to describe which services are available. Other terms potentially:

"available": I believe that Google's Monarch uses this term.

"roster": The english language definition very closely matches our technical meaning.

If service discovery is determining part of this, then there are at least preliminary checks being done.
"available" certainly works.
FWIW I think "roster" might be more confusing than "present".

@jmacd
Contributor Author

jmacd commented Nov 15, 2021

Closing this in favor of whatever is decided in open-telemetry/oteps#185
