Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meter and instrument identity #2226

Closed
legendecas opened this issue Dec 16, 2021 · 28 comments · Fixed by #2317
Closed

Meter and instrument identity #2226

legendecas opened this issue Dec 16, 2021 · 28 comments · Fixed by #2317
Assignees
Labels
area:api Cross language API specification issue area:sdk Related to the SDK spec:metrics Related to the specification/metrics directory

Comments

@legendecas
Copy link
Member

What are you trying to achieve?

In the metrics API spec it is said for MeterProvider.getMeter:

Implementations MUST NOT require users to repeatedly obtain a Meter again with the same name+version+schema_url to pick up configuration changes. This can be achieved either by allowing to work with an outdated configuration or by ensuring that new configuration applies also to previously returned Meters.
https:/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#get-a-meter

It also suggests that:

It is unspecified whether or under which conditions the same or different Meter instances are returned from this function.

So, as long as SDK propagates the configurations to all meters created without another getMeter call, it should be spec-compliant for either returning the same or a new Meter instance for every name+version+schema_url pair.

However, for the instrument name in API spec, it is said:

Meter implementations MUST return an error when multiple Instruments are registered under the same Meter instance using the same name.
https:/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument

It is not clear when the SDK always returning a new Meter instance for every name+version+schema_url pair, if the instrument name should be resolved in the Meter instance, or it should be checked SDK-wide with the same meter name+version+schema_url? If it is resolved within a single Meter instance, users could create multiple meter instances with the same meter name+version+schema_url and create an instrument with the same name in these meters. In that case, I'm not sure how the unique instrument name constraints can effectively work, since the exporters may observe multiple same name instruments associated with identical instrumentation libraries.

@legendecas legendecas added the spec:metrics Related to the specification/metrics directory label Dec 16, 2021
@arminru arminru added the area:api Cross language API specification issue label Dec 21, 2021
@reyang reyang added this to the Metrics API/SDK Stable Release milestone Jan 11, 2022
@reyang reyang added the area:sdk Related to the SDK label Jan 11, 2022
@reyang
Copy link
Member

reyang commented Jan 11, 2022

It is not clear when the SDK always returning a new Meter instance for every name+version+schema_url pair, if the instrument name should be resolved in the Meter instance, or it should be checked SDK-wide with the same meter name+version+schema_url? If it is resolved within a single Meter instance, users could create multiple meter instances with the same meter name+version+schema_url and create an instrument with the same name in these meters. In that case, I'm not sure how the unique instrument name constraints can effectively work, since the exporters may observe multiple same name instruments associated with identical instrumentation libraries.

https:/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#view

Maybe we need to improve the wording of the SDK spec. Take an example:

Library X:
  Class A:
    Meter("Foo")
      Instrument("Bar")
  Class B:
    Meter("Foo")
      Instrument("Bar")
  • If the language implementation decided to return the SAME instance of Meter("Foo"), the 2nd instrument creation would fail according to the API spec.
  • If the language implementation decided to return DIFFERENT instances of Meter("Foo"), both instruments will get created and can be used to report measurements.
    However, the SDK will apply the View configuration (if the View is not explicitly specified, a default behavior would be given based on the instrument type), it will follow the steps from the SDK spec and drop the 2nd one (because "the name is already used by another View"), thus the exporter won't get any data from the 2nd one:
The SDK SHOULD use the following logic to determine how to process Measurements
made with an Instrument:

* Determine the `MeterProvider` which "owns" the Instrument.
* If the `MeterProvider` has no `View` registered, take the Instrument and apply
    the default configuration.
* If the `MeterProvider` has one or more `View`(s) registered:
  * For each View, if the Instrument could match the instrument selection
    criteria:
    * Try to apply the View configuration. If there is an error or a conflict
      (e.g. the View requires to export the metrics using a certain name, but
      the name is already used by another View), provide a way to let the user
      know (e.g. expose
      [self-diagnostics logs](../error-handling.md#self-diagnostics)).
  * If the Instrument could not match with any of the registered `View`(s), the
    SDK SHOULD provide a default behavior. The SDK SHOULD also provide a way for
    the user to turn off the default behavior via MeterProvider (which means the
    Instrument will be ignored when there is no match). Individual
    implementations can decide what the default behavior is, and how to turn the
    default behavior off.
* END.

@legendecas
Copy link
Member Author

@reyang it seems to me in the following case:

Library X:
  Class A:
    Meter("Foo")
      Instrument("Bar")
  Class B:
    Meter("Bar")
      Instrument("Bar")

The second instrument won't take effect in the "DIFFERENT instances" case since it is determined with view names? If that is true, it sounds to me that the behavior variation still exists between returning "SAME instance of Meter" and "DIFFERENT instances of Meter".

@reyang
Copy link
Member

reyang commented Jan 11, 2022

The second instrument won't take effect in the "DIFFERENT instances" case since it is determined with view names? If that is true, it sounds to me that the behavior variation still exists between returning "SAME instance of Meter" and "DIFFERENT instances of Meter".

I don't get the question. Would you elaborate more? Thanks!

@reyang reyang assigned reyang and unassigned jsuereth Jan 11, 2022
@legendecas
Copy link
Member Author

In the above (#2226 (comment)) explanations, it is said when the SDK decided to return DIFFERENT instances for meters, the name conflicts are resolved when MeterProvider registered Views. As Views are not Meter scope but MeterProvider scope, in the case how do we define the identity of a metric instrument when the meter has different names:

Library X:
  Class A:
    Meter("Foo")
      Instrument("Bar")
  Class B:
    Meter("Bar")             // <= a second meter
      Instrument("Bar")      // <= instrument with same name in Meter("Foo")

Should the Instrument("Bar") of Meter("Bar") be resolved within MeterProvider's scope of Views, or should it be resolved in Meter's scope?

@reyang
Copy link
Member

reyang commented Jan 12, 2022

I see! Thanks for the clarification!

This is an ongoing discussion which we haven't yet solved.
Please refer to #2017, #2035, open-telemetry/opentelemetry-dotnet#2634 and https:/open-telemetry/opentelemetry-dotnet/blob/18d73c59f021b1c7a9bd433ea703e6f2c22a0b90/examples/Console/TestPrometheusExporter.cs#L63-L71

My understanding is:

  • The API and SDK would allow such "duplicated metric names" to be delivered to the reader/exporter.
  • For OTLP exporter, it is not going to be a problem since OTLP supports a "meter" concept.
  • For Prometheus scenario, we should hammer out the data model (Initial draft of Prometheus <-> OTLP datamodel specification. #2017) and exporter spec. This is not done yet.

@jack-berg
Copy link
Member

@reyang you discussed the following example in your comment:

Library X:
  Class A:
    Meter("Foo")
      Instrument("Bar")
  Class B:
    Meter("Foo")
      Instrument("Bar")
  • If the language implementation decided to return the SAME instance of Meter("Foo"), the 2nd instrument creation would fail according to the API spec.
  • If the language implementation decided to return DIFFERENT instances of Meter("Foo"), both instruments will get created and can be used to report measurements.

I think we need to adjust the wording of the spec to relax the requirement that "2nd instrument creation would fail according to the API spec". The java SDK returns the same meter as long if the name, version, and schemaUrl match a previously registered meter. And it also allows the same instrument (same name, description, units, instrument type) to be registered from a meter multiple times, as long as the name, description, unit, and instrument type match. If you register an instrument that is in conflict with a previously registered instrument, we log a warning, and values recorded to the instrument are dropped.

This behavior seems acceptable for synchronous instruments. It seems like it could also be acceptable for asynchronous instruments if we decide that if multiple callbacks are registered, each is called.

@jack-berg
Copy link
Member

The Java SIG has been talking and its become clear that the API spec's current requirement that "Meter implementations MUST return an error when multiple Instruments are registered under the same Meter instance using the same name" is very detrimental to java's instrumentation.

It's common to ask for the same meter multiple times, and for the same instrument from that meter multiple times. Requiring that the API return an error in this situation reduces API usability, forcing users to think carefully about instrument initialization ordering details and requiring clunky code for correctness.

#2270 alleviates much of the concern, but there is still asymmetry between synchronous and asynchronous instruments. #2281 resolves the asymmetry, but with multiple callbacks, we find ourselves wanting / needing the ability to delete or unregister callbacks (#2232), else apps risk accumulating many callbacks without a way to release resources.

@anuraaga
Copy link
Contributor

Thanks @jack-berg. To add some insight, one way to think about this is that we have semantic conventions that will be shared among multiple components. For example, we have http.duration as a convention - if a user has multiple OkHttp HTTP clients, they will register instrumentation to each of those clients. It is the instrumentation that initializes instruments, naturally since users shouldn't have to implement semantic conventions themselves. But it's mostly impossible to ensure that the same convention has an instrument instantiated only once, unless the instrumentation itself keeps some state to make sure to return the same instrument - this just pushes the spec non-compliance from the SDK into instrumentation.

Anecdotally, I've worked with an internal metrics framework that did prevent duplicate registration and had a flag to suppress the exception - almost every app enabled the flag. It's probably for reasons similar to the above.

Sorry for not noticing the gap between the Java implementation and the spec until now - but this does seem to be an issue with the spec and now is probably a good time to fix it. Otherwise Java would probably launch with this non-compliance, expecting it to be fixed forward in a future release since we don't really have a good way of preventing duplicate registration right now.

@mateuszrzeszutek
Copy link
Member

... with multiple callbacks, we find ourselves wanting / needing the ability to delete or unregister callbacks (#2232), else apps risk accumulating many callbacks without a way to release resources.

I'd like to point out that this is still a valid concern even without multiple callbacks support - you can still create lots of async instruments that'll leak resources (although it is maybe a bit less plausible).

... the instrumentation itself keeps some state to make sure to return the same instrument - this just pushes the spec non-compliance from the SDK into instrumentation.

I was forced to do exactly that in my micrometer integration in the Java instrumentation repo - with the one-callback limitation in place it was impossible to use plain OTel metrics API to implement that. I believe that not being able to register multiple callbacks is going to block us from implementing further metrics instrumentations - for example, the database connection pool instrumentations that I recently created a spec PR for can't possibly happen without being able to register (and remove) multiple callbacks.

@jsuereth
Copy link
Contributor

As an FYI - when implementing the specification for the Java SDK (and when initially reviewing) I interpreted the following:

Meter implementations MUST return an error when multiple Instruments are
  registered under the same Meter instance using the same name.

As this:

Meter implementations MUST return an error when multiple Instrument types are
  registered under the same Meter instance using the same name.

Specifically, I though this segment of the API was to prevent obvious logic/typing errors in metrics for instrumentation authors. I had to read through this whole issue to realize I still consistently mis-read the spec as such.

Would it be feasible for us to just limit multiple instrument types (e.g. not allowing a counter + gauge to register with the same name) and consider this a 'bug fix' to the APi specification?

@anuraaga
Copy link
Contributor

anuraaga commented Jan 31, 2022

Would it be feasible for us to just limit multiple instrument types (e.g. not allowing a counter + gauge to register with the same name) and consider this a 'bug fix' to the APi specification?

This seems like a nice improvement to the current wording. But thinking about it, I wonder if even validation of types may be more harm than good. We have to keep in mind that the restriction only applies within a Meter. In that case, I suspect it will be extremely rare for instrumentation library authors to run into this situation because they will use either an instrumentation API provided by us, or assuredly will have some abstraction within their codebase to avoid even having a situation where they could call an instrument builder multiple times with different types. It's always possible, but I doubt the check is really aimed at these authors.

It is probably aimed more at app authors. In that case, failing with duplicate types seems to increase the attack surface of OTel's existing trap, the UpDownCounter vs Gauge. If one queue.size uses one or the other within the same app, when it compiles but fails to start I think users will be quite confused. I'm having trouble being sure if a statement like "UpDownCounter should always be used for queue.size" is even correct, there are so many different types of queues in the world.

Another sharp edge I can think of is there will be cases, possibly many, where a company's internal libraries don't care about the meter name because it's not used by their backend, and just use empty string, or maybe the company name or something. This isn't compliant with our spec, but they're internal and shouldn't need to be, they are just interested in getting the instrument to get some metrics. But such practices would increase the attack surface of the trap even further, if it's an internal library consumed by the app, the app will generally have no good solution to this error.

I have a feeling that such confusing situations will come up more often then the user being saved by the validation, though I am not sure of this. We need to keep in mind that many, probably most, users of our APIs, especially metrics since it's commonly used within app code as opposed to tracing more commonly consumed through instrumentation, will not be even aware of the opentelemetry specification let alone be happy if it constrains them too much.

@legendecas
Copy link
Member Author

legendecas commented Jan 31, 2022

FWIW, #2270 is adding resolutions to duplicated instruments, which is based on the type and unit of the instruments. Error is returned on the instrument registration with inconsistent type and unit.

@jack-berg
Copy link
Member

It is probably aimed more at app authors. In that case, failing with duplicate types seems to increase the attack surface of OTel's existing trap, the UpDownCounter vs Gauge. If one queue.size uses one or the other within the same app, when it compiles but fails to start I think users will be quite confused.

If a violation occurs we shouldn't we should return a no-op instrument and log it. Truly erroring when these situations occur would be an especially bad experience for apps which create new instruments throughout the app's lifecycle. E.g. the app is running fine for days but at some point the conditions are met for the creation of a new instrument and the app crashes - yikes.

Are you suggesting that we try to accommodate having a meter support instruments with the same name but different types? If so, I don't think this is a good idea. I think we ought to log an error and return a no-op instrument. I interpret logging an error as in compliance with the spec's wording of Meter implementations MUST return an error ....

@anuraaga
Copy link
Contributor

anuraaga commented Feb 1, 2022

Are you suggesting that we try to accommodate having a meter support instruments with the same name but different types? If so, I don't think this is a good idea. I

Yeah that is my idea - if attributes were also part of identity then I think the chance of problems goes down but I outlined in my comment the reasons I feel like any sort of dedupe seems like it will cause more problems then help for users in the end. The main issue for me is that a name, e.g. something plain like queue.size, doesn't seem to be intrinsically typed in every single case. Semantic conventions only help a little since user code will not nor should not need to comply with semantic conventions.

@jsuereth
Copy link
Contributor

jsuereth commented Feb 1, 2022

Are you suggesting that we try to accommodate having a meter support instruments with the same name but different types? If so, I don't think this is a good idea. I

Yeah that is my idea - if attributes were also part of identity then I think the chance of problems goes down but I outlined in my comment the reasons I feel like any sort of dedupe seems like it will cause more problems then help for users in the end. The main issue for me is that a name, e.g. something plain like queue.size, doesn't seem to be intrinsically typed in every single case. Semantic conventions only help a little since user code will not nor should not need to comply with semantic conventions.

I understand this sentiment, but it's really mostly valid in a typeless backend (like Prometheus). One of the key-tenants of OTel (and OpenMetrics) is typed metrics, as well as "default aggregations".

I know for a fact that this suggestion would break some backends (as well as OpenMetrics conventions). e.g. even InfluxDb one of the most flexible, if you have a schematized bucket this would be an error.

Additionally, we have an issue where 1 timeseries in OTEL != 1 timeseries in Prometheus (or other backends) e.g. Specifically, Histograms turn into a family of metrics in prometheus. Allowing the same metric name here could cause issues. I also suspect if we have type-conflicts, we can't really guarantee we won't also have label conflicts.

What you're suggesting is a (large) data model specification change. I think it would require a siginificant rewrite of the metric specification, starting from the data model.

I suspect it will be extremely rare for instrumentation library authors to run into this situation because they will use either an instrumentation API provided by us, or assuredly will have some abstraction within their codebase to avoid even having a situation where they could call an instrument builder multiple times with different types. It's always possible, but I doubt the check is really aimed at these authors.

You're right. We can mitigate the error message with auto-generated helper libraries. I'd argue that folks will still interact with the API so we shouldn't leave sharp edges for them even if we expect to provide nicer solutions over time.

Another sharp edge I can think of is there will be cases, possibly many, where a company's internal libraries don't care about the meter name because it's not used by their backend, and just use empty string, or maybe the company name or something. This isn't compliant with our spec, but they're internal and shouldn't need to be, they are just interested in getting the instrument to get some metrics. But such practices would increase the attack surface of the trap even further, if it's an internal library consumed by the app, the app will generally have no good solution to this error.

This is the primary reason why we want this error message. It's a sharp edge in our design. I'd love to find ways to reduce this friction, but I think a warning message and reporting as many metrics as possible is the best we can do right now in this situation. There are some scenarios we need to treat as a "bug", and I think our design and documentation should highly encourage others to pull a new Meter (or allow a Meter to be passed in) for their libraries. We've sort of built the API around this notion.

In fact, if we got rid of the importance of Meter, we'd still have timeseries overlap as a sharp-edge. We wouldn't be able to get rid of the warnings, we'd just see them more often. Right now, we've included Meter as part of the metric identity which allows as at least SOME way to avoid conflicts across groups in a large organization. If we get rid of this, I think it just makes the situation worse.

@anuraaga
Copy link
Contributor

anuraaga commented Feb 1, 2022

Thanks @jsuereth - framing it as a sharp edge in the data model itself makes sense to me and agree that there isn't much we can do at a higher level like the SDK to work around that.

@jmacd
Copy link
Contributor

jmacd commented Feb 2, 2022

@anuraaga

Yeah that is my idea - if attributes were also part of identity then I think the chance of problems goes down but I outlined in my comment the reasons I feel like any sort of dedupe seems like it will cause more problems then help for users in the end.

The setup for this scenario is a large code base with existing instrumentation APIs that the owner wishes to adapt to use OTel. Let's say the instrument is one of these ambiguous Prometheus-style Gauges that could be up to three different instruments to the user unfamiliar with OTel: it has Set(), Add(), and Sub() methods, and there are legitimately parts of the large code base where queue sizes are measured using Add/Sub and other parts where queue sizes are legitimately observed using Set().

So without a discriminating instrumentation library in this scenario, let me suggest using one instrumentation library per kind of instrument in your code base. E.g., "mycompany.com#counters", "mycompany.com#gauges", "mycompany.com#histograms", and "mycompany.com#updowncounters". This seems like no harm done, provided the receiver of the OTLP is willing to recognize instrumentation library as a distinguishing feature.

I want to point out that I support a mixture of temporality for the same instrumentation library name and metric name across a system. I.e., OK to have synchronous Counters here and asynchronous Counters there, because these are semantically compatible and can be changed either by exporter configuration (e.g., expressing a temporality preference for performance reasons) or by re-instrumentation (e.g., replacing synchronous Counters with asynchronous Counters). However, in this case, I would expect to see Resource attributes--part of the stream's identity--to serve as a distinguishing feature.

I do not like the idea of having attributes or resource attributes be identifying in this way. If pressed to do something more accommodating, which is what I feel you're doing, I'd rather allow the instrument kind to be considered part of the identity. If you happen to report a Counter and a Gauge with the same name, that would be fine I suppose, as long as they're defined as having distinct metric identities.

@jmacd
Copy link
Contributor

jmacd commented Feb 2, 2022

Exploring the compromise I hinted at above, I see a solution where we tolerate duplicate-conflicting instruments.

I'd rather allow the instrument kind to be considered part of the identity

I'm not suggesting we change OTLP or the data model. This is a compromise to make the SDK simpler and "pass along" the ambiguity when duplicate-conflicting instruments are registered.

For SDKs this is a simple change, just include the instrument kind in whatever identity map you have. There's no need to return a "registration error" from the API, as a result. Does this still deserve a warning? Yes, because it will generate errors for certain metric-type-aware vendors.

I would continue to say that users SHOULD NOT re-use instrument names for the different-semantic-kind instruments. This is not good, we're just tolerating it within the SDK specification.

I would not recommend consumers of OTLP change in response to this, unless it is to pass through the same. For example in an OTel Collector metrics processor stage, where an identity map is being built for some purpose, it will be natural to just pass-through the ambiguity by recognizing them as distinct metrics.

Speaking for my employer, for example, Lightstep will experience an error (*) if the sender writes a different type of metric stream than its type on the first arrival. Errors will result when the SDK "passes along" the ambiguity described above to Lightstep. We want metrics to have types for a lot of reasons, but the one that matters most to users is that our query builder is type-aware. We build different sorts of queries about counters as we do about gauges, and we believe the user is helped by having this type awareness build in to the product.

There's one exception to the Lightstep example above, we will not complain if you switch the type between UpDownCounter and Gauge. So, you may ask why not? It's because we're not recognizing the distinction; however, this decision has created problems in our query builder that would be fixed if we'd begin to recognize the distinction. (Here it is: (1) When your Gauge query matches multiple timeseries and you need an action to reduce those series into one series, the user is best served to Average those series by default; (2) When your UpDownCounter query matches multiple timeseries and you need an action to reduce those series into one series, the user is best served to Sum those series by default. By not recognizing the distinction, we were forced to choose one default for the two underlying instruments. We chose to Sum timeseries by default for the UpDownCounter/Gauge type, and well, that's not always the best for the uyser.

@jmacd
Copy link
Contributor

jmacd commented Feb 2, 2022

(*) I wrote "experience an error" because this condition is difficult to return to the user so that they will see it. We don't want to fail all the metrics that arrived in the same batch, and we don't want to signal anything that looks retry-able. So, we return gRPC or HTTP response trailers to tell the user about this sort of conflict and only the original-type metric stream will continue writing.

@anuraaga
Copy link
Contributor

anuraaga commented Feb 2, 2022

@jmacd Your proposal sounds good to me, as your example shows the type + name identity may be important for certain backends but may not be - logging a warning to the user but still passing through the ambiguity instead of explicitly failing or forcing a no-op will still allow users some flexibility / time if they do have issues or know their backend won't be affected.

@jack-berg
Copy link
Member

To confirm my understanding, the proposal is to allow creating instruments with the same name and different types from the same meter. If this occurs, the resulting exported metrics would contain metrics with the same name and different types, and its up to the backend to figure out with them.

For example, the following code would be allowed:

meter.counterBuilder("foo").build().add(10);
meter.counterBuilder("foo").buildWithCallback(measurement -> measurement.record(10));
meter.gaugeBuilder("foo").buildWithCallback(measurement -> measurement.record(10));

And on export would produce something like:

{name: "foo", type: "sum", dataPoints: [{value: 10}]} // from the sync counter "foo"
{name: "foo", type: "sum", dataPoints: [{value: 10}]} // from the async counter "foo"
{name: "foo", type: "gauge", dataPoints: [{value: 10}]} // from the gauge "foo"

Is that an accurate representation? If so, what are the implications on prometheus export?

@jmacd
Copy link
Contributor

jmacd commented Feb 2, 2022

If so, what are the implications on prometheus export?

TL;DR Lengthy response in support of my own argument based on experimenting with Prometheus.

First, a Prometheus library (Golang) returns error if you try to register the same instrument name, regardless of identical instrument kind or not. If you're using MustRegister() you'll panic, if you're not you get an error and the instrument appears to become a no-op.

Next, I produced a fake scrape target w/ duplicates-- a counter and a gauge with the same name.

# TYPE test counter
test{label="test1"} 6

# TYPE test gauge
test{label="test1"} 1 

I tried this with and without using the same label set, but in both cases Prometheus doesn't care. I tested with both Prometheus v2.23 and v2.33. It appears to always choose the first value to appear in the scrape. No warnings were emitted on the console.

Historically, the internal representation of Prometheus doesn't depend on instrument metadata such as this, as far as I know, so whether you report a counter or a gauge you end up with a timeseries of numbers. This is the sense in which @anuraaga's request above ("use labels for identity") is realistic: as long as the user uses different labels on the different same-name instruments, the data is preserved and can be used--technically the single-writer rule is not being broken. However, this seems to introduce confusion that could've been prevented earlier (e.g., with a registration error, as we're discussing). We may want to ask Prometheus developers for more context. Like the current OTel data model specification, the OpenMetrics specification says that metric family names MUST be unique in a metric set.

I am aware of optional support in Prometheus to include instrument metadata in the PRW data stream (see, e.g., prometheus/prometheus#6815, prometheus/prometheus#7771); I am not sure what would happen with that support turned on, but it appears that the tide is shifting on this topic. The reason Prometheus is adding metadata to PRW is because vendors are demanding it, which is because to build a scalable Prometheus data store you some times want to automatically remove labels, and when you're automatically removing labels we get back to one of OpenTelemetry Metric's tenets: there's a natural merge function that should be applied and if you don't know Counter vs. Gauge you can't automatically merge timeseries.

@jsuereth
Copy link
Contributor

jsuereth commented Feb 2, 2022

I tried this with and without using the same label set, but in both cases Prometheus doesn't care. I tested with both Prometheus v2.23 and v2.33. It appears to always choose the first value to appear in the scrape. No warnings were emitted on the console.

Did you do this via the client library or via the prometheus backend? The prometheus backend is untyped, so effectively the type label is just ignored, I'd be surprised if it had failed to scrape. However, at least in Java, there's SOME restrictions in the prometheus client library, e.g. you can't expose the same metric name with different label sets for the whole process/endpoint.

When it comes to types in Prometheus, given the direction OpenMetrics is going, I'd like for us to prefer preserving types and having a "strict" mode that abides by best practices. However, we also need to deal with the real world and give users every out we can.

I also think (or assume?) we designed the perfect solution to the problem of different typed instruments: Views.

Specifically, my straw man scenario:

  • As @anuraaga suggests, we have a large company registering instruments to the "default" Meter (or unnamed).
    • One team is exposing a latency metric as a Gauge.
    • One team is exposing a latency metric as a Histogram.
  • By DEFAULT the API should not fail in this scenario. The end user may belong to neither of the teams in conflict and jus wants to get their job done.
    • (Option 1) We report an error on ONE of the latency metrics and report the other.
    • (Option 2) We send both metrics to Exporters and force them to deal with Data-Model violations coming form the SDK (we had taken a hard stance of attempting to disallow these from reaching Exporters to simplify writing an Exporter).
      Note: I'd actually like to expose both option 1 and 2 via a "strict" configuration enforcement of our datamodel.
  • Users can define a View which hijacks, by instrument type, one of the offending metrics and brings it into either a compatible type or a new namespace, leading to no error messages or conflicts.

the original Java view implementation was designed with the flexibility to allow views to resolve metric-name conflicts for THIS EXACT SCENARIO. I think that may have been undone (cc @jack-berg for confirmation) to abide by the strict definition in the specification.

EDIT:

Additionally I think we COULD even specify an error message that explains to users HOW to register a view to remove the conflict. E.g.

WARNING: Found duplicate metric definition: latency
Original defined as Histogram at MySourceFile.java:123
Also defined as Gauage at TheOtherSourceFile.java:456

You can remove this conflict by renaming one of the metrics with a view, e.g.

  sdkMeterProviderBuilder.registerView(
      InstrumentSelector.builder().setInstrumentType(InstrumentType.GAUGE).setName("latency").build(),
        View.builder()
            .setName("other_latency")
            .build());

@jmacd
Copy link
Contributor

jmacd commented Feb 2, 2022

Thank you @jsuereth -- especially for connecting this problem with a Views solution. (Nice suggestion that we print a Views configuration to fix!)

This suggests I should close #2270 and #2281 and work on combining these revisions with a new position.

Can we separate the notion of a "strict mode" from this problem, specifically? I can imagine other forms of strictness, and we might be able to agree on a general statement in the library guidelines. Then we'd write something on this topic in the API spec like duplicate instrument registration with different instrument kind (including synchronous vs. asynchronous) MUST be permitted, except in strict mode where duplicate-conflicting registration is considered a runtime error.

For the SDK spec, the implementation SHOULD print instructions on how to resolve duplicate-conflicting registration errors by printing an example View configuration that selects on instrument type.

For an OTel-Go implementation, e.g., strict mode would even prevent registration of both int64 and float64 instruments of the same name and type.

@jack-berg
Copy link
Member

This proposal makes sense to me. Thanks @jmacd and @jsuereth for the additional context.

There's one scenario I still have a question about. Suppose a user asks a meter for the same instrument type, and same name, but a different unit or description:

meter.counterBuilder("foo").setDescription("first").build().add(10);
meter.counterBuilder("foo").setDescription("second").build().add(10);

The SDK could extend the proposal to propagate ambiguity to the exporter as follows:

{name: "foo", type: "sum", description: "first", dataPoints: [{value: 10}]} // from the counter "foo" w/ description "first"
{name: "foo", type: "sum", description: "second", dataPoints: [{value: 10}]} // from the counter "foo" w/ description "second"

Or it could attempt to merge and resolve the ambiguity before export, for example by taking the first unit / description registered:

{name: "foo", type: "sum", description: "first", dataPoints: [{value: 20}]} // aggregate of counter "foo" with descriptions "first" and "second"

@jsuereth
Copy link
Contributor

jsuereth commented Feb 3, 2022

@jack-berg that's a good call out. I'm a lot more biased against failing for different units and descriptions. I think the suggestion of "first one wins" is good. We could also do "first non-default wins, or use the default", however that's a bit riskier.

@jack-berg
Copy link
Member

First one wins makes most sense to me because any other approach implies that a metric's description / unit is subject to being updated from one export to another, which feels wrong. If a user wants a certain description and unit, they can simply take care to make sure that all calls to create the instrument include the same description / unit.

@jmacd
Copy link
Contributor

jmacd commented Feb 4, 2022

@jack-berg and @jsuereth about the unit and description field.

I don't want to require the SDK to handle unit conversion, so different units have to be treated as different streams.

If a user provides two descriptions, it's easy to make an exception and say "first wins" for description. It's probably even easier to say that all fields are identifying, so that two descriptions means two streams with a warning about a conflict. If the user is really trying to register the identical metric stream, it should be easy for them to supply the identical description.

If this happens and a user configures a Prometheus exporter, they'd SHOULD be using distinct attributes, otherwise should work fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue area:sdk Related to the SDK spec:metrics Related to the specification/metrics directory
Projects
None yet
8 participants