
MetricReader and Multi-export features #3578

Merged: 35 commits merged into open-telemetry:main from wip-multi-exporters, Sep 30, 2021

Conversation

@jsuereth (Contributor) commented Sep 5, 2021

Adds MetricReader interface, and expands Java SDK to allow multiple exporters to be registered.

  • Implements multi-export for SYNCHRONOUS INSTRUMENTS. ASYNC instruments work as-is but are currently broken for DELTA aggregation (in a trivial-to-reproduce way).
  • Adds bare-minimum tests or evolves existing tests.
  • Implements the following exporters:
    • OTLP (as-is)
    • InMemory (shifted to use MetricReader)
    • Prometheus (shifted to use MetricReader)
    • PeriodicMetricReader (shifted from IntervalMetricReader + adapted to the MetricReader interface)
  • Updates autoconfigure to use the new style of registration (albeit this can be improved).

Implementation notes

The key interfaces are as follows:

interface MetricProducer {
   Collection<MetricData> collectAllMetrics();
}
interface MetricReader {
    CompletableResultCode flush();
    CompletableResultCode shutdown();
}
interface MetricReaderFactory {
      MetricReader apply(MetricProducer producer);
}
interface MetricExporter {
   CompletableResultCode export(Collection<MetricData> batch);
   CompletableResultCode flush();
   CompletableResultCode shutdown();
}
interface SdkMeterProviderBuilder {
   SdkMeterProviderBuilder registerMetricReader(MetricReaderFactory readerFactory);
}

Effectively this accomplishes the following:

  • Shutdown behavior, similar to TracerProvider, is available directly on the MeterProvider.
  • SdkMeterProvider knows (with 100% certainty) how many readers there are and can influence their lifecycle (via flush/shutdown). A registration sketch follows this list.
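
A minimal sketch of how these interfaces compose, assuming hypothetical names (LoggingReader, MultiReaderExample) and that the construction APIs match the interfaces listed above; the merged code may differ in detail:

import java.util.Collection;
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.data.MetricData;

// Hypothetical reader: "exports" by printing how many points it collected.
// (MetricReader/MetricProducer imports omitted; package locations follow this PR.)
final class LoggingReader implements MetricReader {
  private final MetricProducer producer;

  LoggingReader(MetricProducer producer) {
    this.producer = producer;
  }

  @Override
  public CompletableResultCode flush() {
    Collection<MetricData> batch = producer.collectAllMetrics();
    System.out.println("collected " + batch.size() + " metric points");
    return CompletableResultCode.ofSuccess();
  }

  @Override
  public CompletableResultCode shutdown() {
    return flush();
  }
}

class MultiReaderExample {
  public static void main(String[] args) {
    SdkMeterProvider provider =
        SdkMeterProvider.builder()
            // MetricReaderFactory is just MetricProducer -> MetricReader, so a
            // constructor reference works; each reader gets its own producer lease.
            .registerMetricReader(LoggingReader::new)
            .registerMetricReader(LoggingReader::new)
            .build();
    provider.forceFlush();
    provider.shutdown();
  }
}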

Tracking metric collection

We introduced CollectionHandle, which provides bit-set functionality (and integer equality) for tracking metric readers and resetting mutable interval-based state per metric reader. Effectively every metric reader is given a CollectionHandle that is a bit value within the integer range (limiting us to 32 metric readers per SDK).
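
A minimal sketch of the bit-set idea, with hypothetical names (Handle, HandleSet); the PR's actual CollectionHandle differs in detail:

// Each reader is assigned one bit of an int, so an SDK supports at most 32 readers.
final class Handle {
  final int bit;

  Handle(int index) {
    if (index < 0 || index >= 32) {
      throw new IllegalStateException("more than 32 metric readers registered");
    }
    this.bit = 1 << index;
  }
}

// The set of all registered readers is just the union (OR) of their bits.
final class HandleSet {
  private int mask;

  void add(Handle handle) {
    mask |= handle.bit;
  }

  // True when a delta measurement's read-mask covers every registered reader,
  // meaning the measurement can be garbage-collected.
  boolean coveredBy(int readMask) {
    return (readMask & mask) == mask;
  }
}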

SynchronousMetricStorage

  • DeltaMeasurementStorage
    • We have a mechanism (on every collection call) to grab current delta measurements from synchronous instruments.
    • Every measurement remembers which collection handles have read the measurement
    • This storage is automatically cleaned once a measurement has been read by all handles.
  • CumulativeMetricStorage
    • We have another mechanism to remember the last reported metric point per-collector

The basic algorithm is (a code sketch follows the list):

  • Go grab recent delta measurements.
  • Merge all unread delta measurements for this collector
  • Merge delta-measurement w/ previous cumulative, if reporting cumulative.
  • Perform garbage collection
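
A simplified sketch of this per-reader collection path, with hypothetical types (a point is just a double); the PR's SynchronousMetricStorage is more involved:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SyncStorageSketch {
  private static final class DeltaPoint {
    final double value;
    int readMask; // bits of the readers that have already consumed this delta

    DeltaPoint(double value) {
      this.value = value;
    }
  }

  private final List<DeltaPoint> unreportedDeltas = new ArrayList<>();
  private final Map<Integer, Double> lastCumulative = new HashMap<>();

  synchronized double collect(
      int readerBit, int allReadersMask, boolean cumulative, double currentDelta) {
    // 1. Grab the delta recorded since the last collection (by any reader).
    unreportedDeltas.add(new DeltaPoint(currentDelta));

    // 2. Merge every delta this reader has not read yet.
    double merged = 0;
    for (DeltaPoint point : unreportedDeltas) {
      if ((point.readMask & readerBit) == 0) {
        merged += point.value;
        point.readMask |= readerBit;
      }
    }

    // 3. For CUMULATIVE temporality, fold into the last value reported to this reader.
    if (cumulative) {
      merged += lastCumulative.getOrDefault(readerBit, 0.0);
      lastCumulative.put(readerBit, merged);
    }

    // 4. Garbage-collect deltas that every registered reader has now read.
    unreportedDeltas.removeIf(point -> (point.readMask & allReadersMask) == allReadersMask);
    return merged;
  }
}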

AsynchronousMetricStorage

This will be updated to track previous values for a "mergeDiff" of Sum aggregations (or to log errors for DELTA temporality, depending on the specification).
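
A hypothetical sketch of the "mergeDiff" idea: asynchronous callbacks report cumulative values, so a per-reader delta is derived by subtracting the previously observed value. Names are illustrative only:

import java.util.HashMap;
import java.util.Map;

final class AsyncDeltaSketch {
  private final Map<Integer, Double> lastObserved = new HashMap<>();

  // readerBit identifies the reader; observedCumulative is the callback's latest value.
  double diffForReader(int readerBit, double observedCumulative) {
    double previous = lastObserved.getOrDefault(readerBit, 0.0);
    lastObserved.put(readerBit, observedCumulative);
    // A negative diff indicates a reset (or a condition to log, depending on the spec).
    return observedCumulative - previous;
  }
}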

if (exportInterval != null) {
  readerBuilder.setExportIntervalMillis(exportInterval.toMillis());
if (exportInterval == null) {
  // TODO: What default for reading?
Member:

In the current spec PR - "The default value is 5000 (milliseconds)."

Contributor Author:

Wow, that's quite.... fast. The default in OpenCensus was 1 minute. Should we slow it down a bit?

Member:

I don't have a strong opinion; I got the value (5000ms) from the tracing SDK spec for consistency (and I think consistency here is not something super important).

Note the difference: for the span processor, the default 5000ms is the delay between two consecutive exports, so the actual interval is something like 5000ms + exporting time. In metrics we are taking a different approach by using a fixed schedule (so the delay is 5000ms - exporting time).
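
A minimal sketch of the two scheduling strategies, assuming a hypothetical export task; this is illustration only, not the SDK's scheduling code:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
Runnable export = () -> { /* collect and send one batch (hypothetical) */ };

// Span-processor style: 5000ms of delay *between* exports, so the effective
// interval is roughly 5000ms + export duration.
scheduler.scheduleWithFixedDelay(export, 5000, 5000, TimeUnit.MILLISECONDS);

// Metrics style (fixed schedule): an export starts every 5000ms, so the idle gap
// is roughly 5000ms - export duration.
scheduler.scheduleAtFixedRate(export, 5000, 5000, TimeUnit.MILLISECONDS);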

Contributor Author:

I think there's also a huge difference in batch size + aggregation between metrics + traces.

Traces are known to be high-frequency + highly sampled. For metrics, we're leveraging aggregation to limit bandwidth, so the export interval should be different. Given what I know of our customers' + vendors' requests around metrics, I think a default of 1 minute should accommodate most users (and be too fast for some).

Member:

I know that for Microsoft most of the systems use either 20 seconds or 1 minute so I'm totally fine with 1 min.

Member:

I've updated the PR to change it to 60 seconds open-telemetry/opentelemetry-specification@68f4fdc.

Contributor:

Hm. Having to wait a full minute, on top of ingest time for the backend is a very long time to wait for metrics to appear on new deployments. What's the harm in having it be faster?

Contributor Author:

In OpenCensus there was the ability to have a different delay for when the first metrics are exported versus the average time between each export.

My thinking on this is as follows:

  • For short-lived processes, the shutdown hook + export is crucial, and one of the benefits of push-metrics.
  • For most other metrics, one of the reasons to use metrics OVER logs or traces is because you can greatly shrink the sheer size of the data. If we report too frequently we put more pressure on backends to collapse the data or risk being very expensive for users.
  • For scenarios where we're serving HTTP traffic, if we send too frequently, we're unlikely to have explored/opened all metric-streams within the reporting interval. This actually has memory/gc implications for the implementation.
  • For initial-startup "is it working" scenarios, it probably makes sense to default to a quick initial fire of metrics.

TL;DR:

  • Our default should balance "Most metric streams have a point recorded during this time", "Users don't run up a huge bill on storage", "Users can see changes live", and "Nice for debugging".
  • I think the "is it working on startup" use case is enough to shorten the time-to-initial-report, but not the ongoing reporting interval. There are some other nice benefits we may be able to get here, like restart-loop detection. However, I think the cost of reporting too frequently is a real thing. Users should opt in if they want higher resolution.

Contributor:

It would be really cool to have a dynamic system that would send metrics more frequently during unusual circumstances (like startup), but I have no idea how one might do that. :)

… SdkMeterProvider

- InMemoryMetricReader is now a factory + reader
- Hide register on SdkMeterProvider
- Update all tests to make sure things pass.
…. Update all code around registration of readers and autoconfigure. Create new 'sdk noop' meter provider for when we have no readers configured.
@jsuereth (Contributor Author):

@jkwatson @anuraaga This is ready for an "API" evaluation. Still some known issues in the implementation I'm working through, but would love thoughts on the API before I dive too much further.

@jsuereth changed the title from "DRAFT - Working multi-exporter + MetricReader to evaluate spec proposal" to "DRAFT - MetricReader and Multi-export features" on Sep 18, 2021
private final MetricProducer metricProducer;
public final class PrometheusCollector extends Collector
    implements MetricReader, MetricReader.Factory {
  private MetricProducer metricProducer;
Contributor:

volatile just to be sure?

Contributor Author:

Shouldn't be necessary as this will be filled out before the class is registered w/ prometheus.

We can add it, or I can document the reason it's ok not to be volatile here.

Contributor Author:

Added documentation on why, PTAL and let me know if you'd still like volatile here.

Contributor:

Hm. I read your description and I'm not convinced. :) The JMM is a tricky beast and visibility of non-final, non-volatiles is likely to trip one up at the worst possible time (in production). Is "shared with other threads" a concrete JMM thing for non-final, non-volatiles?

Contributor:

Can we just have PrometheusCollector.createFactory() which returns a MetricReaderFactory, not MetricReader? The factory can initialize the MetricReader with a concrete reference without needing to make the MetricReader mutable, I think. createFactory could just return PrometheusCollector::new presumably.
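
A sketch of the suggested shape, with hypothetical structure and a hypothetical convert() helper (imports omitted); this is not the merged code:

public final class PrometheusCollector extends Collector implements MetricReader {
  private final MetricProducer metricProducer;

  private PrometheusCollector(MetricProducer metricProducer) {
    this.metricProducer = metricProducer;
  }

  // The factory hands the producer to the constructor, so the field can stay
  // final and there is no JMM visibility question.
  public static MetricReaderFactory createFactory() {
    return PrometheusCollector::new;
  }

  @Override
  public List<MetricFamilySamples> collect() {
    // convert() is a hypothetical helper turning MetricData into Prometheus samples.
    return convert(metricProducer.collectAllMetrics());
  }

  @Override
  public CompletableResultCode flush() {
    return CompletableResultCode.ofSuccess();
  }

  @Override
  public CompletableResultCode shutdown() {
    return CompletableResultCode.ofSuccess();
  }
}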

Comment on lines 147 to 148
// Register the reader (which starts it), and use a global shutdown hook rather than
// interval reader specific.
Contributor:

Registering the reader with the builder doesn't start it. It doesn't get started until the SdkMeterProvider is actually built, correct?

Contributor Author:

Yeah, I'll update the comments.

Comment on lines 159 to 160
// TODO: Can we move this portion into the PrometheusCollector so shutdown on SdkMeterProvider
// will collapse prometheus too?
Contributor:

Not the config reading part, but the server creation/startup makes sense to move there, yes.

import javax.annotation.Nullable;

/**
* {@code SdkMeterProvider} implementation for {@link MeterProvider}.
Contributor:

nit: MeterProvider isn't imported here, so the javadoc doesn't resolve the link properly.

Comment on lines 66 to 73
Supplier<CollectionHandle> handleSupplier = CollectionHandle.createSupplier();
for (MetricReader.Factory readerFactory : readerFactories) {
  CollectionHandle handle = handleSupplier.get();
  // TODO: handle failure in creation or just crash?
  MetricReader reader = readerFactory.apply(new LeasedMetricProducer(handle));
  collectors.add(handle);
  readers.add(reader);
}
Contributor:

I don't much like doing all this work in the constructor. Can we move this stuff to an init method of some sort to be called after construction is complete?

Contributor Author:

Would this be called immediately after construction always anyway?

If we want immutable / final references to this config, it needs to be in the constructor. I moved it this way to avoid having a "register" method be publicly exposed that could dynamically adjust configuration (and require locks). We can shift this around, but having another method that mutates state feels just as bad as exposing "register" to me.


@Override
public CompletableResultCode shutdown() {
  // TODO - prevent multiple calls.
Contributor:

or just ignore subsequent ones, returning the appropriate result code.

Contributor Author:

Right, do we have an example of that impl I can copy-paste?

Contributor:

heh. I looked through the code and found a variety of approaches, none of which look great. It will probably involve saving off the CompletableResultCode somewhere, and just returning it if it's already there.
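
One possible sketch of that approach (hypothetical, not the merged implementation): save the CompletableResultCode from the first call and hand it back on every later call.

import io.opentelemetry.sdk.common.CompletableResultCode;

final class ShutdownOnce {
  private final Object lock = new Object();
  private CompletableResultCode shutdownResult; // set on the first shutdown() call

  CompletableResultCode shutdown() {
    synchronized (lock) {
      if (shutdownResult == null) {
        shutdownResult = doShutdown(); // hypothetical: the real shutdown work
      }
      return shutdownResult; // subsequent calls get the saved result
    }
  }

  private CompletableResultCode doShutdown() {
    return CompletableResultCode.ofSuccess();
  }
}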

@jsuereth jsuereth marked this pull request as ready for review September 24, 2021 15:13
@jsuereth jsuereth requested a review from a user September 24, 2021 15:13
@jsuereth changed the title from "DRAFT - MetricReader and Multi-export features" to "MetricReader and Multi-export features" on Sep 24, 2021
@jsuereth (Contributor Author):

Not sure how to solve the codecov issues; it seems like it's not seeing much coverage on existing exporters + autoconfigure.

}

/** Our implementation of the metric reader factory. */
private static class Factory implements MetricReaderFactory {
Contributor:

Not a big deal, but it looks like this can be a singleton.

Contributor Author:

There's a refactoring (I can do in this PR or later) where we instantiate the HTTP server as part of the factory (configurable). Would you like that in this PR? I'll add a note in the meantime.

@@ -21,20 +34,45 @@

@Override
public List<MetricFamilySamples> collect() {
  if (metricProducer == null) {
Contributor:

Can remove this and add requireNonNull in the constructor instead (construction of SDK classes can generally throw)

Contributor Author:

Ah, this shouldn't remain after the refactoring anyway. It "can't" be null if all our @Nullable annotations are accurate.

private final ComponentRegistry<SdkMeter> registry;
private final MeterProviderSharedState sharedState;

// These are *only* mutated in our constructor, and safe to use concurrently after construction.
Contributor:

We can model this in code better by initializing the variables in the collector as well

Contributor Author:

Fixed


private static final Logger LOGGER = Logger.getLogger(DefaultSdkMeterProvider.class.getName());
static final String DEFAULT_METER_NAME = "unknown";
private final ComponentRegistry<SdkMeter> registry;
Contributor:

Suggested change
private final ComponentRegistry<SdkMeter> registry;
private final ComponentRegistry<SdkMeter> registry;

Collection<SdkMeter> meters = registry.getComponents();
List<MetricData> result = new ArrayList<>(meters.size());
for (SdkMeter meter : meters) {
  result.addAll(meter.collectAll(handle, collectors, sharedState.getClock().now()));
Contributor:

Yeah

* </code></pre>
*/
public class InMemoryMetricReader implements MetricReader, MetricReaderFactory {
// Note: we expect the `apply` method of `MetricReaderFactory` to be called
Contributor:

Even for testing, I think it's worth sticking to the same pattern we did for Prometheus (here it really should just be returning ::new :)

Contributor Author:

We can't; you lose access to the instantiated class if you do so, which means you never have access to collect metrics.

This is designed so you can directly call "collect" and extract metrics. (also, should this class be in the metric-testing artifact?)
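
A sketch of the intended test usage, with hypothetical method names where the alpha API of the time may differ; the point is that the reader doubles as its own factory, so the test keeps a reference and pulls metrics on demand:

InMemoryMetricReader reader = new InMemoryMetricReader();
SdkMeterProvider provider =
    SdkMeterProvider.builder()
        .registerMetricReader(reader) // the reader is also the MetricReaderFactory
        .build();

// ... record some measurements through provider.get("test") ...

Collection<MetricData> metrics = reader.collectAllMetrics(); // assert on these directly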

* <p>All synchronous handles will be collected + reset during this method. Additionally cleanup
* related stale concurrent-map handles will occur. Any {@code null} measurements are ignored.
*/
@GuardedBy("this")
Contributor:

I think this is supposed to be on fields, not on methods (and errorprone will ensure the marked fields are only accessed under said lock)
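
A small illustration of the field-level usage described here (hypothetical names; not code from this PR):

import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.ArrayList;
import java.util.List;

final class GuardedExample {
  private final Object lock = new Object();

  @GuardedBy("lock")
  private final List<Double> pending = new ArrayList<>();

  void record(double value) {
    // Error Prone flags any access to 'pending' that is not under 'lock'.
    synchronized (lock) {
      pending.add(value);
    }
  }
}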

Contributor Author:

This did work for me in VisualStudio, but I can remove if you'd like.

@anuraaga (Contributor) commented Sep 27, 2021:

Working means that if a non-synchronized method accesses the ArrayList without taking the lock, it fails the build. Is vscode really able to understand that with this pattern? And is Gradle? The requirement is for Gradle to catch it, and if it can, then I'm missing parts of the behavior of this annotation, so happy to know that.

@anuraaga (Contributor) commented Sep 29, 2021:

By the way, I think I'm just waiting on this one to merge; don't want to get confused later.

Contributor:

Or we can merge after filing an issue about it if it seems hairy; since it's about thread safety, I want to make sure it doesn't get lost.

Contributor Author:

Didn't realize this was still open (sorry, so many comments that I was missing things). Just went with synchronized for now, and we can rework the class internals over time (there are a few TODOs for it anyway).

}
}
// Now run a quick cleanup of deltas before returning.
unreportedDeltas.removeIf(delta -> delta.wasReadyByAll(collectors));
Contributor:

IIUC, one of the most significant changes from our current model is that rather than clearing all data every collection, we mark each time data has been read, and only after it's been read by all collectors do we clean up. Just want to confirm my understanding

Contributor Author:

Yes. We store semi-aggregated data every time any reader wants to collect metrics, then we reconstitute the full-delta per-reader and join w/ any cumulative storage before reporting. There's a design "markdown" from Reiley that kind of clarifies this, and I can take a crack at something. Hoping the new naming helps make this more clear.

Effectively the goal of this class is to get the Synchronous collection path (that's highly performance sensitive) into the reader thread as fast as possible. We don't care too much about reader contention because we don't expect significant numbers of reader threads.

A few improvements over time:

  • I'd like to pre-allocate delta storage and limit its size. That should be hidden in this class and would be per-metric.
  • I'd like to allow multiple readers "within interval" to avoid re-collecting synchronous metrics.
  • I'd like to join multiple MetricExporters into a single MetricReader if they all use periodic metric reader.

I can open tickets around these improvements, but wanted to make sure we're ok with how this code works now.

*
* @return the result of the flush, which is often an asynchronous operation.
*/
CompletableResultCode flush();
CompletableResultCode forceFlush();
Contributor:

I'm against renaming this method. It breaks our existing API compatibility guarantees and makes the method out of alignment with the SpanExporter interface. Is there a strong reason why we're adding the "force" to it?

Contributor Author:

I think maybe I'm misunderstanding some comments/conventions here.

I migrated all uses of flush to forceFlush to conform with SdkTracerProvider. It looks like exporters use flush and the SDK*Provider uses forceFlush.

I'll revert this specific change. Do we want SdkMeterProvider to expose forceFlush or just flush? From my perspective each is equally good, so if flush was exposed in a stable API, I'm happy to change this.

@jsuereth (Contributor Author) commented Sep 27, 2021:

Original comment (I likely took way too far): #3578 (comment)

For clarity on what we have:

  • *Exporter has:

    • flush()
    • shutdown()
  • Sdk*Provider has:

    • forceFlush()
    • close()
    • shutdown()

    If we're on board, I'll make sure metrics matches traces with that.

Contributor:

yeah, for good or ill, let's at least have the parallel interfaces have the same methods. I honestly don't remember why exporters got flush when the providers got forceFlush.

@jkwatson (Contributor) left a comment:

This seems good enough to merge and polish as we go. :shipit:

@anuraaga anuraaga merged commit 70086cf into open-telemetry:main Sep 30, 2021
@jsuereth jsuereth deleted the wip-multi-exporters branch September 30, 2021 14:16