[sdk-metrics] Exemplar spec & perf improvements #5364

Closed


@CodeBlanch (Member) commented Feb 15, 2024

Changes

ExemplarReservoir

  • The Offer method signature no longer includes the "index" parameter, which is not part of the spec. Instead it now accepts struct ExemplarMeasurement, which only exposes the value & tags (to match the spec) but also tracks the explicit histogram bucket index internally.
  • The Collect method signature no longer includes "actualTags" or "reset", which are not part of the spec. The "reset" flag has become a bool ResetOnCollect property on ExemplarReservoir. Tag filtering is now handled by struct ReadOnlyFilteredTagCollection, which is exposed on Exemplar. The return type changed from Exemplar[] to struct ReadOnlyExemplarCollection. (See the sketch below.)
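
To make the new shape concrete, here is a rough sketch of what the reshaped reservoir surface could look like based on the bullets above (illustrative only — member names, the generic measurement struct, and the default value of ResetOnCollect are assumptions, not the final public API):

```csharp
// Illustrative sketch of the reshaped ExemplarReservoir API described above; not the SDK source.
public abstract class ExemplarReservoir
{
    // Replaces the old "reset" flag passed to Collect: when true, the SDK resets
    // the reservoir after each collection cycle. Default value assumed here.
    public virtual bool ResetOnCollect => false;

    // Offer no longer takes an "index" parameter; the explicit-bucket histogram
    // index travels inside the measurement struct. A generic ExemplarMeasurement<T>
    // over long/double is assumed for illustration.
    public abstract void Offer(in ExemplarMeasurement<long> measurement);

    public abstract void Offer(in ExemplarMeasurement<double> measurement);

    // Collect no longer takes "actualTags" or "reset"; tag filtering is exposed via
    // ReadOnlyFilteredTagCollection on Exemplar, and the return type is a struct
    // collection rather than Exemplar[].
    public abstract ReadOnlyExemplarCollection Collect();
}
```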

Exemplar

  • Return type of FilteredTags property is now struct ReadOnlyFilteredTagCollection instead of List<KeyValuePair<string, object?>>?.
  • Now exposes long LongValue in addition to double DoubleValue.

SimpleFixedSizeExemplarReservoir

  • Default size changed from 1 to Environment.ProcessorCount, which is allowed by the spec.

Concurrency

  • ExemplarReservoir methods are now safe to call concurrently (required by the spec). Also fixed a number of race conditions around collect. This needs to be proven out further, probably using Coyote tests (follow-up).
  • MetricPoint had dual implementations of all aggregation types which acquired locks when exemplars were in play. This exploded the size/complexity of the code while also destroying throughput. Now that ExemplarReservoir can be called concurrently, all of that has been removed.

Misc improvements

  • When enumerated, ReadOnlyExemplarCollection only returns Exemplars which have been updated (Timestamp != default). Previously each exporter had to understand this behavior and perform the check itself (see the consumption sketch below).
  • Removed the LINQ usage and allocations from Exemplar.FilteredTags.
  • Introduced internal FixedSizeExemplarReservoir to house the logic shared by SimpleFixedSizeExemplarReservoir & AlignedHistogramBucketExemplarReservoir.
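
As a rough sketch of the consumption side under the new types (illustrative only — ExemplarDump/WriteExemplars are hypothetical helper names, and namespace/member details may differ from the final API):

```csharp
using System;
using OpenTelemetry;
using OpenTelemetry.Metrics;

internal static class ExemplarDump
{
    // Hypothetical exporter-side helper showing how the new collections are consumed.
    public static void WriteExemplars(in ReadOnlyExemplarCollection exemplars)
    {
        foreach (var exemplar in exemplars)
        {
            // The enumerator only yields exemplars that were actually updated
            // (Timestamp != default), so no manual check is required here.
            Console.WriteLine($"ts={exemplar.Timestamp:o} long={exemplar.LongValue} double={exemplar.DoubleValue}");

            foreach (var tag in exemplar.FilteredTags)
            {
                // Tags removed from the metric point by a view show up here.
                Console.WriteLine($"  filtered tag {tag.Key}={tag.Value}");
            }
        }
    }
}
```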

Benchmarks

  • When Exemplars are NOT enabled, perf is indistinguishable from before:

Before:

| Method | AggregationTemporality | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|
| CounterHotPath | Cumulative | 10.33 ns | 0.128 ns | 0.107 ns | - |
| CounterHotPath | Delta | 10.85 ns | 0.169 ns | 0.158 ns | - |

After:

| Method | AggregationTemporality | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|
| CounterHotPath | Cumulative | 10.48 ns | 0.136 ns | 0.127 ns | - |
| CounterHotPath | Delta | 10.71 ns | 0.196 ns | 0.184 ns | - |
  • The Exemplar benchmarks show slight improvement:

Before:

| Method | ExemplarFilter | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|
| HistogramNoTagReduction | AlwaysOff | 176.0 ns | 1.30 ns | 1.15 ns | - |
| HistogramWithTagReduction | AlwaysOff | 168.7 ns | 3.40 ns | 3.91 ns | - |
| HistogramNoTagReduction | AlwaysOn | 214.2 ns | 4.31 ns | 6.84 ns | - |
| HistogramWithTagReduction | AlwaysOn | 230.4 ns | 3.77 ns | 3.52 ns | - |
| HistogramNoTagReduction | HighValueOnly | 192.7 ns | 1.91 ns | 1.59 ns | - |
| HistogramWithTagReduction | HighValueOnly | 188.6 ns | 3.72 ns | 4.56 ns | - |

After:

| Method | ExemplarFilter | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|
| HistogramNoTagReduction | AlwaysOff | 170.3 ns | 1.54 ns | 1.44 ns | - |
| HistogramWithTagReduction | AlwaysOff | 164.3 ns | 1.88 ns | 1.67 ns | - |
| HistogramNoTagReduction | AlwaysOn | 221.7 ns | 3.29 ns | 3.08 ns | - |
| HistogramWithTagReduction | AlwaysOn | 214.4 ns | 3.11 ns | 2.91 ns | - |
| HistogramNoTagReduction | HighValueOnly | 185.8 ns | 3.12 ns | 2.92 ns | - |
| HistogramWithTagReduction | HighValueOnly | 177.1 ns | 3.29 ns | 3.23 ns | - |
  • Stress test shows massive gains:

The Exemplar benchmarks above are misleading, though. The Exemplar implementation before this PR forced all aggregation types into locking mode inside MetricPoint, which destroys throughput for all non-histograms. That doesn't really show up in the benchmarks because they run a single thread hammering measurements.

Here are some stress test results:

Before:

  • Total Runaway Time (seconds) 300
  • Total Loops: 4,080,616,288
  • Average Loops/Second: 13,613,263
  • Average CPU Cycles/Loop: 511
  • GC Total Allocated Bytes: 15181888

After:

  • Total Runaway Time (seconds) 300
  • Total Loops: 9,105,818,639
  • Average Loops/Second: 30,314,633
  • Average CPU Cycles/Loop: 1,843
  • GC Total Allocated Bytes: 15048856

Those numbers come from running the stress test using a Counter<long> + a view filtering tags + AlwaysOnExemplarFilter + an OtlpExporter tweaked to send exemplars for counters, exporting every 1000 ms (see the setup sketch below).
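
For context, here is a minimal sketch of that kind of setup (illustrative only — the exemplar-filter registration was an experimental API at the time and has changed across SDK versions, the "tweaked" OtlpExporter is not shown, and the meter/instrument/tag names are made up):

```csharp
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

var meter = new Meter("StressTest");
var counter = meter.CreateCounter<long>("stress_counter");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter(meter.Name)
    // View keeping only "DimName1"; other incoming tags surface as FilteredTags on exemplars.
    .AddView("stress_counter", new MetricStreamConfiguration { TagKeys = new[] { "DimName1" } })
    // Exemplar filter registration (experimental at the time of this PR).
    .SetExemplarFilter(new AlwaysOnExemplarFilter())
    .AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        // Export every 1000 ms, as described above.
        readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 1000;
    })
    .Build();

// The stress test hammers a call like this from many threads.
counter.Add(1, new("DimName1", "value1"), new("OtherDim", "x"));
```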

This new Exemplar implementation blows away the old one due to the concurrency changes. The reduction in allocations is due to the FilteredTags improvements.

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes
  • Changes in public API reviewed (if applicable)

@CodeBlanch added the pkg:OpenTelemetry and metrics labels Feb 15, 2024

codecov bot commented Feb 15, 2024

Codecov Report

Attention: Patch coverage is 78.75536% with 99 lines in your changes missing coverage. Please review.

Project coverage is 84.01%. Comparing base (6250307) to head (4512b49).
Report is 110 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5364      +/-   ##
==========================================
+ Coverage   83.38%   84.01%   +0.62%     
==========================================
  Files         297      283      -14     
  Lines       12531    11972     -559     
==========================================
- Hits        10449    10058     -391     
+ Misses       2082     1914     -168     
| Flag | Coverage Δ |
|---|---|
| unittests | ? |
| unittests-Instrumentation-Experimental | 24.58% <0.00%> (?) |
| unittests-Instrumentation-Stable | 24.58% <0.00%> (?) |
| unittests-Solution-Experimental | 83.97% <78.75%> (?) |
| unittests-Solution-Stable | 83.96% <78.75%> (?) |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...penTelemetry/Metrics/Exemplar/ExemplarReservoir.cs | 100.00% <100.00%> (ø) |
| ...trics/Exemplar/SimpleFixedSizeExemplarReservoir.cs | 100.00% <100.00%> (ø) |
| src/OpenTelemetry/Metrics/Metric.cs | 96.77% <100.00%> (+0.25%) ⬆️ |
| ...Telemetry/Metrics/MetricPointOptionalComponents.cs | 100.00% <100.00%> (+22.22%) ⬆️ |
| src/OpenTelemetry/Metrics/MetricReaderExt.cs | 91.50% <100.00%> (+1.50%) ⬆️ |
| ...OpenTelemetry/Metrics/MetricStreamConfiguration.cs | 80.76% <100.00%> (+5.76%) ⬆️ |
| src/OpenTelemetry/ReadOnlyFilteredTagCollection.cs | 100.00% <100.00%> (ø) |
| src/OpenTelemetry/ReadOnlyTagCollection.cs | 100.00% <100.00%> (ø) |
| ...xemplar/AlignedHistogramBucketExemplarReservoir.cs | 60.00% <57.14%> (-32.86%) ⬇️ |
| ...nTelemetry/Metrics/Exemplar/ExemplarMeasurement.cs | 63.63% <63.63%> (ø) |

... and 7 more

... and 48 files with indirect coverage changes

@CodeBlanch marked this pull request as ready for review February 15, 2024 23:09
@CodeBlanch requested a review from a team February 15, 2024 23:09
@Yun-Ting (Contributor) commented:

Happy to see the refactoring/introduction of ExemplarReservoir/FixedSizeExemplarReservoir to make Exemplar more extensible and spec-compliant!
I can help with adding multi-threaded test coverage for ExemplarReservoir as a follow-up!

    switch (this.aggType)
    {
        case AggregationType.LongSumIncomingDelta:
        {
            this.mpComponents!.AcquireLock();
Member (commenting on the code above):

The current code was using a lock (implemented with CAS calls/spin-wait, not the lock() statement) and then doing everything inside that lock: updating the sum and doing the exemplar storage.

Now we don't take that lock, but instead use multiple Interlocked calls: one to update the sum, one to update measurementsSeen, one (or possibly more) to get the right buffer, and one more to make sure no one else is updating simultaneously.

Despite this, the perf gain in the stress test is significant. What could be a good explanation for this?

Member Author (@CodeBlanch):

There is just a lot less contention/spinning going on.

The stress test runs 20 threads on my machine, all hammering measurements to the same counter. The measurements all write the same tag "DimName1" with one of ten random values.

Before this PR, if two measurements happened on the same tag value, one of those threads had to spin for the duration of the other's value update + exemplar update, and the exemplar update can be very slow. Best case, you wait, get a random value, and then don't record. Worst case, you wait, get a random value, and then record, which is a bunch of writes.

After this PR, all measurements do an Interlocked operation for the value-record portion, so that spin is eliminated. Then they enter exemplar world and get a thread-local random. If there isn't going to be a record, the thread is done. If there is a record, we now have Environment.ProcessorCount buckets available and each thread may write to one of those buckets without blocking anything. If some other thread happens to be writing to the same bucket, we just abort the record (we don't spin). Essentially, in this design there is no spinning for a measurement, so the threads spend more time doing work and we get better throughput (see the sketch below).
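
A minimal sketch of that claim-or-abort idea (this models the behavior described above and is not the SDK's actual MetricPoint/reservoir code — the bucket selection, field names, and types are assumptions):

```csharp
using System;
using System.Threading;

// Illustrative model: per-processor-count buckets, claimed with a CAS; a busy
// bucket causes the exemplar offer to be dropped rather than spinning.
internal sealed class BucketedExemplarStore
{
    private readonly ExemplarSlot[] slots = new ExemplarSlot[Environment.ProcessorCount];

    public void Offer(double value)
    {
        // Pick a bucket; the real reservoir distributes writers differently.
        int index = Environment.CurrentManagedThreadId % this.slots.Length;
        ref ExemplarSlot slot = ref this.slots[index];

        // Try to claim the slot. If another thread is mid-write, abort the record
        // instead of spinning, so measurement threads never block here.
        if (Interlocked.CompareExchange(ref slot.Busy, 1, 0) != 0)
        {
            return; // contended: drop this exemplar candidate
        }

        slot.Value = value;
        slot.Timestamp = DateTime.UtcNow;

        Volatile.Write(ref slot.Busy, 0); // release the slot
    }

    private struct ExemplarSlot
    {
        public int Busy;
        public double Value;
        public DateTime Timestamp;
    }
}
```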

Member:

Thanks, that explanation does make sense.

As you can see, my original intuition was that, if we are going to need to go through multiple Interlocked operations, we would be better off getting a lock equivalent via CAS, doing the work normally (simple additions instead of Interlocked.Add and so on), and then releasing the lock.

I'll also run the stress test on a weaker machine (4 cores) to see if we still see some gains!
Can you add the stress test you used to this PR (we can remove it before merging to avoid bloating the stress tests)?

Member:

How about histograms? Even in this PR, Update() is going to enter/exit a critical section, and then for the exemplar we do that again. Would we be better off doing the exemplar work inside the already-entered critical section? The default histogram reservoir doesn't have much overhead (there is no random generation, just an update at the already-known bucket index).

Member Author (@CodeBlanch):

I ran stress tests against the latest version of this PR.

  • Counters are improved in all cases.
  • Histogram
    • Improved when there is contention (lock is held for less time).
    • Slightly degraded when there isn't a lot of contention. This is due to a few extra operations now being executed in the exemplar path. I could special-case something to fix this, but if you look at the overall throughput of histograms vs counters, they suck 🤣 I would like to look at that in a follow-up and see if I can use the same active/passive buffer tricks to help histograms across the board.

Counters

New version using counter w/ tag-filtering view + exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 9,899,082,415
  • Average Loops/Second: 32,964,636
  • Average CPU Cycles/Loop: 1,727
  • GC Total Allocated Bytes: 10643640

Old version using counter w/ tag-filtering view + exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 5,076,211,789
  • Average Loops/Second: 16,909,264
  • Average CPU Cycles/Loop: 869
  • GC Total Allocated Bytes: 11098120

New version using counter w/ exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 10,881,621,737
  • Average Loops/Second: 36,250,443
  • Average CPU Cycles/Loop: 1,573
  • GC Total Allocated Bytes: 14448912

Old version using counter w/ exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 8,958,303,265
  • Average Loops/Second: 29,855,139
  • Average CPU Cycles/Loop: 1,855
  • GC Total Allocated Bytes: 11546904

Histograms

New version using histogram w/ tag-filtering view + exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 3,069,842,084
  • Average Loops/Second: 10,227,489
  • Average CPU Cycles/Loop: 3,781
  • GC Total Allocated Bytes: 11073856

Old version using histogram w/ tag-filtering view + exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 2,160,807,746
  • Average Loops/Second: 7,197,558
  • Average CPU Cycles/Loop: 2,160
  • GC Total Allocated Bytes: 16985496

New version using histogram w/ exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 5,576,901,325
  • Average Loops/Second: 18,583,167
  • Average CPU Cycles/Loop: 3,001
  • GC Total Allocated Bytes: 177120704

Old version using histogram w/ exemplars:

  • Total Runway Time (seconds) 300
  • Total Loops: 5,767,724,875
  • Average Loops/Second: 19,220,944
  • Average CPU Cycles/Loop: 2,733
  • GC Total Allocated Bytes: 265700816


github-actions bot commented Mar 2, 2024

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions bot added the Stale label Mar 2, 2024

github-actions bot commented Mar 9, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.
