
ingest: translate known OpenTelemetry JVM metrics #4986

Merged: 12 commits, Apr 1, 2021

Conversation

@axw (Member) commented Mar 24, 2021

Motivation/summary

Copy well-known JVM metrics from their OpenTelemetry
names and associated labels to their counterparts
produced by the Elastic APM Java agent. This enables
users to visualise the metrics in the Elastic APM
app in Kibana.

Metrics are duplicated to avoid surprising users by
silently dropping the OpenTelemetry names. Users can
drop either one of the duplicates with ingest pipeline
customisation.
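The copying described above can be pictured as a Painless script processor in an ingest pipeline. The sketch below is illustrative only, using runtime.jvm.gc.time as an example; the actual pipeline in this PR covers more metrics and their associated labels:

```json
{
  "script": {
    "if": "ctx.runtime?.jvm?.gc?.time != null",
    "source": "ctx.jvm = ctx.jvm ?: [:]; ctx.jvm.gc = ctx.jvm.gc ?: [:]; ctx.jvm.gc.time = ctx.runtime.jvm.gc.time;"
  }
}
```

Because the processor copies rather than renames, both field names remain on the document; a user who wants only one could append a `remove` processor (with `ignore_missing: true`) for the unwanted field in a customised pipeline.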

Checklist

How to test these changes

  1. Run apm-server
  2. Instrument a Java application with https://github.com/open-telemetry/opentelemetry-java-instrumentation, pointed at apm-server
  3. Wait for metrics to be collected and sent to APM Server (not sure how long this is)
  4. Navigate to the APM app in Kibana, and check the GC and memory usage graphs are populated

See also #4919

Related issues

Closes #4919

@apmmachine (Contributor) commented Mar 24, 2021

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #4986 updated

  • Start Time: 2021-04-01T00:27:45.130+0000

  • Duration: 60 min 3 sec

  • Commit: 5ddf0a5

Test stats 🧪

Test Results
Failed 0
Passed 6278
Skipped 120
Total 6398


@axw axw marked this pull request as ready for review March 24, 2021 08:52
@axw axw requested review from felixbarny and a team March 24, 2021 08:52
@axw (Member, Author) commented Mar 24, 2021

@bmorelli25 I haven't added any docs here, not sure where would be appropriate?

@codecov-io commented Mar 24, 2021

Codecov Report

Merging #4986 (fdce970) into master (efb4c16) will increase coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4986      +/-   ##
==========================================
+ Coverage   77.26%   77.28%   +0.01%     
==========================================
  Files         179      179              
  Lines       10486    10486              
==========================================
+ Hits         8102     8104       +2     
+ Misses       2384     2382       -2     
Impacted Files Coverage Δ
beater/jaeger/common.go 91.66% <0.00%> (+8.33%) ⬆️

@felixbarny (Member) left a comment

Why does the ingest pipeline get applied to all data streams? I'd expect it to only be in /data_stream/app_metrics/elasticsearch/ingest_pipeline/apm_opentelemetry_metrics.json

@bmorelli25 (Member) commented Mar 24, 2021

@bmorelli25 I haven't added any docs here, not sure where would be appropriate?

As this is a new default ingest pipeline, I'd like to see docs added here: https://github.com/elastic/apm-server/blob/master/docs/configuring-ingest.asciidoc. Could you add a new entry to the table that includes an added badge: added::[7.13]?

We'll likely want to at least note this in our OTel docs, so I've added a note to elastic/observability-docs#467.

@axw (Member, Author) commented Mar 25, 2021

Could you add a new entry to the table that includes an added badge: added::[7.13].

Thanks @bmorelli25, will do.

Why does the ingest pipeline get applied to all data streams? I'd expect it to only be in /data_stream/app_metrics/elasticsearch/ingest_pipeline/apm_opentelemetry_metrics.json

@felixbarny that would be ideal. With the old way of doing things, we only ever have a single pipeline. While we maintain both the old and new ways in tandem, we're just copying the single pipeline to all data streams in the integration package. In the future it would make sense to be more targeted.

@axw axw requested a review from bmorelli25 March 25, 2021 00:44
@axw (Member, Author) commented Mar 25, 2021

@bmorelli25 tagging you for review on the docs change. I also added an entry for another recently introduced ingest pipeline. Not sure how much of a pain it would be, but perhaps we should document the specific mapping when we update the OTel docs. Then in the ingest pipeline docs page we can link there.

@axw axw added the v7.13.0 label Mar 25, 2021
@bmorelli25 (Member) left a comment

Not sure how much of a pain it would be, but perhaps we should document the specific mapping when we update the OTel docs. Then in the ingest pipeline docs page we can link there.

👍 Updated the doc issue accordingly.

Review thread on docs/configuring-ingest.asciidoc: resolved.
@bmorelli25 (Member) left a comment

Docs look good! Thanks.

@felixbarny (Member) left a comment

we're just copying the single pipeline to all data streams in the integration package.

Is that a limitation in APM Server or in data streams?

In the future it would make sense to be more targeted.

Is there an issue to track that?

@axw (Member, Author) commented Mar 25, 2021

we're just copying the single pipeline to all data streams in the integration package.

Is that a limitation in APM Server or in data streams?

APM Server. We're doing this for our own sanity, having a single source of truth for the pipeline.

In the future it would make sense to be more targeted.

Is there an issue to track that?

There wasn't, but I've just now added an item to #2631.

mergify bot (Contributor) commented Mar 26, 2021

This pull request is now in conflict. Could you fix it @axw? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b otel-jvm-metrics upstream/otel-jvm-metrics
git merge upstream/master
git push upstream otel-jvm-metrics

@axw (Member, Author) commented Mar 29, 2021

jenkins run the tests please

axw added 2 commits March 30, 2021 14:19
- Move `runtime.jvm.gc.collection` before `runtime.jvm.gc.time`
- Require `ctx.runtime?.jvm?.gc != null` when copying `labels.gc`
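The second bullet corresponds to a null-safe guard in the script processor's `if` condition. A sketch of the idea (the target field labels.name is an assumption for illustration; see the pipeline source for the exact copy):

```json
{
  "script": {
    "if": "ctx.runtime?.jvm?.gc != null && ctx.labels?.gc != null",
    "source": "ctx.labels.name = ctx.labels.gc;"
  }
}
```

The Painless `?.` operator short-circuits to null when any intermediate object is missing, so documents without JVM GC metrics skip the copy instead of throwing a null-pointer error.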
@axw axw requested a review from simitt March 31, 2021 03:48
mergify bot (Contributor) commented Mar 31, 2021

This pull request is now in conflict. Could you fix it @axw? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b otel-jvm-metrics upstream/otel-jvm-metrics
git merge upstream/master
git push upstream otel-jvm-metrics

@axw (Member, Author) commented Mar 31, 2021

jenkins run the tests please

mergify bot (Contributor) commented Mar 31, 2021

This pull request is now in conflict. Could you fix it @axw? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b otel-jvm-metrics upstream/otel-jvm-metrics
git merge upstream/master
git push upstream otel-jvm-metrics

@axw axw merged commit 264d151 into elastic:master Apr 1, 2021
@axw axw deleted the otel-jvm-metrics branch April 1, 2021 01:32
mergify bot pushed a commit that referenced this pull request Apr 1, 2021
* ingest: translate known OpenTelemetry JVM metrics

* Add changelog

* docs: document new ingest pipelines

* docs: move existing docs to correct position

* ingest: address review comments

- Move `runtime.jvm.gc.collection` before `runtime.jvm.gc.time`
- Require `ctx.runtime?.jvm?.gc != null` when copying `labels.gc`

* Regenerate integration package

(cherry picked from commit 264d151)

# Conflicts:
#	changelogs/head.asciidoc
axw added a commit that referenced this pull request Apr 1, 2021
* ingest: translate known OpenTelemetry JVM metrics (#4986)

* ingest: translate known OpenTelemetry JVM metrics

* Add changelog

* docs: document new ingest pipelines

* docs: move existing docs to correct position

* ingest: address review comments

- Move `runtime.jvm.gc.collection` before `runtime.jvm.gc.time`
- Require `ctx.runtime?.jvm?.gc != null` when copying `labels.gc`

* Regenerate integration package

(cherry picked from commit 264d151)

# Conflicts:
#	changelogs/head.asciidoc

* Delete head.asciidoc

Co-authored-by: Andrew Wilkins <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mergify bot pushed a commit that referenced this pull request Apr 27, 2021
* ingest: translate known OpenTelemetry JVM metrics

* Add changelog

* docs: document new ingest pipelines

* docs: move existing docs to correct position

* ingest: address review comments

- Move `runtime.jvm.gc.collection` before `runtime.jvm.gc.time`
- Require `ctx.runtime?.jvm?.gc != null` when copying `labels.gc`

* Regenerate integration package

(cherry picked from commit 264d151)

# Conflicts:
#	changelogs/7.13.asciidoc
#	docs/configuring-ingest.asciidoc
#	systemtest/otlp_test.go
@simitt (Contributor) commented May 12, 2021

Tested with BC4:
Tested with BC4: instrumented a Java test app with the OTel instrumentation and waited until metrics documents existed. No metrics showed up in the APM UI. @axw, is there anything else I need to set/enable?

Update:

  • agent.name needs to be java (not opentelemetry/java) so that the JVM tab shows up
  • at a minimum, jvm.memory.heap.used needs to exist for JVM details to be shown.

@simitt simitt self-assigned this May 12, 2021
@axw (Member, Author) commented May 13, 2021

agent.name needs to be java (not opentelemetry/java) so that the JVM tab shows up

Ugh :(
I don't think we should be changing the agent name, so this part will probably need to happen in the UI.

minimum jvm.memory.heap.used needs to exist for JVM details to be shown.

Hmm. Is there just one metric document? We should be capturing jvm.memory.heap.used. Maybe what's happening is that we're splitting the metrics into multiple docs - would that explain this?

@felixbarny (Member) left a comment

What might also be missing is agent.ephemeral_id. IIRC, the UI needs that for downsampling/calculating derivatives for the GC time and count.

@simitt (Contributor) commented May 18, 2021

Created elastic/kibana#100253 as a follow-up for showing OpenTelemetry JVM metrics.

Hmm. Is there just one metric document? We should be capturing jvm.memory.heap.used. Maybe what's happening is that we're splitting the metrics into multiple docs - would that explain this?

Yes, a single document is created per jvm.memory.heap metric.
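For illustration only (field names, timestamps, and values are assumptions), such a per-metric split would mean sibling documents like these where the UI may expect a single combined document:

```json
{ "@timestamp": "2021-05-12T13:00:00.000Z", "jvm": { "memory": { "heap": { "used": 52428800 } } } }
{ "@timestamp": "2021-05-12T13:00:00.000Z", "jvm": { "memory": { "heap": { "committed": 104857600 } } } }
```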

@simitt (Contributor) commented May 18, 2021

Created a follow-up for the metrics docs: #5275.

@axw (Member, Author) commented May 24, 2021

What might also be missing is agent.ephemeral_id. IIRC, the UI needs that for downsampling/calculating derivatives for the GC time and count.

@felixbarny there's a TypeScript definition for ephemeral_id, but no other code in the UI that actually references the field in aggs or otherwise. Does that indicate a UI bug?

@felixbarny (Member) left a comment

Does that indicate a UI bug?

Looking at the aggs that Kibana does, I think it doesn't work as I'd expect

https://github.com/elastic/kibana/blob/1f02c48d3cf57dd921cfe64186a695f8375e3954/x-pack/plugins/apm/server/lib/metrics/by_agent/java/gc/fetch_and_transform_gc_metrics.ts#L70-L109

It does a derivative over the max value of gc.count across instances as opposed to summing the derivatives of each instance.

I guess it would be easier for the UI if the Java agent reported deltas for gc.time instead of a monotonically incrementing counter. However, it might be very tricky to adjust the OTel metrics to report deltas.
IIRC, there's a setting in OTel that allows reporting deltas instead of monotonically incrementing counters, but I'm not sure whether that setting also applies to the GC metrics that are captured via a longSumObserver.
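The aggregation shape suggested above, summing per-instance derivatives rather than differentiating a cross-instance max, could be sketched in query DSL roughly as follows (field names and the interval are assumptions, not the UI's actual query):

```json
{
  "size": 0,
  "aggs": {
    "per_instance": {
      "terms": { "field": "agent.ephemeral_id" },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "30s" },
          "aggs": {
            "max_gc_count": { "max": { "field": "jvm.gc.count" } },
            "gc_count_delta": { "derivative": { "buckets_path": "max_gc_count" } }
          }
        }
      }
    }
  }
}
```

The per-instance deltas would then be summed across the `per_instance` buckets, so one instance's counter resetting (e.g. on restart) does not distort the combined series.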

@simitt (Contributor) commented May 26, 2021

@felixbarny do you mind moving this over to a new issue? It sounds like it'd be worth a follow-up discussion.

@felixbarny (Member) left a comment

After giving it another thought, it's probably not an issue as that chart is only ever shown in the context of a specific instance. Therefore, the ephemeral_id is not needed.

Successfully merging this pull request may close these issues:

  • Visualize JVM health indicators collected by the OpenTelemetry Java Agent