Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry bridge #1631

Merged
merged 117 commits into from
Mar 21, 2022
Merged

OpenTelemetry bridge #1631

merged 117 commits into from
Mar 21, 2022

Conversation

felixbarny
Copy link
Member

@felixbarny felixbarny commented Jan 25, 2021

What does this PR do?

Checklist

  • This is an enhancement of existing features, or a new feature in existing plugins
    • I have updated CHANGELOG.asciidoc
    • I have added tests that prove my fix is effective or that my feature works
    • Added an API method or config option? Document in which version this will be introduced
    • I have made corresponding changes to the documentation
  • This is a bugfix
  • This is a new plugin
    • I have updated CHANGELOG.asciidoc
    • My code follows the style guidelines of this project
    • I have made corresponding changes to the documentation
    • I have added tests that prove my fix is effective or that my feature works
    • New and existing unit tests pass locally with my changes
    • I have updated supported-technologies.asciidoc
    • Added an API method or config option? Document in which version this will be introduced
    • Added an instrumentation plugin? Describe how you made sure that old, non-supported versions are not instrumented by accident.
      That's ensured by matching on the version
  • This is something else

Other TODOs

  • Map Elastic APM specific attributes, similar to the OpenTelemetry bridge

@apmmachine
Copy link
Contributor

apmmachine commented Jan 25, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-03-21T09:28:48.078+0000

  • Duration: 44 min 17 sec

Test stats 🧪

Test Results
Failed 0
Passed 2840
Skipped 20
Total 2860

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run benchmark tests : Run the benchmark tests.

  • run jdk compatibility tests : Run the JDK Compatibility tests.

  • run integration tests : Run the Agent Integration tests.

  • run end-to-end tests : Run the APM-ITs.

  • run windows tests : Build & tests on windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@AlexanderWert AlexanderWert added this to the 7.13 milestone Jan 25, 2021
@AlexanderWert AlexanderWert removed this from the 7.13 milestone Jan 25, 2021
@felixbarny
Copy link
Member Author

This is feature complete now. It still needs some polish. The checkboxes in the issue description reflect the remaining todos.

One thing to note is that I've changed how experimental plugins are handled.
The disable_instrumentations option now doesn't control whether or not experimental plugins are enabled. There's a separate option for it now: enable_experimental_instrumentations (default: false). I didn't change the underlying mechanism though. The enable_experimental_instrumentations setting controls whether experimental is added or removed from the disable_instrumentations list.
I can easily revert that but I like it better that way. Especially because now you don't have to remember to append the default value of experimental when disabling a specific instrumentation (for example disable_instrumentations=experimental,executor-service). It has an impact on users that have enabled experimental plugins by setting disable_instrumentations=. I wouldn't consider that to be a breaking change though as by enabling experimental plugins is users opt-into something experimental by definition. We'll have to highlight this in the release notes though.

@SylvainJuge
Copy link
Member

FYI I have updated version to 1.30.0-SNAPSHOT as OTel bridge is documented for 1.30.0 version.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic! 👏
Mostly minor comments.
I didn't verify everything is covered properly through tests.


public OTelSpan(AbstractSpan<?> span) {
this.span = span;
span.incrementReferences();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that balanced through span end/report or is that an intentional span leak so to avoid recycling (like we do with public-API-touched spans)?
If it is the latter, we should consider somehow making a distinction between library-inherent API usage and custom instrumentations, not entirely sure how.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an "intentional leak" to have a similar behavior like our Agent Public API where spans/transactions are not recycled.
Given the OTel API provides ways to wrap an active span into Callable or Runnable there are less opportunities for end-users to mis-use it, so maybe in the future we could optionally enable recycling.

We don't have any way to distinguish OTel API usages, so "custom instrumentation" and "library that explicitly uses it", at best we could make some heuristics by the provided fields and values but that sounds quite brittle. What would be the benefit to see a difference between the two ?

Copy link
Contributor

@eyalkoren eyalkoren Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit to see a difference between the two ?

That inherent library instrumentations are less likely to misuse the API.
I'd like to provide the absolutely minimal overhead for such, mainly thinking on possible ES instrumentation, but not only

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not only about misusing the API. It's valid for a span to be used as a parent after it has ended. This would set the reference count to 0, yet it can't be recycled as it may be used as a parent of another span.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then maybe we can do something even safer that will move the lifecycle management responsibility from us to the using library through the API, so that "advanced" libraries can optionally tell us when a span can be recycled.
I am not sure about the feasibility of convincing OTel to add an optional recycle()/dispose() functionality to the API, but we can use the current API, e.g. with a special "recycle" event through addEvent, or a second invocation of end.
It may be a significant advantage if we can pull off recycling for "specialist" libraries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We thought of a solution that may be as efficient and simple to get done: creating a tiny "OTel API extension" that contains not much more than Tracer.recycle(Object) with a noop implementation that can be overridden dynamically with a concrete implementation. Then it's both efficient, robust and doesn't interfere with agent vendor.
But first thing is to measure what the actual overhead of not recycling is.

Comment on lines +112 to +116
@Nullable
private OTelSpanKind otelKind = null;

private final Map<String, Object> otelAttributes = new HashMap<>();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this is here and not in OTelSpan? I assume it's because of serialization and trying to follow are the guidelines:

  1. we don't want a separate OTel branch under AbstractSpan, so we cannot extend or implement the internal types in the OTel plugin
  2. OTelSpan wraps an AbastractSpan, so it knows about it but not vice versa

I would be happy if we can lose any OTel stuff in our internal code as much as possible.

With the attributes it's easy since it's only a Map<String, Object> so just renaming to attributes would do. For the otelKind it's also easy - since we only need its toString implementation, the field can be of type Object and renamed kind and OTelSpanKind can go into the plugin.

However, this will not accommodate future additions (like OTel events?) and it doesn't lose the OTel-specific serialization from our general-purpose serializer.

Proposal

Maybe a more generic solution is to introduce another interface to the co.elastic.apm.agent.impl.context package:

interface CustomContext extends Recyclable
{
  serialize(JsonWriter);
}

The plugin will implement it with attributes and kind and the serialize implementation would be the current DslJsonSerializer#serializeOTel implementation. It may eliminate the need to copy the attributes to an additional map.
It will be another AbstractSpan field that will be serialized in the appropriate order.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to look at an alternative for this, maybe close to what you suggested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm skeptical that pulling out the attributes out of AbstractSpan is possible. Think about the case where someone uses the OTel API to add a custom attribute to a span or transaction that has been created by the auto-instrumentation of the agent.
The OTelSpan is just an ephemeral bridge object where we should think twice before adding state to it.

Copy link
Contributor

@eyalkoren eyalkoren Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, we shouldn't add a state to OTelSpan, the AbstractSpan would still own the bridge context. It's not where the data resides, only where the implementation resides. I prefer the bridge to maintain this part of the context, which is tightly coupled with the OTel API.

This is what I propose (calling it FreeFormContext because the internal context already has room for custom context):
We add another context type that has a very loose API:

interface FreeFormContext extends Recyclable
{
  serialize(JsonWriter);
}

Anything that implements it is also responsible for proper serialization of it. Not sure about recycling, but I think that as long as we don't allow multiple types of that within the same runtime, it can be set once and then only recycled. That is only relevant if we really find a proper way to recycle these spans of course.

Then we add another context property to spans:

public abstract class AbstractSpan<T extends AbstractSpan<T>> implements Recyclable {
    ....
    public abstract AbstractContext getContext();
    ....
    public abstract void setFreeFormContext(FreeFormContext);
    public abstract FreeFormContext getFreeFormContext();
}

The bridge will have a OTelSpanContext implements FreeFormContext and instead of doing:

span.getOtelAttributes().put("key", "value");

OTelSpan would do something like:

((OTelSpanContext)span.getFreeFormContext()).setAttribute("key", "value");

I hope this makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean here, but I'm not sure that we can generically handle the JSON serialization part in an elegant way.
We could even allow the plugin to create dedicated sub-classes of Transaction and Span for that purpose.

For example, we have to assume that the serialized form will always be in the same place in JSON.
In this case, we have an extra otel attribute: span.otel object with span.otel.kind and span.otel.attributes, but any other implementation could use a different structure.

While I like the idea to keep the core of the agent as OTel-agnostic as possible, I don't think that we should handle this right now:

  • we only have a single implementation, thus it's probably too early to create an abstraction that will fit a future abstraction.
  • I don't see any other possible implementation candidate for this.
  • it's only a couple of fields, thus it seems an acceptable compromise, and we can always revisit this later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I'm not sure that we can generically handle the JSON serialization part in an elegant way.

Wouldn't a copy+paste of the current serialization work without any change?

For example, we have to assume that the serialized form will always be in the same place in JSON.

JSON relies on nesting. You can assume that you are in the right place within the "parent" node serialization because it is the one invoking you and once you are invoked, you serialize the nested tree as you know it should be.

While I like the idea to keep the core of the agent as OTel-agnostic as possible, I don't think that we should handle this right now: ...

I see it exactly the opposite 😄
Even though it seems it is going this direction, we don't yet know whether using us with OTel is going to be the absolute norm. It seemed to go there with OpenTracing before. Currently we add fields to our core to support niche cases.
I don't think about other implementations, it's not about abstraction for the purpose of polymorphism, it's abstraction for the purpose of proper responsibility separation - the bridge should know about the core, the core should not know about the bridge.

That said, I am only explaining to emphasize my rationale, it's not something I will insist on so feel free to make the call and leave as is.

@@ -0,0 +1,244 @@
@opentelemetry-bridge
Feature: OpenTelemetry bridge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

otelTracer = openTelemetry.getTracer(null);

// otel spans are not recycled for now
disableRecyclingValidation();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so this answers my question above - we leak those on purpose.
Makes sense in general.
It looks like it will be quite difficult to maintain refs properly. Paying this extra overhead for arbitrary manual usage of the API worth it, but ideally we could also have a more optimized mode for libraries that embed OTel API usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this answers my previous question on the distinction between manual instrumentation and calls to the OTel API that are already in an existing library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this answers my previous question on the distinction between manual instrumentation and calls to the OTel API that are already in an existing library.

Just like happened to you, I only got here now 😄

docs/apis.asciidoc Outdated Show resolved Hide resolved
docs/api-opentelemetry.asciidoc Outdated Show resolved Hide resolved
@SylvainJuge SylvainJuge added the await-release Mark issues that depend on next release, or PRs that are planned to be included label Mar 14, 2022
@AlexanderWert AlexanderWert modified the milestones: 8.0, 8.2 Mar 15, 2022
Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed the separation of responsibilities both here and offline and that's good enough for me. It's definitely not critical, only a design perspective, so I am perfectly fine with leaving it as is.

@SylvainJuge SylvainJuge enabled auto-merge (squash) March 21, 2022 09:28
@SylvainJuge SylvainJuge merged commit ddb8c69 into elastic:main Mar 21, 2022
@SylvainJuge SylvainJuge removed the await-release Mark issues that depend on next release, or PRs that are planned to be included label May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent-java focus instrumentation Instrumentation: framework support, custom plugins, ... new-feature New feature size:large Large (L) tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[META 524] OpenTelemetry Bridge
10 participants