Trace agent fails to start on arm64 nodes #6635

Closed
dan-lind opened this issue Feb 7, 2024 · 9 comments

Comments

@dan-lind

dan-lind commented Feb 7, 2024

We have been running with the java dd-trace-agent on amd64 nodes for a couple of years, without any issues. We recently started running our containers on a combination of amd64 and arm64 nodes. We are now seeing that the agent fails to start properly on the arm64 nodes. 

We see very long Java stack traces from the agent, and since the agent doesn't seem to start properly, each line of a stack trace arrives as a separate log record rather than the whole trace arriving as a single record.

I found this issue, #4702, which looks similar and mentions that a fix was released, but I guess this might be a different issue.

We are running with dd-trace-agent 1.29, datadog-agent and cluster-agent 7.50.3. Below is an example of the stack-trace.

Failed to upload profile to http://localhost:8126/profiling/v1/input java.io.IOException: canceled due to 
java.lang.UnsatisfiedLinkError: could not load FFI provider jnr.ffi.provider.jffi.Provider (Will not log warnings for 5 minutes)
--
Uncaught exception java.lang.UnsatisfiedLinkError: could not load FFI provider jnr.ffi.provider.jffi.Provider in dd-profiler-http-dispatcher
java.lang.UnsatisfiedLinkError: could not load FFI provider jnr.ffi.provider.jffi.Provider
at jnr.ffi.provider.InvalidProvider$1.loadLibrary(InvalidProvider.java:49)
at jnr.ffi.LibraryLoader.load(LibraryLoader.java:420)
at jnr.unixsocket.Native.<clinit>(Native.java:80)
at jnr.unixsocket.UnixSocketChannel.<init>(UnixSocketChannel.java:101)
at jnr.unixsocket.UnixSocketChannel.open(UnixSocketChannel.java:60)
at datadog.common.socket.UnixDomainSocketFactory.createSocket(UnixDomainSocketFactory.java:27)
at datadog.okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:241)
at datadog.okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
at datadog.okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
at datadog.okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
at datadog.okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
at datadog.okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at datadog.okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at datadog.okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at datadog.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at datadog.okhttp3.RealCall.execute(RealCall.java:93)
at datadog.communication.ddagent.DDAgentFeaturesDiscovery.doDiscovery(DDAgentFeaturesDiscovery.java:146)
at datadog.communication.ddagent.DDAgentFeaturesDiscovery.discoverIfOutdated(DDAgentFeaturesDiscovery.java:131)
at datadog.communication.ddagent.DDAgentFeaturesDiscovery.discover(DDAgentFeaturesDiscovery.java:115)
at datadog.communication.ddagent.SharedCommunicationObjects.featuresDiscovery(SharedCommunicationObjects.java:91)
at datadog.trace.agent.common.writer.WriterFactory.createWriter(WriterFactory.java:80)
at datadog.trace.agent.common.writer.WriterFactory.createWriter(WriterFactory.java:41)
at datadog.trace.agent.core.CoreTracer.<init>(CoreTracer.java:619)
at datadog.trace.agent.core.CoreTracer.<init>(CoreTracer.java:120)
at datadog.trace.agent.core.CoreTracer$CoreTracerBuilder.build(CoreTracer.java:448)
at datadog.trace.agent.tooling.TracerInstaller.installGlobalTracer(TracerInstaller.java:26)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at datadog.trace.bootstrap.Agent.installDatadogTracer(Agent.java:582)
at datadog.trace.bootstrap.Agent.access$400(Agent.java:68)
at datadog.trace.bootstrap.Agent$InstallDatadogTracerCallback.execute(Agent.java:468)
at datadog.trace.bootstrap.Agent.start(Agent.java:300)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at datadog.trace.bootstrap.AgentBootstrap.agentmain(AgentBootstrap.java:71)
at datadog.trace.bootstrap.AgentBootstrap.premain(AgentBootstrap.java:55)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:560)
at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:572)
Caused by: java.lang.UnsatisfiedLinkError: could not get native definition for type `POINTER`, original error message follows: java.lang.UnsatisfiedLinkError: Unable to execute or load jffi binary stub from `/tmp`. Set `TMPDIR` or Java property `java.io.tmpdir` to a read/write path that is not mounted "noexec".
/jffi10473002527918557868.so: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /jffi10473002527918557868.so)
at com.kenai.jffi.internal.StubLoader.tempLoadError(StubLoader.java:563)
at com.kenai.jffi.internal.StubLoader.loadFromJar(StubLoader.java:462)
at com.kenai.jffi.internal.StubLoader.load(StubLoader.java:338)
at com.kenai.jffi.internal.StubLoader.<clinit>(StubLoader.java:626)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:534)
at java.base/java.lang.Class.forName(Class.java:513)
at com.kenai.jffi.Init.load(Init.java:68)
at com.kenai.jffi.Foreign$InstanceHolder.getInstanceHolder(Foreign.java:50)
at com.kenai.jffi.Foreign$InstanceHolder.<clinit>(Foreign.java:46)
at com.kenai.jffi.Foreign.getInstance(Foreign.java:104)
at com.kenai.jffi.Type$Builtin.lookupTypeInfo(Type.java:242)
at com.kenai.jffi.Type$Builtin.getTypeInfo(Type.java:237)
at com.kenai.jffi.Type.resolveSize(Type.java:155)
at com.kenai.jffi.Type.size(Type.java:138)
at jnr.ffi.provider.jffi.NativeRuntime$TypeDelegate.size(NativeRuntime.java:198)
at jnr.ffi.provider.AbstractRuntime.<init>(AbstractRuntime.java:48)
at jnr.ffi.provider.jffi.NativeRuntime.<init>(NativeRuntime.java:77)
at jnr.ffi.provider.jffi.NativeRuntime.<init>(NativeRuntime.java:49)
at jnr.ffi.provider.jffi.NativeRuntime$SingletonHolder.<clinit>(NativeRuntime.java:73)
at jnr.ffi.provider.jffi.NativeRuntime.getInstance(NativeRuntime.java:60)
at jnr.ffi.provider.jffi.Provider.<init>(Provider.java:29)
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.ReflectAccess.newInstance(ReflectAccess.java:128)
at java.base/jdk.internal.reflect.ReflectionFactory.newInstance(ReflectionFactory.java:304)
at java.base/java.lang.Class.newInstance(Class.java:725)
at jnr.ffi.provider.FFIProvider$SystemProviderSingletonHolder.getInstance(FFIProvider.java:68)
at jnr.ffi.provider.FFIProvider$SystemProviderSingletonHolder.<clinit>(FFIProvider.java:57)
at jnr.ffi.provider.FFIProvider.getSystemProvider(FFIProvider.java:35)
at jnr.ffi.LibraryLoader.create(LibraryLoader.java:89)
at jnr.unixsocket.Native.<clinit>(Native.java:76)
at jnr.unixsocket.UnixSocketChannel.<init>(UnixSocketChannel.java:101)
at jnr.unixsocket.UnixSocketChannel.open(UnixSocketChannel.java:60)
at datadog.common.socket.UnixDomainSocketFactory.createSocket(UnixDomainSocketFactory.java:27)
at datadog.okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:241)
at datadog.okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
at datadog.okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
at datadog.okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
at datadog.okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
at datadog.okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at datadog.okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at datadog.okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at datadog.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at datadog.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at datadog.okhttp3.RealCall.execute(RealCall.java:93)
at datadog.communication.ddagent.DDAgentFeaturesDiscovery.doDiscovery(DDAgentFeaturesDiscovery.java:146)
at datadog.communication.ddagent.DDAgentFeaturesDiscovery.discoverIfOutdated(DDAgentFeaturesDiscovery.java:131)
at datadog.communication.ddagent.DDAgentFeaturesDiscovery.discover(DDAgentFeaturesDiscovery.java:115)
at datadog.communication.ddagent.SharedCommunicationObjects.featuresDiscovery(SharedCommunicationObjects.java:91)
at datadog.trace.agent.common.writer.WriterFactory.createWriter(WriterFactory.java:80)
at datadog.trace.agent.common.writer.WriterFactory.createWriter(WriterFactory.java:41)
at datadog.trace.agent.core.CoreTracer.<init>(CoreTracer.java:619)
at datadog.trace.agent.core.CoreTracer.<init>(CoreTracer.java:120)
at datadog.trace.agent.core.CoreTracer$CoreTracerBuilder.build(CoreTracer.java:448)
at datadog.trace.agent.tooling.TracerInstaller.installGlobalTracer(TracerInstaller.java:26)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at datadog.trace.bootstrap.Agent.installDatadogTracer(Agent.java:582)
at datadog.trace.bootstrap.Agent.access$400(Agent.java:68)
at datadog.trace.bootstrap.Agent$InstallDatadogTracerCallback.execute(Agent.java:468)
at datadog.trace.bootstrap.Agent.start(Agent.java:300)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at datadog.trace.bootstrap.AgentBootstrap.agentmain(AgentBootstrap.java:71)
at datadog.trace.bootstrap.AgentBootstrap.premain(AgentBootstrap.java:55)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:560)
at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:572)
at com.kenai.jffi.Type$Builtin.lookupTypeInfo(Type.java:253)
at com.kenai.jffi.Type$Builtin.getTypeInfo(Type.java:237)
at com.kenai.jffi.Type.resolveSize(Type.java:155)
at com.kenai.jffi.Type.size(Type.java:138)
at jnr.ffi.provider.jffi.NativeRuntime$TypeDelegate.size(NativeRuntime.java:198)
at jnr.ffi.provider.AbstractRuntime.<init>(AbstractRuntime.java:48)
at jnr.ffi.provider.jffi.NativeRuntime.<init>(NativeRuntime.java:77)
at jnr.ffi.provider.jffi.NativeRuntime.<init>(NativeRuntime.java:49)
at jnr.ffi.provider.jffi.NativeRuntime$SingletonHolder.<clinit>(NativeRuntime.java:73)
at jnr.ffi.provider.jffi.NativeRuntime.getInstance(NativeRuntime.java:60)
at jnr.ffi.provider.jffi.Provider.<init>(Provider.java:29)
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.ReflectAccess.newInstance(ReflectAccess.java:128)
at java.base/jdk.internal.reflect.ReflectionFactory.newInstance(ReflectionFactory.java:304)
at java.base/java.lang.Class.newInstance(Class.java:725)
at jnr.ffi.provider.FFIProvider$SystemProviderSingletonHolder.getInstance(FFIProvider.java:68)
at jnr.ffi.provider.FFIProvider$SystemProviderSingletonHolder.<clinit>(FFIProvider.java:57)
at jnr.ffi.provider.FFIProvider.getSystemProvider(FFIProvider.java:35)
at jnr.ffi.LibraryLoader.create(LibraryLoader.java:89)
at jnr.unixsocket.Native.<clinit>(Native.java:76)
... 45 more
@PerfectSlayer
Contributor

PerfectSlayer commented Feb 8, 2024

Hi @dan-lind 👋

From the stack trace you shared, it looks like the native library in charge of Unix domain sockets (UDS) fails to load because the required glibc version is missing: /jffi10473002527918557868.so: /lib64/libc.so.6: version 'GLIBC_2.27' not found (required by /jffi10473002527918557868.so)
This might happen in environments such as Alpine Linux, which does not use glibc (it relies on musl libc instead), or in older environments with an outdated glibc version.

Can you tell us which environment your traced application is running on?
Also, can you share the output of ldd --version?

@dan-lind
Author

dan-lind commented Feb 8, 2024

Hi! We run our containers on amazoncorretto:17/21, which builds on amazonlinux:2, that in turn indeed builds on alpine:3.17.
But this was also the case before we started using arm64 nodes.

Here's the output. It looks like the agent expects a newer version than what we have?

$ ldd --version
ldd (GNU libc) 2.26
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
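
The comparison between the reported glibc version and the 2.27 named in the error message can be sketched in Java as follows. This is only an illustrative sketch: the class `GlibcVersionCheck` and its methods are hypothetical names (they are not part of dd-trace-java or jffi), and it assumes the first line of `ldd --version` ends with a `major.minor` version, as in the output above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of dd-trace-java or jffi. */
public class GlibcVersionCheck {
    // Matches a trailing "major.minor", e.g. the "2.26" in "ldd (GNU libc) 2.26".
    private static final Pattern VERSION = Pattern.compile("(\\d{1,4})\\.(\\d{1,4})\\s*$");

    /** Returns {major, minor} parsed from the first line of `ldd --version`, or null if absent. */
    public static int[] parseLddVersion(String firstLine) {
        Matcher m = VERSION.matcher(firstLine);
        if (!m.find()) {
            return null; // e.g. musl-based images do not print a glibc version here
        }
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }

    /** True if actual >= major.minor. */
    public static boolean isAtLeast(int[] actual, int major, int minor) {
        return actual[0] > major || (actual[0] == major && actual[1] >= minor);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Runs `ldd --version`; throws IOException if ldd is not installed.
        Process p = new ProcessBuilder("ldd", "--version").redirectErrorStream(true).start();
        String firstLine;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = r.readLine();
        }
        p.waitFor();

        int[] v = (firstLine == null) ? null : parseLddVersion(firstLine);
        if (v == null) {
            System.out.println("Could not determine a glibc version from ldd output");
            return;
        }
        // 2.27 is the version named in the UnsatisfiedLinkError above.
        System.out.printf("glibc %d.%d, satisfies GLIBC_2.27: %b%n", v[0], v[1], isAtLeast(v, 2, 27));
    }
}
```

Run inside the container image, this would report whether the installed glibc is new enough for the symbol version the native library asks for.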

@PerfectSlayer
Contributor

We run our containers on amazoncorretto:17/21, which builds on amazonlinux:2, that in turn indeed builds on alpine:3.17

It looks like amazonlinux:2 is built from scratch rather than from alpine, and it comes with glibc 2.26 whereas 2.27+ is required.

Moreover, I checked if there is a way to upgrade to a newer version but it seems you can't according to Amazon. Instead, you would have to use an image based on a more recent amazonlinux version.

I checked amazonlinux:2023 and it has a newer glibc version:

bash-5.2# ldd --version
ldd (GNU libc) 2.34
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

So I would recommend switching to amazoncorretto:17-al2023 instead of amazoncorretto:17 if you can.

@dan-lind
Author

dan-lind commented Feb 8, 2024

Oh, interesting! Is this version requirement something that has changed in a recent tracing agent version, or would it be related to us running arm nodes?

I will try with amazoncorretto:17-al2023 and get back

@PerfectSlayer
Contributor

This is something related to our jnr-unixsocket dependency. The last time it was updated was dcff20e (for version 1.27.0 of the tracer).

As for the different behavior depending on the architecture, I tried to load and link the library in both amd64 and arm64 Docker images:

Using amd64

docker run -it --rm --platform linux/amd64 -v ${PWD}/shared/jni:/jni amazoncorretto:17

bash-4.2# uname -a
Linux 9ba5e0f9b50f 6.6.12-linuxkit #1 SMP Fri Jan 19 08:53:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
bash-4.2# ldd libjffi-1.2.so
	libc.so.6 => /lib64/libc.so.6 (0x00007fffff00d000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fffffdda000)

Using arm64

docker run -it --rm --platform linux/arm64 -v ${PWD}/shared/jni:/jni amazoncorretto:17

bash-4.2# uname -a
Linux 75e832fe605c 6.6.12-linuxkit #1 SMP Fri Jan 19 08:53:17 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
bash-4.2# ldd libjffi-1.2.so
./libjffi-1.2.so: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by ./libjffi-1.2.so)
	linux-vdso.so.1 (0x0000ffffa177d000)
	libc.so.6 => /lib64/libc.so.6 (0x0000ffffa1588000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffa173f000)

You can see it fail on arm64 even though it succeeds on amd64 with the same glibc 2.26.
This might explain why your instances running on amd64 work fine while those on arm64 hit this issue.
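
To see which glibc version a given native library build asks for without running ldd, one rough approach is to scan the file for `GLIBC_x.y` version tags and keep the highest one. The sketch below is a heuristic over raw bytes, not an ELF parser, and the class name `RequiredGlibcScan` is hypothetical; tools such as `readelf -V` or `objdump -T` give the authoritative view of the version requirements.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: heuristic scan for GLIBC_x.y tags in a shared library. */
public class RequiredGlibcScan {
    // Version tags are stored as plain ASCII strings such as "GLIBC_2.27".
    private static final Pattern GLIBC_TAG = Pattern.compile("GLIBC_(\\d{1,4})\\.(\\d{1,4})");

    /** Returns the highest GLIBC_x.y tag found as {x, y}, or null if there is none. */
    public static int[] highestGlibcTag(byte[] libraryBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII tags survive intact.
        String text = new String(libraryBytes, StandardCharsets.ISO_8859_1);
        Matcher m = GLIBC_TAG.matcher(text);
        int[] best = null;
        while (m.find()) {
            int major = Integer.parseInt(m.group(1));
            int minor = Integer.parseInt(m.group(2));
            if (best == null || major > best[0] || (major == best[0] && minor > best[1])) {
                best = new int[] {major, minor};
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RequiredGlibcScan <path-to-shared-library>");
            return;
        }
        int[] best = highestGlibcTag(Files.readAllBytes(Paths.get(args[0])));
        if (best == null) {
            System.out.println("No GLIBC_x.y tags found");
        } else {
            System.out.printf("Highest glibc version tag: GLIBC_%d.%d%n", best[0], best[1]);
        }
    }
}
```

Pointing it at the amd64 and arm64 builds of the library (for example the libjffi-1.2.so files used above) and comparing the results against the image's `ldd --version` output should show which build needs a newer glibc than the image provides.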

@dan-lind
Author

dan-lind commented Feb 8, 2024

Thanks for a great explanation and for your help! I tried running with amazoncorretto:17-al2023, and the error went away

@PerfectSlayer
Contributor

You're welcome! We will also give a shot at fixing the issue upstream, in the jffi library.

@mcculls
Contributor

mcculls commented Feb 28, 2024

FYI: the JNR/JFFI project has released a fix for this, which will be included in the next release of the Java tracer, 1.31.0.

Contributor

github-actions bot commented Mar 4, 2024

🤖 This issue has been addressed in the latest release. See full details in the Release Notes.
