Crash in OLTP HTTP export #2713

Closed
VivekSubr opened this issue Jun 21, 2024 · 13 comments
Labels
bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@VivekSubr

VivekSubr commented Jun 21, 2024

Describe your environment
Built and running on Linux, configured with:

cmake .. -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_STANDARD=17 \
         -DWITH_STL=CXX17 -DBUILD_SHARED_LIBS=ON -DWITH_OTLP_HTTP=ON -DWITH_OTLP_GRPC=ON -DBUILD_TESTING=OFF

Protobuf version installed - 3.17.3

Steps to reproduce
Don't have exact steps to reproduce, happens intermittently.

Backtrace

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000001b83c7a7d in ?? ()
[Current thread is 1 (Thread 0x7810b784da00 (LWP 23))]
#0  0x00000001b83c7a7d in ?? ()
#1  0x00007ffc046f68b0 in ?? ()
#2  0x00007810b87a237d in google::protobuf::RepeatedPtrField<opentelemetry::proto::trace::v1::ResourceSpans>::~RepeatedPtrField() ()
   from /lib64/libopentelemetry_exporter_otlp_grpc.so
#3  0x00007810b87a174a in opentelemetry::proto::collector::trace::v1::ExportTraceServiceRequest::~ExportTraceServiceRequest() ()
   from /lib64/libopentelemetry_exporter_otlp_grpc.so
#4  0x00007810b87721ff in opentelemetry::v1::exporter::otlp::OtlpHttpExporter::Export(opentelemetry::v1::nostd::span<std::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::default_delete<opentelemetry::v1::sdk::trace::Recordable> >, 18446744073709551615ul> const&) () from /lib64/libopentelemetry_exporter_otlp_http.so
#5  0x00007810ba07113b in opentelemetry::v1::sdk::trace::SimpleSpanProcessor::OnEnd (this=0x6299ed3d66a0, span=...)
    at /usr/include/opentelemetry/sdk/trace/simple_processor.h:51
#6  0x00007810b88cd9ba in opentelemetry::v1::sdk::trace::MultiSpanProcessor::OnEnd(std::unique_ptr<opentelemetry::v1::sdk::trace::Recordable, std::default_delete<opentelemetry::v1::sdk::trace::Recordable> >&&) () from /lib64/libopentelemetry_trace.so
#7  0x00007810b88d6654 in opentelemetry::v1::sdk::trace::Span::End(opentelemetry::v1::trace::EndSpanOptions const&) ()
   from /lib64/libopentelemetry_trace.so

Additional Info

The crash appears to occur on destruction of the arena object in
https://github.com/open-telemetry/opentelemetry-cpp/blob/main/exporters/otlp/src/otlp_http_exporter.cc#L102

It's not apparent why this might happen... any help will be appreciated.

@VivekSubr VivekSubr added the bug Something isn't working label Jun 21, 2024
@github-actions github-actions bot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 21, 2024
@owent
Member

owent commented Jun 25, 2024

Which version of otel-cpp are you using, and have you enabled async exporting?
There was a thread-safety problem in the OTLP HTTP exporter before 1.10.0 when otel-cpp was built without async export (without -DENABLE_ASYNC_EXPORT or WITH_ASYNC_EXPORT_PREVIEW).

@VivekSubr
Author

@owent - 1.15, haven't enabled async exporting... is async export still in preview in 1.15?

@owent
Member

owent commented Jun 26, 2024

> @owent - 1.15, haven't enabled async exporting... is async export still in preview in 1.15?

gRPC async exporting is still in preview.

@owent
Member

owent commented Jun 29, 2024

Does this problem happen when shutting down? Do you compile both otel-cpp and proto as dynamic libraries? Just wondering why the destructor of RepeatedPtrField<opentelemetry::proto::trace::v1::ResourceSpans> is in the gRPC exporter.

@VivekSubr
Author

It's HTTP exporter, and proto is from yum install.

We're investigating if it's memory corruption from somewhere else.

@owent
Member

owent commented Jun 29, 2024

> It's HTTP exporter, and proto is from yum install.
>
> We're investigating if it's memory corruption from somewhere else.

Do you mean protobuf? I reviewed the code and, as far as I understand, the messages and the arena never leave the scope of OtlpHttpExporter::Export.

@owent
Member

owent commented Jul 1, 2024

I found another crash in #2982 that occurs when using metrics and a timeout happens. Not sure if it relates to this one.

@marcalff marcalff added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 1, 2024
@msiddhu
Contributor

msiddhu commented Aug 8, 2024

Is there any solution for this? I'm also facing this SIGSEGV.

Using OTEL v1.16.1, OTLP HTTP Exporter, Batch Processor.

@lalitb
Member

lalitb commented Aug 8, 2024

@msiddhu Are you getting this crash during application shutdown? If yes, does calling ForceFlush() before shutdown help?

@marcalff
Member

marcalff commented Aug 8, 2024

@msiddhu Thanks for the separate confirmation.

Do you have more details, like a call stack?

Saying "it crashes for me too" gives us next to nothing to work with.

@marcalff
Member

marcalff commented Aug 8, 2024

The part which is really dubious is:

  • a bug report about OTLP HTTP
  • a call stack pointing to libopentelemetry_exporter_otlp_grpc.so

Is this about OTLP HTTP or OTLP GRPC?
Was the application built with OTLP HTTP alone, OTLP GRPC alone, or both?

@owent
Member

owent commented Aug 10, 2024

@michalpristas Could you try the main branch or #2983? Some STL implementations of std::async may have bugs and occasionally crash; this PR replaces these APIs with more stable ones.
We haven't seen any more core dumps in our system for several days since applying this patch.
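
For readers unfamiliar with the idea, here is a minimal sketch of running work on an explicitly managed std::thread with a std::promise instead of going through std::async. It is not taken from #2983 and is not opentelemetry-cpp code; the function name ExportWithExplicitThread and the doubled payload are hypothetical placeholders for a real export call.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical stand-in for one export call; not an opentelemetry-cpp API.
int ExportWithExplicitThread(int payload)
{
  std::promise<int> done;
  std::future<int> result = done.get_future();

  // The worker is an ordinary std::thread whose lifetime we control,
  // rather than a task handed to std::async.
  std::thread worker([&done, payload] { done.set_value(payload * 2); });

  int value = result.get();  // blocks until the worker publishes its result
  worker.join();             // the thread never outlives this function
  assert(!worker.joinable());
  return value;
}
```

Joining the thread before returning keeps the promise alive for as long as the worker can touch it, which is the kind of lifetime guarantee that is harder to reason about when the runtime decides how a std::async task is scheduled.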

@VivekSubr
Author

We have not observed this crash after removing the patch mentioned in #2382.

The build failure ultimately boiled down to someone having done #define U in another library.
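
As an illustration of how a leaked one-letter macro can break template code, and one common way to shield against it, here is a small sketch. The macro value, Pair, MakePair, Demo and MacroRestored are all hypothetical; the thread does not say which library defined U or which header it collided with. The push_macro/pop_macro pragmas are compiler extensions supported by GCC, Clang and MSVC.

```cpp
#include <cassert>
#include <cstring>

// Assumption: some third-party header leaks a short macro like this one.
#define U "leaked"

// Save the macro and hide it while declaring code that uses U as an
// identifier. Without the #undef, `typename U` below would expand to
// `typename "leaked"` and fail to compile.
#pragma push_macro("U")
#undef U

template <typename T, typename U>
struct Pair
{
  T first;
  U second;
};

template <typename T, typename U>
Pair<T, U> MakePair(T first, U second)
{
  return Pair<T, U>{first, second};
}

#pragma pop_macro("U")

// After pop_macro the original definition is visible again.
bool MacroRestored() { return std::strcmp(U, "leaked") == 0; }

void Demo()
{
  Pair<int, double> p = MakePair(1, 2.5);
  assert(p.first == 1);
  assert(p.second == 2.5);
}
```

Wrapping the affected include or declaration this way leaves the other library's macro intact for its own code, which is usually less invasive than patching either library.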


5 participants