
[SDK] Fix crash in PeriodicExportingMetricReader. #2983

Merged 16 commits into open-telemetry:main from fixes_2982 on Jul 19, 2024

Conversation

@owent (Member) commented Jul 1, 2024

Fixes #2982

Changes

  • Keep PeriodicExportingMetricReader and a local variable alive while calling std::async.

@lalitb @marcalff Just wondering why we use std::async(std::launch::async, ...) here? It will create a lot of threads when timeouts occur frequently, which uses a lot of resources and slows down the whole application.
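For context, here is a minimal sketch of the lifetime-safe pattern the fix moves toward (stand-in names such as Reader, ExportState, and DoCollect; not the SDK's actual code): the async task captures a shared_ptr to the reader plus heap-allocated state, so a timeout can never leave it touching a dead stack frame or a destroyed reader.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <memory>

struct ExportState
{
  std::atomic<bool> cancel_export_for_timeout{false};
};

class Reader : public std::enable_shared_from_this<Reader>
{
public:
  bool CollectOnce(std::chrono::milliseconds timeout)
  {
    auto keep_lifetime = shared_from_this();               // reader outlives the task
    auto state         = std::make_shared<ExportState>();  // heap state, not a stack local
    auto sender        = std::async(std::launch::async, [keep_lifetime, state] {
      keep_lifetime->DoCollect(*state);  // safe even if the caller's frame is gone
    });
    if (sender.wait_for(timeout) == std::future_status::timeout)
    {
      // The task observes the flag through the shared state, never through a
      // reference to a local. Note: per the standard, `sender`'s destructor
      // blocks until the task finishes; the shared_ptr captures guard the
      // cases where the task genuinely outlives this frame.
      state->cancel_export_for_timeout = true;
      return false;
    }
    return true;
  }

private:
  void DoCollect(ExportState & /*state*/) {}
};
```

shared_from_this() is only valid once the object is already owned by a std::shared_ptr, which is how metric readers are typically handed to the meter provider.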

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

@owent requested a review from a team on July 1, 2024 at 07:51
codecov bot commented Jul 1, 2024

Codecov Report

Attention: Patch coverage is 60.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 87.60%. Comparing base (497eaf4) to head (98905fd).
Report is 107 commits behind head on main.


```diff
@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
+ Coverage   87.12%   87.60%   +0.48%
==========================================
  Files         200      190      -10
  Lines        6109     5869     -240
==========================================
- Hits         5322     5141     -181
+ Misses        787      728      -59
```
| Files | Coverage Δ |
| --- | --- |
| ...metrics/export/periodic_exporting_metric_reader.cc | 78.44% <60.00%> (-0.91%) ⬇️ |

... and 120 files with indirect coverage changes

Resolved review threads (outdated) on sdk/src/metrics/export/periodic_exporting_metric_reader.cc:
```cpp
OTEL_INTERNAL_LOG_ERROR(
    "[Periodic Exporting Metric Reader] Collect took longer configured time: "
    << static_cast<PeriodicExportingMetricReader *>(keep_lifetime.get())
           ->export_timeout_millis_.count()
```
Member commented:

nit - same as earlier "do we need the typecast ?"

@owent (Member, Author) commented:

keep_lifetime is a std::shared_ptr<MetricReader>, but we need a PeriodicExportingMetricReader * here.
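As an aside, with stand-in type definitions, std::static_pointer_cast expresses the same downcast while preserving shared ownership, an alternative to static_cast on .get():

```cpp
#include <chrono>
#include <memory>

struct MetricReader
{
  virtual ~MetricReader() = default;
};

struct PeriodicExportingMetricReader : MetricReader
{
  std::chrono::milliseconds export_timeout_millis_{30000};
};

void LogTimeout(const std::shared_ptr<MetricReader> &keep_lifetime)
{
  // The cast result is itself a shared_ptr, so it also extends the lifetime.
  auto self = std::static_pointer_cast<PeriodicExportingMetricReader>(keep_lifetime);
  (void)self->export_timeout_millis_.count();
}
```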

Member commented:

Can we instead make PeriodicExportingMetricReader inherit from std::enable_shared_from_this<PeriodicExportingMetricReader> (and remove it from MetricReader), if that allows avoiding the typecast:

```cpp
class PeriodicExportingMetricReader
    : public MetricReader,
      public std::enable_shared_from_this<PeriodicExportingMetricReader>
{
  // ... rest of the class ...
};
```
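A rough sketch of how that would remove the cast (simplified members, not the actual SDK signatures): shared_from_this() already yields the derived type, so the async task can capture it directly.

```cpp
#include <chrono>
#include <future>
#include <memory>

struct MetricReader
{
  virtual ~MetricReader() = default;
};

class PeriodicExportingMetricReader
    : public MetricReader,
      public std::enable_shared_from_this<PeriodicExportingMetricReader>
{
public:
  void CollectAndExportOnce()
  {
    // Valid only if the reader is already owned by a std::shared_ptr.
    auto keep_lifetime = shared_from_this();
    auto task = std::async(std::launch::async, [keep_lifetime] {
      // No typecast needed: keep_lifetime is a
      // shared_ptr<PeriodicExportingMetricReader>.
      (void)keep_lifetime->export_timeout_millis_.count();
    });
    task.wait();
  }

private:
  std::chrono::milliseconds export_timeout_millis_{30000};
};
```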

@owent (Member, Author) commented:

Good point, that is a better solution, at least as a temporary fix.

Member commented:

Ok, so to clarify expectations: is the public std::enable_shared_from_this<PeriodicExportingMetricReader> meant to be temporary, until a better solution is found?

I do agree it fixes the immediate crash, so the code no longer keeps a pointer to a local variable in dead memory.

Member commented:

Approving, assuming this is a temporary fix.

Resolved review thread (outdated) on sdk/src/metrics/export/periodic_exporting_metric_reader.cc.
@lalitb (Member) commented Jul 5, 2024

> Just wondering why we use std::async(std::launch::async, ...) here? It will create a lot of threads when timeouts occur frequently, which uses a lot of resources and slows down the whole application.

@owent Can you please elaborate on the issue? As I understand it, we won't start a new collect if the previous one is still ongoing, so a timeout shouldn't create a new thread.

@lalitb (Member) commented Jul 5, 2024

Thanks for the fix. Some nit comments, but in general it looks good.

@owent (Member, Author) commented Jul 5, 2024

> Just wondering why we use std::async(std::launch::async, ...) here? It will create a lot of threads when timeouts occur frequently, which uses a lot of resources and slows down the whole application.
>
> @owent Can you please elaborate on the issue? As I understand it, we won't start a new collect if the previous one is still ongoing, so a timeout shouldn't create a new thread.

The future_receive in this function is dropped when PeriodicExportingMetricReader::CollectAndExportOnce times out, but the export may still be running. In my understanding, PeriodicExportingMetricReader::DoBackgroundWork will call CollectAndExportOnce again soon after the previous timeout.

The coredump of the crashed process in my application shows more than 290 threads.
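To illustrate the build-up with a hypothetical standalone repro (not the SDK code): each tick gives up after a short timeout and parks the future, so every slow export keeps its own OS thread alive.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

int main()
{
  // Parked futures: dropping a std::async future would block in its
  // destructor, so callers that want to move on must stash it somewhere.
  std::vector<std::future<void>> in_flight;
  for (int tick = 0; tick < 10; ++tick)
  {
    auto fut = std::async(std::launch::async, [] {
      std::this_thread::sleep_for(std::chrono::seconds(5));  // slow "export"
    });
    if (fut.wait_for(std::chrono::milliseconds(100)) == std::future_status::timeout)
    {
      in_flight.push_back(std::move(fut));  // give up waiting; thread lives on
    }
  }
  // Here up to 10 "exports" are still running concurrently, one thread each.
}
```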

@lalitb (Member) left a review:

Thanks for the fix. LGTM with nit comments.

@marcalff (Member) left a review:

Thanks for the PR.

See the comments for a proposed different implementation.

Marking as "request changes" for now to discuss it; I will revise the review if it turns out not to be practical.

Resolved review threads (outdated) on sdk/include/opentelemetry/sdk/metrics/metric_reader.h and sdk/src/metrics/export/periodic_exporting_metric_reader.cc.
@lalitb (Member) commented Jul 5, 2024

To add, my approval assumes this PR fixes the crash by keeping PeriodicExportingMetricReader alive during async ops, and thread explosion during timeouts will be tracked/addressed separately.

@marcalff (Member) left a review:

LGTM, thanks for the fix.

Assuming this is a temporary solution just to prevent the crash.

@marcalff (Member) left a review:

CI for this PR is consistently stuck when executing the "exporter proto" CI unit tests.

All tests are executed, but the workflow does not stop.

I suspect that some thread is still blocked, either in the unit tests or in the exporter itself, which prevents the process from stopping cleanly.

This happens only for this PR, so it does not look like a general issue.

Please investigate.

Even if some code needs to wait on something that may never happen, it is desirable to wait with a timeout and retry, printing something to the internal log in each loop, so the code is easier to troubleshoot if this happens again.
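A minimal sketch of that suggestion, with illustrative names standing in for the SDK's internal log macros: bound each wait and emit a log line per retry, so a stuck shutdown leaves a visible trail.

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

// Waits for `done` to become true, logging once per second instead of
// blocking silently forever.
void WaitForShutdown(std::condition_variable &cv, std::mutex &mtx, const bool &done)
{
  std::unique_lock<std::mutex> lock(mtx);
  while (!done)
  {
    if (cv.wait_for(lock, std::chrono::seconds(1)) == std::cv_status::timeout)
    {
      // Stand-in for OTEL_INTERNAL_LOG_DEBUG.
      std::cerr << "[periodic metric reader] still waiting for the export task...\n";
    }
  }
}
```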

@owent (Member, Author) commented Jul 15, 2024

> CI for this PR is consistently stuck when executing the "exporter proto" CI unit tests.
>
> All tests are executed, but the workflow does not stop.
>
> I suspect that some thread is still blocked, either in the unit tests or in the exporter itself, which prevents the process from stopping cleanly.
>
> This happens only for this PR, so it does not look like a general issue.
>
> Please investigate.
>
> Even if some code needs to wait on something that may never happen, it is desirable to wait with a timeout and retry, printing something to the internal log in each loop, so the code is easier to troubleshoot if this happens again.

Sorry, I'm too busy these days. I will continue this some time later.

@owent (Member, Author) commented Jul 16, 2024

> Thanks for the PR.
>
> See the comments for a proposed different implementation.
>
> Marking as "request changes" for now to discuss it; I will revise the review if it turns out not to be practical.

I used another implementation to avoid some bugs in the STL. Please review it again.

@owent requested a review from @marcalff on July 17, 2024 at 11:21
@marcalff changed the title from "Fix crash in PeriodicExportingMetricReader." to "[SDK] Fix crash in PeriodicExportingMetricReader." on Jul 19, 2024
@marcalff (Member) left a review:

LGTM, thanks for the fix.

Thanks also for the extra cleanup for include-what-you-use.

@marcalff merged commit 4520aa5 into open-telemetry:main on Jul 19, 2024
52 checks passed
@owent deleted the fixes_2982 branch on August 2, 2024 at 03:55
@owent mentioned this pull request on Aug 10, 2024
Successfully merging this pull request may close these issues.

[EXPORTER] Spurious crash when using PeriodicExportingMetricReader and timeout.