Stop Distributor on fatal error #1887
* occasionally otel receivers will report a fatal error, and expected behavior is that the Host is stopped
* match this behavior by stopping the receiver shim service and letting the distributor stop itself
Thanks for the fix. Can we get a changelog entry?
If you think we need a test added to prevent regression, let me know
If you can think of a meaningful way to test this, we would appreciate it.
modules/distributor/receiver/shim.go
@@ -279,6 +279,7 @@ func (r *receiversShim) ConsumeTraces(ctx context.Context, td ptrace.Traces) err
 // ReportFatalError implements component.Host
 func (r *receiversShim) ReportFatalError(err error) {
 	_ = level.Error(log.Logger).Log("msg", "fatal error reported", "err", err)
+	r.StopAsync()
Have you tested this? I honestly don't know the services code very well. This may simply stop the receiver without causing the distributor to exit.
Have not yet; writing an otel receiver mock to do so.
(It doesn't work; working on that.)
Got this working by switching the shim to be a basic service.
Will update this PR with some testing results on Monday (tomorrow is a federal holiday and I'm off).
Some context before I get into the testing:
After deleting this subscription observed
And saw pods restart after the logged fatal error
Oh I need to amend to add a changelog entry with the bugfix
824de70 to 855104d
* can't use idle service to do this, I think
* also add changelog entry
* fix my bad copy-paste
855104d to 6310987
What this PR does:
If you think we need a test added to prevent regression, let me know
Which issue(s) this PR fixes:
No Issue
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]