Cluster health check failure can get stuck #5729
Comments
A few more developers of ours saw this recently. It would be nice to improve this, because getting stuck here seems like a regression caused by the (otherwise great) health check functionality being added.
@andymartin-sch Thanks for the extra reports - agreed, this is not the experience we want here; I'm hoping to include at least some form of remediation in our release today. In the cases you've seen, has the error shown in the Tilt UI been similar to the one in the issue above? If so, do you know if anyone tried manually accessing the endpoint (e.g. with `curl`)?
Set a timeout on the request, which will both be passed on to the server and used to create a child context with a deadline. A fixed value of 10s is used here for now. Additionally, max retries is set to 1 vs. the default of up to 10. See #5729.
Yeah, pretty much the exact same.
I don't think so, but we can do that going forward and will let you know - thanks!
Ah, one developer just said:
I'm not 100% sure this is the root cause of the cluster health monitoring getting stuck, but it's definitely not totally correct in its current state. The `ConnectionManager` is a fancy wrapper over a `sync.Map` and expects the `connection` objects to be immutable: we replace them entirely on change. Within the context of the reconciliation loop, this is good. However, the health status monitor also needs to store the result of the health checks, and was doing so on the same object, which resulted in a race condition. Now, the cluster health state is stored independently and used within reconciliation. The main `Reconcile()` loop still replaces objects in their entirety in the `ConnectionManager`, but as it's now the only writer, there's no potential for stale data or races. See #5729.
A couple of improvements/fixes went into v0.29.0 (released May 6) - please let me know if you still see the issue after upgrading!
Expected Behavior
Current Behavior
Steps to Reproduce
This is a recent feature, and we've only had this reported once via Slack. That report showed an error on the `/livez` check, but the user said the same request was succeeding via `curl` at that point. They'd mentioned getting into the state after having put their laptop to sleep for the day and returning the next morning.