
Cluster health check failure can get stuck #5729

Open · milas opened this issue Apr 25, 2022 · 5 comments
Assignees: milas
Labels: bug (Something isn't working)

milas (Contributor) commented Apr 25, 2022

Expected Behavior

  • If the cluster becomes unhealthy and then healthy again, Tilt reflects that both in the cluster pop-up in the UI and by "unholding" any resources waiting on the cluster

Current Behavior

  • It's possible for the health check to get stuck in a failing state

Steps to Reproduce

This is a recent feature and we've only had this reported once via Slack. The UI was showing a failure on the /livez check, but the user reported that the same request was succeeding via curl at that point.

(Screenshot: Tilt UI cluster pop-up showing the failing /livez check, 2022-04-22.)

They mentioned getting into this state after putting their laptop to sleep for the day and returning the next morning.

@milas milas added the bug Something isn't working label Apr 25, 2022
@milas milas self-assigned this Apr 25, 2022
andymartin-sch (Contributor) commented May 6, 2022

A few more of our developers saw this recently. It would be nice to improve this, since getting stuck here feels like a regression introduced by the (otherwise great) health check functionality.

milas (Contributor, Author) commented May 6, 2022

@andymartin-sch Thanks for the extra reports - agreed this is not the experience we want here; I'm hoping to include at least some form of remediation in our release today.

In the cases you've seen, has the error shown in the Tilt UI been similar to that in the issue above? If so, do you know if anyone tried manually accessing the endpoint (e.g. curl https://..../livez) and whether that was successful?

milas added a commit that referenced this issue May 6, 2022
Set a timeout on the request, which is both passed on to the server
and used to create a child context with a deadline. A fixed value of
10 seconds is used here for now.

Additionally, max retries is set to 1 vs. the default of up to 10.

See #5729.
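For context, the guard this commit describes looks roughly like the sketch below. This is a minimal illustration using plain `net/http` rather than Tilt's actual client-go request code; the `/livez` path and the 10-second value come from the commit message, while the function name, base URL, and use of `http.DefaultClient` are assumptions for the example.

```go
// Illustrative sketch only (not Tilt's code): a liveness probe with a hard
// deadline, so a hung connection fails fast instead of wedging the monitor.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func checkLivez(ctx context.Context, baseURL string) error {
	// Child context with a deadline: if the apiserver (or a dead TCP
	// connection left over after a laptop sleep) never responds, the
	// request is aborted after 10s instead of blocking indefinitely.
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/livez", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // includes context.DeadlineExceeded on timeout
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("/livez returned %d: %s", resp.StatusCode, body)
	}
	return nil
}

func main() {
	// Hypothetical endpoint, for illustration only.
	if err := checkLivez(context.Background(), "https://127.0.0.1:6443"); err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("ok")
}
```

Capping retries at 1 complements the deadline: a transient failure is surfaced quickly to the monitor (which polls again anyway) rather than being masked by up to 10 internal retries.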
andymartin-sch (Contributor) commented May 6, 2022

> In the cases you've seen, has the error shown in the Tilt UI been similar to that in the issue above?

Yeah, pretty much the exact same.

> If so, do you know if anyone tried manually accessing the endpoint (e.g. curl https://..../livez) and whether that was successful?

I don't think so, but we can do that going forward and will let you know - thanks!!

andymartin-sch (Contributor)

Ah, one developer just said:

> When I hit this, I went to that endpoint in my browser and it returned "ok"

milas added a commit that referenced this issue May 6, 2022
I'm not 100% sure this is the root cause of the cluster health
monitoring getting stuck, but it's definitely not totally correct
in its current state.

The `ConnectionManager` is a fancy wrapper over a `sync.Map` and
expects the `connection` objects to be immutable — we replace them
entirely on change. Within the context of the reconciliation loop,
this is good. However, the health status monitor also needs to store
the result of the health checks and was doing so on the same object,
which resulted in a race condition.

Now, the cluster health state is stored independently and used within
reconciliation. The main `Reconcile()` loop still replaces objects
in their entirety in the `ConnectionManager`, but as it's the only
writer now, there's no potential for stale data/races.

See #5729.
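The immutable-replace pattern this commit message describes looks roughly like the sketch below. This is a hedged illustration, not Tilt's actual `ConnectionManager`; the type, field, and method names are hypothetical. The key invariant: stored values are never mutated, only replaced, and that is only race-free while there is a single writer.

```go
// Sketch of the immutable-replace pattern over sync.Map described above.
// Type, field, and method names are hypothetical, not Tilt's real code.
package main

import (
	"fmt"
	"sync"
)

// connection is treated as immutable once stored: to change it, store a copy.
type connection struct {
	name      string
	healthErr string // "" means healthy
}

// connectionManager stands in for the "fancy wrapper over a sync.Map".
type connectionManager struct {
	m sync.Map // map[string]*connection
}

func (cm *connectionManager) store(c *connection) { cm.m.Store(c.name, c) }

func (cm *connectionManager) load(name string) (*connection, bool) {
	v, ok := cm.m.Load(name)
	if !ok {
		return nil, false
	}
	return v.(*connection), true
}

// setHealth replaces the stored object with a modified copy instead of
// mutating it in place. This copy-then-Store sequence is safe only with a
// single writer (the Reconcile loop); a second writer, like the health
// monitor writing to the same object before the fix, races with it and
// can lose updates or expose half-applied state to readers.
func (cm *connectionManager) setHealth(name, healthErr string) {
	old, ok := cm.load(name)
	if !ok {
		return
	}
	updated := *old // shallow copy; fields must themselves be immutable
	updated.healthErr = healthErr
	cm.store(&updated)
}

func main() {
	cm := &connectionManager{}
	cm.store(&connection{name: "default"})
	cm.setHealth("default", "context deadline exceeded")
	if c, ok := cm.load("default"); ok {
		fmt.Println(c.name, "healthErr:", c.healthErr)
	}
}
```

Storing the health state outside this map, with `Reconcile()` as the map's sole writer, removes the second writer entirely, which is exactly the fix described above.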
milas (Contributor, Author) commented May 9, 2022

A couple of improvements/fixes went into v0.29.0 (released May 6) - please let me know if you still see the issue after upgrading!
