
Cluster health check failure can get stuck #5729

Open · milas opened this issue Apr 25, 2022 · 5 comments
Assignees: milas
Labels: bug (Something isn't working)

milas (Contributor) commented Apr 25, 2022

Expected Behavior

  • If the cluster becomes unhealthy and then healthy again, Tilt reflects that both in the cluster pop-up in the UI and by "unholding" any resources waiting on the cluster

Current Behavior

  • It's possible for the health check to get stuck in a failing state

Steps to Reproduce

This is a recent feature and we've only had this reported once via Slack. The UI was showing a failure on the /livez check, but the user reported that the same request was succeeding via curl at that point.

(Screenshot: Tilt UI cluster pop-up showing the failing /livez check, 2022-04-22.)

They mentioned getting into this state after putting their laptop to sleep for the day and returning the next morning.

@milas milas added the bug Something isn't working label Apr 25, 2022
@milas milas self-assigned this Apr 25, 2022
andymartin-sch (Contributor) commented May 6, 2022

A few more of our developers saw this recently. It would be nice to improve this, since getting stuck here feels like a regression introduced by the (otherwise great) health check functionality.

milas (Contributor, Author) commented May 6, 2022

@andymartin-sch Thanks for the extra reports - agreed this is not the experience we want here; I'm hoping to include at least some form of remediation in our release today.

In the cases you've seen, has the error shown in the Tilt UI been similar to that in the issue above? If so, do you know if anyone tried manually accessing the endpoint (e.g. curl https://..../livez) and whether that was successful?

milas added a commit that referenced this issue May 6, 2022
Set a timeout on the request, which is both passed on to the server
and used to create a child context with a deadline. A fixed value of
10 seconds is used here for now.

Additionally, max retries is set to 1 vs. the default of up to 10.

See #5729.
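For context, the guard this commit describes looks roughly like the sketch below. This is a minimal illustration using plain `net/http` rather than Tilt's actual client-go request code; the `/livez` path and the 10-second value come from the commit message, while the function name, base URL, and use of `http.DefaultClient` are assumptions for the example.

```go
// Illustrative sketch only (not Tilt's code): a liveness probe with a hard
// deadline, so a hung connection fails fast instead of wedging the monitor.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func checkLivez(ctx context.Context, baseURL string) error {
	// Child context with a deadline: if the apiserver (or a dead TCP
	// connection left over after a laptop sleep) never responds, the
	// request is aborted after 10s instead of blocking indefinitely.
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/livez", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // includes context.DeadlineExceeded on timeout
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("/livez returned %d: %s", resp.StatusCode, body)
	}
	return nil
}

func main() {
	// Hypothetical endpoint, for illustration only.
	if err := checkLivez(context.Background(), "https://127.0.0.1:6443"); err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("ok")
}
```

Capping retries at 1 complements the deadline: a transient failure is surfaced quickly to the monitor (which polls again anyway) rather than being masked by up to 10 internal retries.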
andymartin-sch (Contributor) commented May 6, 2022

> In the cases you've seen, has the error shown in the Tilt UI been similar to that in the issue above?

Yeah, pretty much the exact same.

> If so, do you know if anyone tried manually accessing the endpoint (e.g. curl https://..../livez) and whether that was successful?

I don't think so, but we can do that going forward and will let you know - thanks!!

andymartin-sch (Contributor)

Ah, one developer just said:

> When I hit this, I went to that endpoint in my browser and it returned "ok"

milas added a commit that referenced this issue May 6, 2022
I'm not 100% sure this is the root cause of the cluster health
monitoring getting stuck, but it's definitely not totally correct
in its current state.

The `ConnectionManager` is a fancy wrapper over a `sync.Map` and
expects the `connection` objects to be immutable — we replace them
entirely on change. Within the context of the reconciliation loop,
this is good. However, the health status monitor also needs to store
the result of the health checks and was doing so on the same object,
which resulted in a race condition.

Now, the cluster health state is stored independently and used within
reconciliation. The main `Reconcile()` loop still replaces objects
in their entirety in the `ConnectionManager`, but as it's the only
writer now, there's no potential for stale data/races.

See #5729.
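The immutable-replace pattern this commit message describes looks roughly like the sketch below. This is a hedged illustration, not Tilt's actual `ConnectionManager`; the type, field, and method names are hypothetical. The key invariant: stored values are never mutated, only replaced, and that is only race-free while there is a single writer.

```go
// Sketch of the immutable-replace pattern over sync.Map described above.
// Type, field, and method names are hypothetical, not Tilt's real code.
package main

import (
	"fmt"
	"sync"
)

// connection is treated as immutable once stored: to change it, store a copy.
type connection struct {
	name      string
	healthErr string // "" means healthy
}

// connectionManager stands in for the "fancy wrapper over a sync.Map".
type connectionManager struct {
	m sync.Map // map[string]*connection
}

func (cm *connectionManager) store(c *connection) { cm.m.Store(c.name, c) }

func (cm *connectionManager) load(name string) (*connection, bool) {
	v, ok := cm.m.Load(name)
	if !ok {
		return nil, false
	}
	return v.(*connection), true
}

// setHealth replaces the stored object with a modified copy instead of
// mutating it in place. This copy-then-Store sequence is safe only with a
// single writer (the Reconcile loop); a second writer, like the health
// monitor writing to the same object before the fix, races with it and
// can lose updates or expose half-applied state to readers.
func (cm *connectionManager) setHealth(name, healthErr string) {
	old, ok := cm.load(name)
	if !ok {
		return
	}
	updated := *old // shallow copy; fields must themselves be immutable
	updated.healthErr = healthErr
	cm.store(&updated)
}

func main() {
	cm := &connectionManager{}
	cm.store(&connection{name: "default"})
	cm.setHealth("default", "context deadline exceeded")
	if c, ok := cm.load("default"); ok {
		fmt.Println(c.name, "healthErr:", c.healthErr)
	}
}
```

Storing the health state outside this map, with `Reconcile()` as the map's sole writer, removes the second writer entirely, which is exactly the fix described above.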
milas (Contributor, Author) commented May 9, 2022

A couple of improvements/fixes went into v0.29.0 (released May 6) - please let me know if you still see the issue after upgrading!
