Provide different types of health check #35
As per the #29 proposal, would you map different types of health checks to Could be
@rsvoboda I think your later proposal
As discussed in the August 1st call, this issue should be postponed to 1.1, as 1.0 should focus on delivering a single /health endpoint.
It's probably time to revisit this issue. Having two endpoints is crucial in a Kubernetes environment, as liveness and readiness checks are different and serve different purposes. They will likely use different sets of procedures.
+1 to revisiting. In terms of endpoints, I like the idea of Although maybe it needs to go further and retain the existing
I think I prefer the Now the question is: what should /health return? A composition of both readiness and liveness checks?
Maybe
One concern with It's not as clean, but it might be better to treat existing Thoughts?
What do you mean by you have a problem? I don't see any problems yet. I assume that "readiness" checks are very cheap, as they are expected to be called very often during the start-up phase. I also assume that all "readiness" checks are also very simple "liveness" checks, so it shouldn't be a problem to include them in "liveness" checks. Do you have any examples where a "readiness" check wouldn't make sense as a "liveness" check?
If that were true, then you wouldn't need to distinguish between liveness and readiness at all. My favorite example to illustrate the difference between liveness and readiness is a vital external service, e.g. a database. If the application can't reach its database, it is definitely not ready (= can't accept requests), but may very well be alive (= doesn't need to be restarted). If this readiness check were used as a liveness check as well, the application would get needlessly restarted. Another possibility is that becoming ready may take a while (populating internal caches or whatnot) -- again, the application is alive but not ready, and if readiness checks are used as liveness checks, the application will be restarted as "not alive", actually preventing it from ever reaching the ready state. Does that make sense?
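The database example in this comment can be sketched in a few lines of Java. This is an illustrative stand-in only (none of these class or method names come from any MP Health API): liveness and readiness are two independent signals, and a lost database connection flips only the latter.

```java
// Illustrative sketch: liveness vs. readiness as two independent signals.
// A lost database connection makes the service not-ready, but it is still
// alive -- restarting the process would not help.
class ServiceState {
    private volatile boolean databaseReachable;

    // Liveness: the process is running and able to respond at all.
    boolean isAlive() {
        return true; // if this code executes, the process is alive
    }

    // Readiness: the service can actually do useful work right now.
    boolean isReady() {
        return databaseReachable;
    }

    void setDatabaseReachable(boolean reachable) {
        this.databaseReachable = reachable;
    }
}
```

Using `isReady()` as the liveness signal here would restart the process on every database outage, which is exactly the needless restart described above.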
Can we fix this issue for the Health Check 1.1 release?
Introduction of For backward compatibility, I suggest that we keep
To keep the URL space occupied by MP as small as possible, I'd suggest not introducing
@Ladicek I understand the URL-space-saving requirement, but wouldn't it be confusing to have What about
I don't really have strong feelings about this, and I agree it might be confusing.
I prefer the endpoints /health, /health/ready and /health/live
Agree that Very much disagree that
I would also prefer having everything under
+1 to everything under
Started a PR in this direction
For CF (single endpoint), the behavior of the "health check" (avoiding URL) is readiness. As soon as the check returns a 200, requests are routed. In this case, you configure a delay (don't look for OK health responses for some interval after process start). If the health check returns anything else, the process is restarted. This means, in practice, that a CF health check == readiness.

Kubernetes can also be configured with initial delays. You can get by in Kubernetes by only specifying a readiness check with an initial delay (though this is less than ideal).

We need to remember that these services are transient things, and avoid being overly sensitive about killing and restarting the process. If it isn't ready, it most likely should die (unless it is already dying), as that is the safest way to re-establish connections and return a process to health.
Yea, well, there's probably a reason why liveness and readiness are two distinct concepts in Kubernetes. I have to vehemently disagree with "if it isn't ready, it most likely should die" -- that's not true. If it isn't alive, it should die. There are legitimate cases when a service is alive but not yet ready.
"service is alive but not yet ready" -- yes, which is why the (albeit fragile) initialization delay exists. Kubernetes liveness/readiness definitely makes this part more robust / less fragile (and even then, properly setting Assuming something has reached the ready state, if it subsequently stops being ready, I stand by what I said re: safety. Restarting is the most generally applicable / safest way to recover from problems (equivalent to turning it off and back on again).
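The delay-based configuration discussed here looks roughly like the following in a Kubernetes container spec. All values are illustrative, and the endpoint paths follow the `/health/live` and `/health/ready` naming suggested earlier in this thread:

```yaml
# Illustrative values only: a readiness probe plus a separate liveness probe,
# each with an initial delay so a slow start-up is not mistaken for a dead process.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10   # give the process time to boot before liveness counts
  periodSeconds: 10
  failureThreshold: 3       # restart only after repeated failures
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5          # failure only removes the Pod from load balancing
```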
+1 to this feature. It will be great to distinguish between the two cases (readiness and liveness). We will take advantage of this in our Helm chart.
In terms of URLs, I'd like to see
My apologies if this thread does not match the latest thinking, but here's my 2c regarding the mapping of Liveness vs Readiness to MP Health, because some of the above discussion doesn't fit my understanding.

When I look at the definitions of Liveness & Readiness in the Kubernetes docs [1], they say:

The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

With the current MP Health examples I've seen, they've talked about testing database connections or service dependencies. If the test of the connection or service fails, we return "DOWN" for the Health. That, to my mind, most closely matches Kubernetes' definition of Readiness. The app and server are functioning as expected but aren't ready to receive work. No amount of container restarting will fix things.

Regarding Liveness, I'm a bit skeptical as to whether an application can say whether it's alive or not. If it's dead, it can't say so. :/ Kubernetes Liveness feels like a lower-level concern. A bit like: can I ping my server to see if it's there, and if it's not, I'll restart the container in the hope it comes back. Maybe I'm alone in this view.

I would therefore assert: Readiness => Current MP Health => If it's down, wait for it to become healthy.

[1] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
Remember, if the endpoint does not respond within the time allowance, k8s will take
/health/ready should test dependent database connections, etc.
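A dependent-connection test of the kind suggested for /health/ready could be as simple as a TCP connect with a short timeout. This is a sketch; the class name, host, and port below are made up for illustration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of a readiness-style dependency probe: succeed only if the
// dependency accepts a TCP connection within the timeout.
class DependencyCheck {
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;  // dependency reachable -> report UP from /health/ready
        } catch (IOException e) {
            return false; // dependency unreachable -> report DOWN
        }
    }
}
```

A real check would typically validate the dependency at the protocol level (e.g. a JDBC validation query) rather than just the TCP layer, but the readiness semantics are the same.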
Thanks @Emily-Jiang, I'm happy with the types of things you describe for readiness and liveness checks and the subsequent actions.
This maps directly to Kubernetes liveness and readiness probes, but the general idea may be covered by the Healthcheck API.
There are different types of health checks performed by Kubernetes:
Typically, the readiness probe will return FAIL when the application is initializing itself.
Once it is initialized (and ready to serve requests), the probe will return OK.
The liveness probe may return OK during initialization (no problems have been detected) even though the application is not yet ready to be used.
I propose that we add annotations (and corresponding HTTP endpoints) to the spec and API. For example:

- A `@Ready` annotation could be used to identify a health check procedure checking the readiness of the application. It would return `DOWN` when the application is not ready to serve requests (because it is initializing, or its initialization has failed, etc.)
- `@Health` (or `@Live`) would identify a health check procedure checking the liveness of the application (any health check procedure without one of these annotations would be considered a `@Health`-y one). It would return `DOWN` if a problem has been detected by the producer.

Each of these annotations would have an HTTP endpoint associated with it:

- `/health` would query `@Health` procedures only => a `DOWN` global outcome means that the application is not healthy
- `/ready` would query `@Ready` procedures only => a `DOWN` global outcome means that the application is not ready (yet) to serve its purpose (but that could change over time)
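The proposal can be sketched with stand-in types. The annotations, enum, and reflective dispatch below are illustrative only; the eventual MicroProfile Health API may look quite different:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Stand-in annotations mirroring the proposed @Ready and @Health markers.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Ready {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Health {}

enum Status { UP, DOWN }

class AppChecks {
    private volatile boolean initialized;

    @Ready
    public Status initializationComplete() { // DOWN while still initializing
        return initialized ? Status.UP : Status.DOWN;
    }

    @Health
    public Status noFaultDetected() {        // liveness: UP unless a fault is found
        return Status.UP;
    }

    void finishInitialization() { initialized = true; }

    // "/ready" would run only @Ready procedures, "/health" only @Health ones;
    // any single DOWN makes the global outcome DOWN.
    Status outcome(Class<? extends java.lang.annotation.Annotation> kind) throws Exception {
        Status result = Status.UP;
        for (Method m : getClass().getMethods()) {
            if (m.isAnnotationPresent(kind) && m.invoke(this) == Status.DOWN) {
                result = Status.DOWN;
            }
        }
        return result;
    }
}
```

During initialization this model reports `DOWN` from the `/ready` aggregation while `/health` stays `UP`, matching the probe behavior described at the top of the proposal.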