Provide different types of health check #35

Closed · jmesnil opened this issue Jul 7, 2017 · 29 comments

Comments

@jmesnil (Contributor) commented Jul 7, 2017

This maps directly to Kubernetes liveness and readiness probes, but the general idea may be covered by the Healthcheck API.

There are different types of health checks performed by Kubernetes:

  • readiness (when an application is up and ready to serve requests)
  • liveness (when an application is up and running without any issue)

Typically, the readiness probe will return FAIL while the application is initializing itself.
Once it is initialized (and ready to serve requests), the probe will return OK.
The liveness probe may return OK during initialization (no problems have been detected) even though the application is not yet ready to be used.

I propose that we add annotations (and corresponding HTTP endpoints) to the spec and API.
For example:

  • the @Ready annotation could be used to identify a health check procedure checking the readiness of the application. It would return DOWN when the application is not ready to serve requests (because it is initializing, its initialization has failed, etc.)
  • the @Health (or @Live) annotation would identify a health check procedure checking the liveness of the application (any health check procedure without one of these annotations would be considered a @Health one). It would return DOWN if a problem has been detected by the producer.

Each of these annotations would have an HTTP endpoint associated with it:

  • /health would query @Health procedures only
    => a DOWN global outcome means that the application is not healthy
  • /ready would query @Ready procedures only.
    => a DOWN global outcome means that the application is not ready (yet) to serve its purpose (but that could change over time)
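
To make the proposal concrete, here is a rough sketch of what such a procedure might look like to an application developer. The @Ready qualifier is hypothetical (it only exists in this proposal); HealthCheck and HealthCheckResponse are the existing MicroProfile Health 1.0 types, and the initialization flag is purely illustrative:

    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import javax.enterprise.context.ApplicationScoped;
    import javax.inject.Qualifier;
    import org.eclipse.microprofile.health.HealthCheck;
    import org.eclipse.microprofile.health.HealthCheckResponse;

    // Hypothetical qualifier from the proposal above: marks a readiness
    // procedure, queried by /ready only.
    @Qualifier
    @Retention(RetentionPolicy.RUNTIME)
    @interface Ready { }

    @Ready
    @ApplicationScoped
    class StartupCheck implements HealthCheck {

        private volatile boolean initialized;   // flipped to true once initialization completes

        @Override
        public HealthCheckResponse call() {
            // DOWN while the application is still initializing, UP once it can serve requests
            return initialized
                    ? HealthCheckResponse.named("startup").up().build()
                    : HealthCheckResponse.named("startup").down().build();
        }
    }
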
@rsvoboda (Contributor)

As per the #29 proposal, would you map the different types of health checks to /health/<deployment name>/ready and /health/<deployment name>/health (or /health/<deployment name>/live)?

Could the /health and /ready endpoints be exposed via the deployment URL, something like https://my.service.cloud/my-app-001/ready ?

@heiko-braun (Contributor) commented Jul 10, 2017

@rsvoboda I think your latter proposal, host:port/app-context/ready, is more suitable and closer to the EE deployment models. Not that we should put this first, but compatibility with EE should be a goal.

@jmesnil (Contributor, Author) commented Aug 1, 2017

As discussed in the August 1st call, this issue should be postponed to 1.1, as 1.0 should focus on delivering a single /health endpoint.
Any other type of "health" (such as /ready) can be added in 1.1 or later with a proper set of annotations.

@cescoffier (Contributor)

It's probably time to revisit this issue. Having two endpoints is crucial in a Kubernetes environment, as liveness and readiness checks are different and serve different purposes. They are likely to use different sets of procedures.

@kenfinnigan

+1 to revisiting.

In terms of endpoints, I like the idea of /health and /ready to separate them.

Although maybe it needs to go further: retain the existing /health for detailed health information, and add new /ready and /live endpoints that may re-use some of the checks defined for /health. To not pollute the namespace, they could be sub-endpoints of /health, i.e. /health/ready and /health/live.

@cescoffier (Contributor)

I think I prefer the /health/ready and /health/alive (live or alive?) endpoints, so everything is under the /health context.

Now the question is what should /health return? A composition of both readiness and liveness checks?

@kenfinnigan

Maybe /health is a rollup of all health checks, and there are separate annotations to indicate which are specifically @Ready and @Alive checks?

@Ladicek commented Jun 20, 2018

One concern with /health being a logical conjunction of all existing health checks: once you start adding @Ready and @Alive health checks, where typically an app can be alive but not ready, the conjunction becomes basically equivalent to "ready". In such a situation, if you configure a liveness check to point to /health, you have a problem.

It's not as clean, but it might be better to treat existing @Health checks, presented at /health, as liveness, and only introduce one new concept: readiness. /health would be a logical conjunction of all @Health checks (like it is today), and something like /health/readiness would be a logical conjunction of all @Health checks and all @Readiness checks (that is, readiness would be a superset of liveness).
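
For illustration, a minimal sketch of that aggregation rule, with plain lists of check outcomes standing in for the real procedures (this is not an actual MicroProfile Health API, just the boolean logic being proposed):

    import java.util.List;

    public class HealthAggregation {

        enum Status { UP, DOWN }

        // /health (liveness): logical conjunction of the @Health checks only
        static Status health(List<Status> healthChecks) {
            return healthChecks.stream().allMatch(s -> s == Status.UP) ? Status.UP : Status.DOWN;
        }

        // /health/readiness: conjunction of the @Health checks AND the @Readiness checks,
        // i.e. readiness is a superset of liveness
        static Status readiness(List<Status> healthChecks, List<Status> readinessChecks) {
            boolean allUp = health(healthChecks) == Status.UP
                    && readinessChecks.stream().allMatch(s -> s == Status.UP);
            return allUp ? Status.UP : Status.DOWN;
        }
    }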

Thoughts?

@OndroMih

@Ladicek: In such a situation, if you configure a liveness check to point to /health, you have a problem.

What do you mean by you have a problem? I don't see any problems yet. I assume that "readiness" checks are very cheap as they are expected to be called very often during a start-up phase. I also assume that all "readiness" checks are also very simple "liveness" checks, so it shouldn't be a problem to include them in "liveness" checks. Do you have any examples where a "readiness" check wouldn't make sense as a "liveness" check?

@Ladicek commented Jul 23, 2018

If that was true, then you wouldn't need to distinguish between liveness and readiness at all. My favorite example to illustrate the difference between liveness and readiness is a vital external service, e.g. a database. If the application can't reach its database, it is definitely not ready (= can't accept requests), but may very well be alive (= doesn't need to be restarted). If this readiness check would be used as a liveness check as well, then the application would get needlessly restarted. Another possibility is that becoming ready may take a while (populating internal caches or whatnot) -- again, the application is alive but not ready, and if readiness checks are used as liveness checks, the application will be restarted as not alive, actually preventing it from reaching the ready state. Makes sense?
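
As an illustration of the database example, here is a minimal sketch using the existing MicroProfile Health 1.0 HealthCheck API (the data source name is an assumption). Per the argument above, a check like this should only ever feed a readiness probe, never a liveness probe:

    import java.sql.Connection;
    import javax.annotation.Resource;
    import javax.enterprise.context.ApplicationScoped;
    import javax.sql.DataSource;
    import org.eclipse.microprofile.health.HealthCheck;
    import org.eclipse.microprofile.health.HealthCheckResponse;

    @ApplicationScoped
    public class DatabaseCheck implements HealthCheck {

        @Resource(lookup = "java:app/jdbc/ExampleDS")   // JNDI name is purely illustrative
        DataSource dataSource;

        @Override
        public HealthCheckResponse call() {
            try (Connection connection = dataSource.getConnection()) {
                return connection.isValid(2)             // 2-second validation timeout
                        ? HealthCheckResponse.named("database").up().build()
                        : HealthCheckResponse.named("database").down().build();
            } catch (Exception e) {
                // The database is unreachable: not ready, but the process itself is fine.
                return HealthCheckResponse.named("database").down().build();
            }
        }
    }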

@Emily-Jiang (Member)

Can we fix this issue for Health Check 1.1 release?

@antoinesd (Contributor)

Can we fix this issue for Health Check 1.1 release?

Introduction of /liveness and /readiness is planned for 1.1 release.

For backward compatibility I suggest that we keep /health (while deprecating its usage). However, we shouldn't change its current behavior, again for backward compatibility's sake.

@Ladicek commented Aug 27, 2018

To keep the URL space occupied by MP as small as possible, I'd suggest not introducing /liveness and /readiness, but instead /health/liveness and /health/readiness. (And of course I agree that /health should be equivalent to /health/liveness.)

@antoinesd (Contributor)

@Ladicek I understand the URL space-saving requirement, but wouldn't it be confusing to have /health/liveness, /health/readiness, and a prefix of these URLs, /health, all as endpoints?

What about /health, /hc/liveness and /hc/readiness ?

@Ladicek commented Aug 27, 2018

I don't really have strong feelings about this, and I agree it might be confusing.

@Emily-Jiang (Member)

I prefer the endpoints /health, /health/ready and /health/live,
and also suggest that /health be the combination of /health/ready and /health/live.

@Ladicek commented Aug 28, 2018

Agree that /health/live is much better than /health/liveness. Same for ready/readiness.

Very much disagree that /health should be a combination of /health/live and /health/ready, but I think I already explained my opinion on that above.

@cescoffier (Contributor)

I would also prefer having everything under /health. I don't mind dropping the ness suffix. However, should we make it configurable?

@kenfinnigan

+1 to everything under /health.
+1 to dropping ness.
+1 to endpoints being configurable, including the root /health.
+1 to /health not being a combination of sub-endpoints, purely for backwards compatibility reasons.

@antoinesd (Contributor) commented Sep 6, 2018

Started a PR in these directions.

@ebullient

For CF (single endpoint), the behavior of the "health check" (avoiding a specific URL) is readiness. As soon as the check returns a 200, requests are routed. In this case, you configure a delay (don't look for OK health responses for some interval after process start). If the health check returns anything else, the process is restarted.

This means, in practice, that a CF health check == readiness.

Kubernetes can also be configured with initial delays. You can get by in Kubernetes by only specifying a readiness check with an initial delay (though less than ideal).

We need to remember that these services are transient things, and avoid being overly-sensitive to killing and restarting the process. If it isn't ready, it most likely should die (unless it is already dying), as that is the safest way to re-establish connections and return a process to health.

@Ladicek commented Sep 21, 2018

Yea, well, there's probably a reason why liveness and readiness are two distinct concepts in Kubernetes. I have to vehemently disagree with

If it isn't ready, it most likely should die

That's not true. If it isn't alive, it should die. There are legitimate cases when a service is alive but not yet ready.

@ebullient commented Sep 21, 2018

"service is alive but not yet ready." -- yes, which is why the (albeit fragile) initialization delay exists. Kubernetes liveness/readiness definitely makes this part more robust / less fragile (and even then, properly setting *.initialDelaySeconds is a thing)

Assuming something reached ready state, if it subsequently stops being ready, I stand by what I said re: safety. Restarting is the most generally applicable/safest way to recover from problems (equivalent to turning it off and back on again)

@arthurdm (Member)

+1 to this feature. It will be great to distinguish between the two cases (readiness and liveness). We will take advantage of this in our Helm chart.

@gunnarmorling commented Oct 29, 2018

In terms of URLs, I'd like to see /health remaining for liveness probes and /health/ready being added for readiness probes. That doesn't consume more URL space than before while still allowing the two sorts of probes to be told apart.

@gcharters commented Nov 23, 2018

My apologies if this thread does not match latest thinking, but here's my 2c regarding the mapping of Liveness vs Readiness to MP Health because some of the above discussion doesn't fit my understanding.

When I look at the definition of Liveness & Readiness in the Kubernetes docs [1], it says:

The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

With the current MP Health examples I've seen, they've talked about testing database connections or service dependencies. If the test of the connection or service fails, we return "DOWN" for the Health. That, to my mind, most closely matches Kubernetes' definition of Readiness. The app and server are functioning as expected but aren't ready to receive work. No amount of container restarting will fix things.

Regarding Liveness, I'm a bit skeptical as to whether an application can say whether it's alive or not. If it's dead, it can't say so. :/ Kubernetes Liveness feels like a lower-level concern. A bit like: can I ping my server to see if it's there, and if it's not, I'll restart the container in the hope it comes back. Maybe I'm alone in this view.

I would therefore assert:

Readiness => Current MP Health => If it's down, wait for it to become healthy.
Liveness => Something TBD => If it's down, restart the container.

[1] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

@Emily-Jiang (Member)

@gcharters

Regarding Liveness, I'm a bit skeptical as to whether an application can say whether it's alive or not. If it's dead, it can't say so. :/ Kubernetes Liveness feels like a lower-level concern. A bit like: can I ping my server to see if it's there, and if it's not, I'll restart the container in the hope it comes back. Maybe I'm alone in this view.

Remember, if the endpoint does not respond within the time allowance, k8s will take DOWN as the answer. The liveness check will be checking on the JVM or the system, e.g. running out of memory, no threads left, etc.

@Emily-Jiang (Member)

/health/ready -- should test dependent database connections, etc.
/health/live -- should test that the system is healthy, e.g. not running out of memory.
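
For illustration, a minimal sketch of that kind of liveness procedure using the MicroProfile Health 1.0 HealthCheck API; the 90% heap threshold and the check name are assumptions, not anything defined by the spec:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;
    import javax.enterprise.context.ApplicationScoped;
    import org.eclipse.microprofile.health.HealthCheck;
    import org.eclipse.microprofile.health.HealthCheckResponse;

    @ApplicationScoped
    public class HeapCheck implements HealthCheck {

        @Override
        public HealthCheckResponse call() {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long max = heap.getMax();                        // -1 if no maximum is defined
            boolean healthy = max <= 0 || heap.getUsed() < 0.9 * max;
            return healthy
                    ? HealthCheckResponse.named("heap-memory").up().build()
                    : HealthCheckResponse.named("heap-memory").down().build();
        }
    }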

@gcharters

Thanks @Emily-Jiang, I'm happy with the types of things you describe for readiness and liveness checks and the subsequent actions.
