-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify warn status code #25
Conversation
status, HTTP response code in the 4xx-5xx range MUST be used. In case of the | ||
“warn” status, endpoints SHOULD return HTTP status in the 2xx-3xx range, and | ||
the HTTP response code returned by the health endpoint. For “pass” status, | ||
HTTP response code in the 2xx-3xx range MUST be used. For “fail” status, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
207 is WebDAV. Not all web frameworks (clients or servers) will have an understanding of the WebDAV status codes.
This RFC should probably be conservative with the other HTTP-oriented RFCs it relies on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: “warn” ... some people will put thresholds in their health checks, but, as a best practice, I personally will avoid putting the service metrics thresholds in my code, and I will put the thresholds in my external monitoring and control systems.
To me “warn” should be oriented towards “degraded” states. Think failing health checks that are neither liveness checks nor readiness checks. When liveness or readiness health checks are failing, I am “down.” When non-liveness and non-readiness health checks are failing, I am “warn”, which to me means “up and able to serve but degraded (relying on sane fallbacks or recording things for deferred batch operations upon service restoration, etc.).”
“warn” status, endpoints SHOULD return HTTP status in the 2xx-3xx range, and | ||
the HTTP response code returned by the health endpoint. For “pass” status, | ||
HTTP response code in the 2xx-3xx range MUST be used. For “fail” status, | ||
HTTP response code in the 4xx-5xx range MUST be used. In case of the “warn” |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For fail
, returning a 5XX sounds like a paradox, doesn't it? The endpoint is working and responding with information so it should be an OK code (but it's telling you the service is unavailable) hehe. Also, I'm not sure, but isn't there a risk of having "response bodies with status information" dropped because of using a 5XX in some systems out there? What's your view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on status codes only and entirely ignoring message bodies is already a well-established pattern in cloud-native autonomous control plane systems.
I’ll dig up some Kubernetes and Azure links where they talk about relying only on status codes from health/liveness/readiness endpoints for orchestration concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does indeed make more sense to use a 4xx error code when an API can respond and it is in 'fail' mode. However, if it cannot even respond, the response code will likely be a 5xx (just like @taliaga noted) so 5xx code is equivalent of 'fail'. To ensure there are no weird edge-cases it therefore follows that responding with a 5xx for some 'fails' should be ok. :) /shrug
This PR looks good to me and should probably be merged. However, I noticed that @taliaga and @derekm had an interesting conversation about examples of http response code usage. If it's OK with everybody - I think we should separate that conversation from this one, to not delay acceptance of this change. Sounds ok? |
Sure, I don’t think I raised any actual issues. @taliaga raised objections, but in contradiction to common practice. I never did dig up those links, so I’ll do that now. |
Azure health: https://docs.microsoft.com/en-us/azure/architecture/patterns/health-endpoint-monitoring
|
Kubernetes health: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request
|
No description provided.