Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify warn status code #25

Merged
merged 1 commit into from
Oct 28, 2018
Merged

Clarify warn status code #25

merged 1 commit into from
Oct 28, 2018

Conversation

taliaga
Copy link
Contributor

@taliaga taliaga commented Oct 19, 2018

No description provided.

@taliaga taliaga mentioned this pull request Oct 19, 2018
status, HTTP response code in the 4xx-5xx range MUST be used. In case of the
“warn” status, endpoints SHOULD return HTTP status in the 2xx-3xx range, and
the HTTP response code returned by the health endpoint. For “pass” status,
HTTP response code in the 2xx-3xx range MUST be used. For “fail” status,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of giving here at least one specific [suggested] example code for each status case. Wdyt?
pass: 200
warn: 207?
fail: 424 maybe?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

207 is WebDAV. Not all web frameworks (clients or servers) will have an understanding of the WebDAV status codes.

This RFC should probably be conservative with the other HTTP-oriented RFCs it relies on.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: “warn” ... some people will put thresholds in their health checks, but, as a best practice, I personally will avoid putting the service metrics thresholds in my code, and I will put the thresholds in my external monitoring and control systems.

To me “warn” should be oriented towards “degraded” states. Think failing health checks that are neither liveness checks nor readiness checks. When liveness or readiness health checks are failing, I am “down.” When non-liveness and non-readiness health checks are failing, I am “warn”, which to me means “up and able to serve but degraded (relying on sane fallbacks or recording things for deferred batch operations upon service restoration, etc.).”

“warn” status, endpoints SHOULD return HTTP status in the 2xx-3xx range, and
the HTTP response code returned by the health endpoint. For “pass” status,
HTTP response code in the 2xx-3xx range MUST be used. For “fail” status,
HTTP response code in the 4xx-5xx range MUST be used. In case of the “warn”
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fail, returning a 5XX sounds like a paradox, doesn't it? The endpoint is working and responding with information so it should be an OK code (but it's telling you the service is unavailable) hehe. Also, I'm not sure, but isn't there a risk of having "response bodies with status information" dropped because of using a 5XX in some systems out there? What's your view?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on status codes only and entirely ignoring message bodies is already a well-established pattern in cloud-native autonomous control plane systems.

I’ll dig up some Kubernetes and Azure links where they talk about relying only on status codes from health/liveness/readiness endpoints for orchestration concerns.

Copy link
Owner

@inadarei inadarei Oct 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does indeed make more sense to use a 4xx error code when an API can respond and it is in 'fail' mode. However, if it cannot even respond, the response code will likely be a 5xx (just like @taliaga noted) so 5xx code is equivalent of 'fail'. To ensure there are no weird edge-cases it therefore follows that responding with a 5xx for some 'fails' should be ok. :) /shrug

@inadarei
Copy link
Owner

inadarei commented Oct 28, 2018

This PR looks good to me and should probably be merged. However, I noticed that @taliaga and @derekm had an interesting conversation about examples of http response code usage. If it's OK with everybody - I think we should separate that conversation from this one, to not delay acceptance of this change.

Sounds ok?

@derekm
Copy link

derekm commented Oct 28, 2018

Sure, I don’t think I raised any actual issues. @taliaga raised objections, but in contradiction to common practice.

I never did dig up those links, so I’ll do that now.

@derekm
Copy link

derekm commented Oct 28, 2018

Azure health: https://docs.microsoft.com/en-us/azure/architecture/patterns/health-endpoint-monitoring

Most existing tools and frameworks look only at the HTTP status code that the endpoint returns. To return and validate additional information, you might have to create a custom monitoring utility or service.

@derekm
Copy link

derekm commented Oct 28, 2018

Kubernetes health: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

@inadarei inadarei merged commit 4a80ba5 into inadarei:master Oct 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants