diff --git a/keps/prod-readiness/sig-node/2727.yaml b/keps/prod-readiness/sig-node/2727.yaml
index 42f8eb3114c..29e01a188ee 100644
--- a/keps/prod-readiness/sig-node/2727.yaml
+++ b/keps/prod-readiness/sig-node/2727.yaml
@@ -3,3 +3,5 @@ alpha:
   approver: "@johnbelamaric"
 beta:
   approver: "@johnbelamaric"
+stable:
+  approver: "@johnbelamaric"
diff --git a/keps/sig-node/2727-grpc-probe/README.md b/keps/sig-node/2727-grpc-probe/README.md
index 56a44760b4b..6b48eb1dda9 100644
--- a/keps/sig-node/2727-grpc-probe/README.md
+++ b/keps/sig-node/2727-grpc-probe/README.md
@@ -2,13 +2,19 @@
 - [Release Signoff Checklist](#release-signoff-checklist)
-- [Goals](#goals)
-- [Non-Goals](#non-goals)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
-  - [Test Plan](#test-plan)
   - [Alternative Considerations](#alternative-considerations)
+  - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+      - [Unit tests](#unit-tests)
+      - [Integration tests](#integration-tests)
+      - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
     - [Beta](#beta)
@@ -23,11 +29,13 @@
   - [Scalability](#scalability)
   - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
-- [Implementation History](#implementation-history-1)
   - [Alpha](#alpha-1)
   - [Beta](#beta-1)
+  - [GA](#ga-1)
+- [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
 - [References](#references)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -52,7 +60,25 @@
 [kubernetes/kubernetes]: https://git.k8s.io/kubernetes
 [kubernetes/website]: https://git.k8s.io/website
 
-## Goals
+## Summary
+
+Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe.
+
+## Motivation
+
+gRPC is a widespread RPC framework.
+Existing solutions for adding
+probes to gRPC apps, such as exposing an additional HTTP endpoint
+for health checks or packaging an external gRPC client as part of
+an image and using exec probes, have many limitations and add overhead.
+
+Many load balancers support gRPC natively, so adding it to
+Kubernetes aligns well with the industry.
+
+Finally, the Kubernetes project actively uses gRPC, so adding built-in
+support for gRPC endpoints does not introduce any new dependencies
+to the project.
+
+### Goals
 
 Enable gRPC probe natively from Kubelet without requiring users to package a
 gRPC healthcheck binary with their container.
@@ -60,9 +86,9 @@ gRPC healthcheck binary with their container.
 - https://github.com/grpc-ecosystem/grpc-health-probe
 - https://github.com/grpc/grpc/blob/master/doc/health-checking.md
 
-## Non-Goals
+### Non-Goals
 
-Add gRPC support in other areas of K8s (e.g. Services).
+- Add gRPC support in other areas of K8s (e.g. Services).
 
 ## Proposal
 
@@ -141,11 +167,6 @@ Note that `GRPCAction.Port` is an int32, which is inconsistent with
 the other existing probe definitions. This is on purpose -- we want to
 move users away from using the (portNum, portName) union type.
 
-### Test Plan
-
-- Unit test: Add unit tests to `pkg/kubelet/prober/...`
-- e2e: Add test case and conformance test to `e2e/common/node/container_probe.go`.
-
 ### Alternative Considerations
 
 Note that `readinessProbe.grpc.service` may be confusing, some
@@ -158,6 +179,47 @@ alternatives considered:
 There were no feedback on the selected name being confusing in the context of a
 probe definition.
 
+### Test Plan
+
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- `k8s.io/kubernetes/pkg/probe/grpc`: `2023/02/06` - `78.1%`
+
+##### Integration tests
+
+N/A, only unit tests and e2e coverage.
+
+##### e2e tests
+
+Tests in `test/e2e/common/node/container_probe.go`:
+
+- should *not* be restarted with a GRPC liveness probe: [results](https://storage.googleapis.com/k8s-triage/index.html?test=Probing%20container%20should%20%5C*not%5C*%20be%20restarted%20with%20a%20GRPC%20liveness%20probe)
+- should be restarted with a GRPC liveness probe: [results](https://storage.googleapis.com/k8s-triage/index.html?test=should%20be%20restarted%20with%20a%20GRPC%20liveness%20probe)
+
+TODO: stress test to validate the scale (see GA requirements).
+
 ### Graduation Criteria
 
 #### Alpha
@@ -177,12 +239,14 @@ Depending on skew strategy:
 
 #### GA
 
-- Address feedback from beta usage
-- Validate that API is appropriate for users. There are some potential tunables:
+- [X] Address feedback from beta usage
+- [X] Validate that API is appropriate for users. There are some potential tunables:
   - `User-Agent`
   - connect timeout
   - protocol (HTTP, QUIC)
-- Close on any remaining open issues & bugs
+- [ ] Close on any remaining open issues & bugs
+- [ ] Promote tests to conformance
+- [ ] Implement a stress test
 
 ### Upgrade / Downgrade Strategy
 
@@ -198,38 +262,12 @@ Downgrade: gRPC probes will not be supported in a downgrade from Alpha.
 
 ## Production Readiness Review Questionnaire
 
-
-
 ### Feature Enablement and Rollback
 
 Feature enablement will be guarded by a feature gate flag.
 
 ###### How can this feature be enabled / disabled in a live cluster?
 
-
-
 - [x] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: `GRPCContainerProbe`
   - Components depending on the feature gate: `kubelet` (probing), API
@@ -250,42 +288,26 @@ It becomes enabled again after the `kubelet` restart.
 
 ###### Are there any tests for feature enablement/disablement?
 
-Y
-es, unit tests for the feature when enabled and disabled will be
+Yes, unit tests for the feature when enabled and disabled will be
 implemented in both kubelet and api server.
 ### Rollout, Upgrade and Rollback Planning
 
-
+We passed the version skew problem for the new API. No planning is required.
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-
+We passed the version skew problem - the API will be available on any supported
+version skew. So no issues are expected with rollout and rollback.
 
 ###### What specific metrics should inform a rollback?
 
-
+Rollback wouldn't address issues. Pods will need to stop using the new probe
+type.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-
+N/A
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
@@ -357,8 +379,27 @@ The overhead of executing probes is consistent with other probe types.
 We expect decrease of disk, RAM, and CPU use for many scenarios where the
 https://github.com/grpc-ecosystem/grpc-health-probe was used to probe gRPC endpoints.
 
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+Yes, gRPC probes use node resources to establish a connection.
+This may lead to issues like [kubernetes/kubernetes#89898](https://github.com/kubernetes/kubernetes/issues/89898).
+
+The node resources for gRPC probes can be exhausted by a Pod with HostPort
+making many connections to different destinations or any other process on a node.
+This problem cannot be addressed generically.
+
+However, the design where node resources are being used for gRPC probes works
+for most setups. The default pod maximum is `110`. There are currently
+no limits on the number of containers. The number of containers is limited by the
+amount of resources requested by these containers.
+With the fix limiting
+the `TIME_WAIT` for the socket to 1 second,
+[this calculation](https://github.com/kubernetes/kubernetes/issues/89898#issuecomment-1383207322)
+demonstrates it will be hard to reach the limits on sockets.
+
 ### Troubleshooting
 
+Logs and Pod events can be used to troubleshoot probe failures.
+
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
 No dependency on etcd availability.
@@ -378,19 +419,6 @@ None
 
 ## Implementation History
 
-
-
-## Implementation History
-
 * Original PR for k8 Prober: https://github.com/kubernetes/kubernetes/pull/89832
 * 2020-04-04: MR for k8 Prober
 * 2021-05-12: Cloned to this KEP to move the probe forward.
@@ -404,6 +432,18 @@ Alpha feature was implemented in 1.23.
 
 Feature is promoted to beta in 1.24.
 
+### GA
+
+Feature is promoted to GA in 1.27.
+
+## Drawbacks
+
+See [Motivation](#motivation) on why gRPC was picked as another RPC framework
+to support natively.
+
+Adding gRPC is a small increment to k8s functionality with very few side
+effects, while providing a lot of "quality of life improvements" to gRPC apps.
+
 ## Alternatives
 
 * 3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe
@@ -411,3 +451,11 @@ Feature is promoted to beta in 1.24.
 ## References
 
 * GRPC healthchecking: https://github.com/grpc/grpc/blob/master/doc/health-checking.md
+
+## Infrastructure Needed (Optional)
+
diff --git a/keps/sig-node/2727-grpc-probe/kep.yaml b/keps/sig-node/2727-grpc-probe/kep.yaml
index b394bfeea10..0e666273197 100644
--- a/keps/sig-node/2727-grpc-probe/kep.yaml
+++ b/keps/sig-node/2727-grpc-probe/kep.yaml
@@ -3,26 +3,27 @@ kep-number: 2727
 authors:
   - "@bowei"
   - "@PxyUp"
+  - "@SergeyKanzhelev"
 owning-sig: sig-node
 participating-sigs:
   - sig-node
   - sig-network
 status: implementable
 creation-date: 2020-04-04
-last-updated: 2021-05-12
+last-updated: 2023-01-31
 reviewers:
   - "@thockin"
   - "@mrunalp"
-  - "@SergeyKanzhelev"
 approvers:
   - "@thockin"
   - "@dchen1107"
 see-also:
-stage: "beta"
-latest-milestone: "v1.24"
+stage: "stable"
+latest-milestone: "v1.27"
 milestone:
   alpha: "v1.23"
   beta: "v1.24"
+  stable: "v1.27"
 feature-gates:
   - name: GRPCContainerProbe
     components: