KEP-3836: Improve Kube-proxy ingress connectivity reliability #3837
Conversation
/cc @aojea
/cc @bowei
> workloads can be assigned to the Node anymore. It is true that it's usually followed up by a drain... but I am not sure it's good to implement something which can easily kill ingress connectivity to a Node / cluster by simply doing `kubectl cordon node ...` (It would kill ingress connectivity to the entire
If the ingress controller is eTP: Local, it will not go into draining mode (because that will naturally happen when the # of pods -> 0).

If the ingress controller is eTP: Cluster, the node will go into connection-draining, but the pod itself is still Ready, and other nodes can target it (cordon node A; external LB -> node B -> pod on node A). The connection draining is to gracefully end connections which come in through this node.

`unschedulable` is imperfect, but the worst case is that this node stops acting as an LB backend for a while. If it becomes schedulable again, running connections are undamaged and new connections can resume.

We could say the same for node-unready (it's not a leading indicator for the node going away, but could be an indicator for other trouble -- or not).

Basically, the cost of getting the HC wrong for eTP: Cluster is relatively low.
We have `networkNotReady`; this seems to be "Service Network Not Ready".
> (because that will naturally happen when the # of pods -> 0).

I was talking about only cordoning the node in this case, i.e. not followed by a drain.

> we have `networkNotReady`, this seems to be Service Network Not Ready

But that's tied to the network plugin, not the service proxy, right?
> I was talking about only cordoning the node in this case. I.e: not followed by a drain.

If the ingress controller is eTP: Local, it will not go into draining mode - the HC will pass because this node still has endpoints. If you then drain the node, the HC will fail when pods -> 0. If you are not draining the node, there's no impact. So `unschedulable` is fine.

If the ingress controller is eTP: Cluster, the node will go into connection-draining, but the pod itself is still Ready, and other nodes can still target it (cordon node A; external LB -> node B -> pod on node A). If you then drain the node, the pod itself has a termination grace period. If you are not draining the node, there's no impact. So `unschedulable` is fine.

My point was that your concern "It would kill ingress connectivity to the entire cluster" is untrue.
> If the ingress controller is eTP: Local it will not go into draining mode - the HC will pass because this node still has endpoints.

The KEP says above that for eTP:Local the health should be based on both endpoints and kube-proxy health, doesn't it?

Though I can't actually come up with a scenario where that makes more sense than depending just on endpoint readiness/count...

One idea I had: if kube-proxy has stopped getting updates from the apiserver, then it won't be able to know if the local endpoints are terminating/unready, so it might keep accepting connections, only to have the endpoint pod abruptly die. So in that case it would be good to have the LB stop sending to the pod when kube-proxy reports unhealthy. EXCEPT... kube-proxy doesn't actually report that it's unhealthy if it stops getting updates from the apiserver, so...

EDIT: oh, right, I was only thinking about cases where you had a working endpoint and then kube-proxy stopped working, but the problem is the other way around, as Alexander explains below in Motivation: we don't want to claim an eTP:Local service is healthy/available if it has local endpoints but kube-proxy has not managed to create the rules forwarding to it.

Though I guess that doesn't affect the "draining" case...
> The KEP says above that for eTP:Local the health should be based on both endpoints and kube-proxy health, doesn't it?

So, yes: like Dan said. This is what this KEP aims to change. And in that case we fail the HC when the node is only cordoned (unschedulable), causing all new connections to the ingress controller sitting on that Node to fail. And people might choose to put a node in the `unschedulable` state for a variety of reasons, many of which I can't even imagine while I'm writing this. That's why I find it scary.

I think my company actually does that at one point when we upgrade Kube. We manage the node pools where our workloads run and we do this in manual mode: when we need to upgrade them, we create a new node pool with version N+1, then we cordon all existing Nodes in the cluster, but don't evict them; we call some service which will evict them later. But we expect ingress connectivity to work on the unschedulable nodes until that eviction service has kicked in and decided to trigger a restart (which might be minutes / hours). Killing ingress to these workloads for that time would be bad.
I really hate to be advocating for this, but... what if we use the `ToBeDeleted` taint for this? I mean, we are really talking about CA downscaling here and trying to find closely related fields on the Node object which the CA also manipulates during downscaling to indicate "the node is going away". But the most explicit field for that right now is the `ToBeDeleted` taint; it has less ambiguity to it than `unschedulable`...
I think there are some crossed wires. We have two different types of health signals:
- the kube-proxy's own health (currently based on recency of DP updates).
- whether a given service has endpoints on this node or not.

We (well, some clouds) use the first as the LB HC for eTP:Cluster services. Presumably everyone uses the second as the LB HC for eTP:Local. Agree so far?

I'm advocating, first and foremost, that we expand the eTP:Cluster LB HC to include some other "eager" signal (e.g. `unschedulable`) that the node is likely to go away soon. This has no bearing on the availability of the pods on that node, only on "thru traffic" which uses the node as a 2nd-hop LB.

I'm advocating, though less strenuously, that we expand the eTP:Local LB HC to include other indication(s) of the node's ability to handle traffic (e.g. DP recency). I am NOT advocating to include `unschedulable` for this. This could impact the availability of the pods on that node, so we must choose wisely (or choose to ONLY consider endpoints-available).
IOW:

```go
func etpClusterHealth() bool {
    return dataplaneHealthy() && !nodeGoingAway()
}

func etpLocalHealth(svc string) bool {
    return haveEndpoints(svc) && dataplaneHealthy()
}
```

`nodeGoingAway()` can be as simple as `return node.spec.unschedulable`, but there's no reason not to make it consider more, I think:
- unschedulable is often followed by drain and delete
- ToBeDeleted is usually followed by drain and delete, but is also usually set at the same time as unschedulable
- deletionTimestamp clearly indicates impending death
- The "exclude from LB" label should cause the controller to remove the node, but we could also use it here
- ...
Does that help?
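To make the signals in that list concrete, here is a hedged sketch of a multi-signal `nodeGoingAway()` in Go. The `Node` struct and helper names are illustrative stand-ins (the real types live in k8s.io/api/core/v1), not an existing kube-proxy API; the taint and label keys are the ones the cluster autoscaler and service controllers use today:

```go
package main

import "fmt"

// Minimal stand-ins for the Node fields discussed above.
type Taint struct{ Key string }

type Node struct {
	Unschedulable     bool
	Taints            []Taint
	DeletionTimestamp *string // non-nil means deletion is in progress
	Labels            map[string]string
}

const (
	toBeDeletedTaint   = "ToBeDeletedByClusterAutoscaler"
	excludeFromLBLabel = "node.kubernetes.io/exclude-from-external-load-balancers"
)

// nodeGoingAway aggregates the "eager" signals from the list above. Any one
// of them is enough to stop advertising this node as an eTP:Cluster backend.
func nodeGoingAway(n *Node) bool {
	if n.Unschedulable || n.DeletionTimestamp != nil {
		return true
	}
	if _, excluded := n.Labels[excludeFromLBLabel]; excluded {
		return true
	}
	for _, t := range n.Taints {
		if t.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

func main() {
	cordoned := &Node{Unschedulable: true}
	healthy := &Node{}
	fmt.Println(nodeGoingAway(cordoned), nodeGoingAway(healthy)) // true false
}
```

Since the checks are OR-ed, adding or dropping a signal later (e.g. if `unschedulable` turns out to be too aggressive, per the cordon-without-drain concern above) is a one-line change.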
Yes, that makes sense.
I'm +1 on the KEP, addressing the feedback from the comments. But I strongly suggest that this has to be tested and guarded with a feature gate, and graduate based on the results of the tests; the last change on the load balancer logic had to be reverted (kubernetes/kubernetes#112807) and caused an important impact in GCP.
Freeze is < 48 hours.
@thockin - fwiw, it's not even tracked for the release yet

> ###### Are there any tests for feature enablement/disablement?
> - Validate that TCP ingress connections are gracefully terminated when:
Please clarify that those tests will be added (they don't yet exist).
We cannot test graceful termination, since that will depend on the LB implementation... we can test that a drained node will not receive new connections from a load balancer (its health check will start to fail).
These are very difficult tests to automate and require flipping nodes to unschedulable, waiting for HCs to fail, and then coming in through an external LB. Easier to do manually, but not as robust. Do standard e2es allow this sort of disruption?
@aojea can help design good tests here. :)
> - Have kube-proxy accurately reflect its `healthz` state in its response to LB HCs for `eTP:Local` services.
> - Offer a better capability of connection draining terminating Nodes, for load
As we discussed on Slack, I am concerned that changing the current /healthz behavior will have a bad effect.

kube-proxy's /healthz serves two distinct purposes. First, it indicates whether the kube-proxy is "healthy" (according to our internally defined metric of data-plane staleness). Some people run kube-proxy as a Pod, and presumably point the livenessProbe at /healthz (though it looks like kubeadm doesn't set up liveness for kube-proxy).

Second, we point cloud LB healthchecks at /healthz to indicate "is this node a suitable 1st-hop backend (basically a router) for this LB?".

We're talking about changing /healthz to consider both data-plane staleness AND some other thing(s) that would indicate this node is NOT a suitable 1st-hop backend (whether we use `unschedulable` or the `ToBeDeleted` taint or even node `deletionTimestamp`).

Changing /healthz will affect both of the above. Do we really want kube-proxy to be restarted when the node becomes `unschedulable`? That doesn't seem right. A node drain might take quite a while, and during that time the kube-proxy will fail liveness repeatedly and be restarted.

The advantage of making /healthz consider `unschedulable` (or whatever) is that clouds already point LBs at /healthz, so they will all get better "for free".

The alternative is to leave /healthz as it is and add /lbhealthz or something which is specifically geared for LB HCs. The downside is that cloud code needs to be updated and every existing LB needs to be updated to use it.

Another option could be to add a /livenessz URL with the "older" semantics and get people to use that for liveness. That's strictly a break, but I can't honestly tell if anyone sets up liveness for kube-proxy. kubeadm does not. GKE does not.
> Some people run kube-proxy as a Pod, and presumably point the livenessProbe at /healthz (though it looks like kubeadm doesn't set up liveness for kube-proxy).

we pushed back hard on that kubernetes/kubernetes#75323 ;), but it is a possibility indeed
IMO that argues for a calculated risk - using healthz for LBs is better all around, and the risk of it breaking seems low.

How does this sound?
- refine /healthz as described in this KEP
- add /livez with the "historical" semantic
- (maybe) add /lb-healthz as an alias to /healthz with guidance to prefer this URL for clarity
I am not sure it's worthwhile for us to go through all the hoops of doing these points:
- add /livez with the "historical" semantic
- (maybe) add /lb-healthz as an alias to /healthz with guidance to prefer this URL for clarity

As you said: cloud providers need to change their HC to benefit from this improvement (which, many could argue, is in itself small / an optimization). And cloud providers won't change their HC per service proxy (i.e. one URL for kube-proxy, another for Cilium, etc.). So for any of that to bear fruit we need to also recommend and convince other service proxies to do the same... that feels so far away that it's practically impossible.

Given that, I feel we just do:
- refine /healthz as described in this KEP

and allow the outcome of calculated risks to surface when the KEP progresses through the different stages. If someone is configuring their kube-proxy with a livenessProbe targeting /healthz, then we can decide to drop the feature then. This should, at worst, happen no later than Beta, when the feature is on by default for everyone.
If we think the risk is so significant, we can even extend the Beta phase?
On top of that: this issue should surface quite quickly. Nodes go unschedulable/tainted all the time, and if the effect of that is CrashLooping kube-proxies, someone is likely to report that... I hope :/
The reason I like /livez is because it's low-effort and gives a clear path for setting up a livenessProbe for kube-proxy (which we really SHOULD have). And it doesn't need clouds to update anything except DaemonSet configs.

It's sort of a "while we're here, let's fix this" opportunity.
OK, that's a better argument I guess :)
I will try to add that tomorrow morning.
+1 for adding livez/ as an additional thing in this KEP
@ all: I've updated the KEP; main changes include:
> #### e2e tests
> - Add E2E tests for connectivity to services on terminating nodes and validate graceful termination of TCP connections.
@aojea to elaborate on approach
All my comments are minor now
> Setting that field is usually followed by eviction and Node termination, but it doesn't have as strong of a direct link to the termination/deletion of the Node as the taint does. Users can, for example, decide to cordon all nodes at a given moment in time. Using `.spec.unschedulable` as this signal
We can debate this in more detail later (doesn't need to block KEP, I think), but ONLY handling CA's taint means we don't get any benefit if not using CA. I just don't think that is sufficient.
As of now we do not have a clear answer to the question we are asking, so we will have to approximate.
+1
It would be good to add figuring this out as a GA criterion, maybe?
OK, we should figure out the field before going to Beta right? We don't want to enable things by default and not be sure that it's right, right?
/approve
SIG-approved - @wojtek-t if you have time for PRR; if not, mea culpa.
Just minor comments from me.
I will wait for updates done (especially the livez one) and approve later today.
Updated again, changes:
What I didn't address:
Two minor nits - but I'm approving so that you're not blocked on me, given the deadline later today.
/approve PRR
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
/sig network
/assign @thockin @danwinship