
KEP-3836: Improve Kube-proxy ingress connectivity reliability #3837

Merged on Feb 9, 2023 (1 commit)

Conversation

alexanderConstantinescu (Member)

  • One-line PR description: This KEP proposes a more reliable mechanism for handling ingress connectivity for endpoints on terminating nodes and for endpoints on nodes with unhealthy Kube-proxies.
  • Other comments:

/sig network
/assign @thockin @danwinship

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Feb 3, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 3, 2023
@bowei (Member)

bowei commented Feb 4, 2023

/cc @aojea

@bowei (Member)

bowei commented Feb 4, 2023

/cc @bowei

workloads can be assigned to the Node anymore. It is true that it's usually
followed up by a drain...but I am not sure it's good to implement something
which can easily kill ingress connectivity to a Node / cluster by simply doing
`kubectl cordon node ...` (It would kill ingress connectivity to the entire cluster).
Member

If the ingress controller is eTP: Local it will not go into draining mode (because that will naturally happen when the # of pods -> 0).

If the ingress controller is eTP: Cluster the node will go into connection-draining, but the pod itself is still Ready, and other nodes can target it (cordon node A; external LB -> node B -> pod on node A). The connection draining is to gracefully end connections which come in through this node.

unschedulable is imperfect, but the worst case is that this node stops acting as an LB backend for a while. If it becomes schedulable again, running connections are undamaged and new connections can resume.

We could say the same for node-unready (it's not a leading indicator for node going away, but could be an indicator for other trouble -- or not).

Basically, the cost of getting HC wrong for eTP: Cluster is relatively low.
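For context, the health checks being contrasted here are two different endpoints served by kube-proxy; a minimal probe sketch follows (not from the KEP or this PR; port 10256 assumes kube-proxy's default healthz address, and the healthCheckNodePort value is hypothetical). For eTP:Cluster services the cloud LB typically checks the node-wide health endpoint, while for eTP:Local services it checks the service's dedicated healthCheckNodePort, which only reports healthy while the node has local endpoints.

// Sketch only: probing the two kinds of health endpoints discussed in this thread.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func probe(url string) (int, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	nodeIP := "10.0.0.1"         // hypothetical node address
	healthCheckNodePort := 32123 // hypothetical spec.healthCheckNodePort of an eTP:Local Service

	// eTP:Cluster: node-wide kube-proxy health (default port 10256).
	code, err := probe(fmt.Sprintf("http://%s:10256/healthz", nodeIP))
	fmt.Println("eTP:Cluster node health:", code, err)

	// eTP:Local: per-service health, 200 only while local endpoints exist.
	code, err = probe(fmt.Sprintf("http://%s:%d/", nodeIP, healthCheckNodePort))
	fmt.Println("eTP:Local service health:", code, err)
}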

Member

we have networkNotReady, this seems to be Service Network Not Ready

Member Author

(because that will naturally happen when the # of pods -> 0).

I was talking about only cordoning the node in this case, i.e. not followed by a drain.

we have networkNotReady, this seems to be Service Network Not Ready

But that's tied to the network plugin, not the service proxy, right?

Member

I was talking about only cordoning the node in this case. I.e: not followed by a drain.

If the ingress controller is eTP: Local it will not go into draining mode - the HC will pass because this node still has endpoints. If you then drain the node, the HC will fail when pods -> 0. If you are not draining the node, there's no impact. So unschedulable is fine.

If the ingress controller is eTP: Cluster the node will go into connection-draining, but the pod itself is still Ready, and other nodes can still target it (cordon node A; external LB -> node B -> pod on node A). If you then drain the node, the pod itself has termination grace period. If you are not draining the node, there's no impact. So unschedulable is fine.

My point was that your concern "It would kill ingress connectivity to the entire cluster" is untrue.

@danwinship (Contributor), Feb 4, 2023

If the ingress controller is eTP: Local it will not go into draining mode - the HC will pass because this node still has endpoints.

The KEP says above that for eTP:Local the health should be based on both endpoints and kube-proxy health, doesn't it?

Though I can't actually come up with a scenario where that makes more sense than depending just on endpoint readiness/count...

One idea I had is that if kube-proxy has stopped getting updates from the apiserver, then it won't be able to know if the local endpoints are terminating/unready, so it might keep accepting connections, only to have the endpoint pod abruptly die. So in that case it would be good to have the LB stop sending to the pod when kube-proxy reports unhealthy. EXCEPT... kube-proxy doesn't actually report that it's unhealthy if it stops getting updates from the apiserver, so...

EDIT: oh, right, I was only thinking about cases where you had a working endpoint and then kube-proxy stopped working, but the problem is the other way, as Alexander explains below in Motivation: we don't want to claim an eTP:Local service is healthy/available if it has local endpoints but kube-proxy has not managed to create the rules forwarding to it.

Though I guess that doesn't affect the "draining" case...

Member Author

The KEP says above that for eTP:Local the health should be based on both endpoints and kube-proxy health, doesn't it?

So, yes: like Dan said. This is what this KEP aims to change. And in that case we fail the HC when the node is only cordoned (unschedulable), causing all new connections to the ingress controller sitting on that Node to fail. And people might choose to put a node in the unschedulable state for a variety of reasons, many of which I can't even imagine while I'm writing this. That's why I find it scary.

I think my company actually does that at one point when we upgrade Kubernetes. We manage the node pools where our workloads run and we do this in manual mode: when we need to upgrade them we create a new node pool with version N + 1, then we cordon all existing Nodes in the cluster, but don't evict them; we call some service which will evict them later. But we expect ingress connectivity to work on the unschedulable nodes until that eviction service has kicked in and decided to trigger a restart (which might be minutes / hours)... killing ingress to these workloads for that time would be bad.

Member Author

I really hate to be advocating for this, but.....what if we use the ToBeDeletedTaint... for this? I mean, we are really talking about CA downscaling here and trying to find closely related fields on the Node object which the CA also manipulates during downscaling to indicate "the node is going away". But the most explicit field for that right now is ToBeDeletedTaint... it has less ambiguity to it than unschedulable....

Member

I think there's some crossed wires. We have two different types of health signals:

  1. the kube-proxy's own health (currently based on recency of DP updates).
  2. whether a given service has endpoints on this node or not.

We (well, some clouds) use the first as LBHC for eTP:Cluster services.
Presumably everyone uses the second as LBHC for eTP:Local.

Agree so far?

I'm advocating, first and foremost, that we expand eTP:Cluster LBHC to include some other "eager" signal (e.g. unschedulable) that the node is likely to go away soon. This has no bearing on the availability of the pods on that node, only on "thru traffic" which uses the node as a 2nd-hop LB.

I'm advocating, though less strenuously, that we expand eTP:Local LBHC to include other indication(s) of the node's ability to handle traffic (e.g. DP recency). I am NOT advocating to include unschedulable for this. This could impact the availability of the pods on that node, so we must choose wisely (or choose to ONLY consider endpoints-available).

IOW:

func etpClusterHealth() bool {
    return dataplaneHealthy() && !nodeGoingAway()
}

func etpLocalHealth(svc string) bool {
    return haveEndpoints(svc) && dataplaneHealthy()
}

nodeGoingAway() can be as simple as return node.spec.unschedulable but there's no reason not to make it consider more, I think.

  • unschedulable is often followed by drain and delete
  • ToBeDeleted is usually followed by drain and delete, but is also usually set at the same time as unschedulable
  • deletionTimestamp clearly indicates impending death
  • The "exclude from LB" label should cause the controller to remove the node, but we could also use it here
  • ...

Does that help?
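For illustration only (not part of the KEP or this PR): a minimal sketch of how the helpers in the pseudocode above could be fleshed out. The taint key, the label key, and the staleness check below are assumptions made for the example.

// Sketch only: possible shapes for the helpers used in the pseudocode above.
package proxyhealth

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

const (
	// Cluster-autoscaler's deletion taint (assumed key).
	toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"
	// "Exclude from load balancers" label (assumed key).
	excludeFromLB = "node.kubernetes.io/exclude-from-external-load-balancers"
)

// nodeGoingAway aggregates "eager" signals that the node is likely to go away soon.
func nodeGoingAway(node *v1.Node) bool {
	if node.Spec.Unschedulable || node.DeletionTimestamp != nil {
		return true
	}
	if _, excluded := node.Labels[excludeFromLB]; excluded {
		return true
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

// dataplaneHealthy approximates a staleness check: the proxy is healthy if it
// has applied the most recently queued update, or that update is still newer
// than the allowed timeout.
func dataplaneHealthy(lastQueued, lastApplied time.Time, timeout time.Duration) bool {
	return !lastApplied.Before(lastQueued) || time.Since(lastQueued) < timeout
}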

Member Author

Yes, that makes sense.

@aojea (Member)

aojea commented Feb 5, 2023

I'm +1 on the KEP, addressing the feedback from the comments

But I strongly suggest that this has to be tested and guarded with a feature gate, and graduate based on the results of those tests; the last change to the load balancer logic had to be reverted (kubernetes/kubernetes#112807) and caused a significant impact in GCP

@thockin (Member)

thockin commented Feb 7, 2023

Freeze is < 48 hours.

@wojtek-t (Member)

wojtek-t commented Feb 8, 2023

I am ready to re-read when you push, just let me know.

@thockin - fwiw, it's not even tracked for the release yet
[we should lead opt-in if we're planning it for 1.27]


###### Are there any tests for feature enablement/disablement?

- Validate that TCP ingress connections are gracefully terminated when:
Member

Please clarify that those tests will be added (they don't yet exist).

Member

we can not test graceful termination since that will depend on the LB implementation ... we can test that a drained node will not receive new connections from a load balancer (its health check will start to fail)

Member

These are very difficult tests to automate and require flipping nodes to unschedulable, waiting for HCs to fail, and then coming in through an external LB. Easier to do manually, but not as robust. Do standard e2es allow this sort of disruption?

@aojea can help design good tests here. :)
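A rough sketch (illustrative only, not the test that was eventually written) of how such a flow could be automated with client-go: cordon the node, then poll kube-proxy's health endpoint until it starts failing. It assumes the test runner can reach the node directly on kube-proxy's default healthz port (10256); a real e2e test would also need to verify behaviour through the external LB.

// Sketch only: cordon a node, then wait for kube-proxy's health check to fail.
package e2esketch

import (
	"context"
	"fmt"
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func cordonAndWaitForHCFailure(ctx context.Context, cs kubernetes.Interface, nodeName, nodeIP string) error {
	// 1. Mark the node unschedulable (equivalent of `kubectl cordon`).
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// 2. Poll the node-level health check until it reports not-ready.
	httpClient := &http.Client{Timeout: 2 * time.Second}
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			resp, err := httpClient.Get(fmt.Sprintf("http://%s:10256/healthz", nodeIP))
			if err != nil {
				return false, nil // transient errors: keep polling
			}
			defer resp.Body.Close()
			return resp.StatusCode != http.StatusOK, nil
		})
}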

- Have kube-proxy accurately reflect its `healthz` state in its response to LB HCs
for `eTP:Local` services.

- Offer a better capability of connection draining terminating Nodes, for load
Member

As we discussed on slack, I am concerned that changing the current /healthz behavior will have bad effect.

kube-proxy's /healthz serves two distinct purposes. First, it indicates whether the kube-proxy is "healthy" (according to our internally defined metric of data-plane staleness). Some people run kube-proxy as a Pod, and presumably point the livenessProbe at /healthz (though it looks like kubeadm doesn't set up liveness for kube-proxy).

Second, we point cloud LB healthchecks at /healthz to indicate "is this node a suitable 1st-hop backend (basically a router) for this LB?".

We're talking about changing /healthz to consider both data-plane staleness AND some other thing(s) that would indicate this node is NOT a suitable 1st-hop backend (whether we use unschedulable or the ToBeDeleted taint or even node deletionTimestamp).

Changing /healthz will affect both of the above. Do we really want kube-proxy to be restarted when the node becomes unschedulable? That doesn't seem right. A node drain might take quite a while, and during that time the kube-proxy will fail liveness repeatedly and be restarted.

The advantage of making /healthz consider unschedulable (or whatever) is that clouds already point LBs at /healthz, so they will all get better "for free".

The alternative is to leave /healthz as it is and add /lbhealthz or something which is specifically geared for LB HCs. The downside is that cloud code needs to be updated and every existing LB needs to be updated to use it.

Another option could be to add a /livenessz URL with the "older" semantics and get people to use that for liveness. That's strictly a break, but I can't honestly tell if anyone sets up liveness for kube-proxy. kubeadm does not. GKE does not.
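To make the two-URL idea concrete, here is a minimal sketch (assumed structure, not kube-proxy's actual server code) of how a single health server could expose both semantics: an LB-facing /healthz that also fails when the node is going away, and a /livez that keeps the historical data-plane-staleness-only semantic for liveness probes. dataplaneHealthy and nodeGoingAway are assumed helpers along the lines of the earlier pseudocode.

// Sketch only: split LB health (/healthz) from liveness (/livez).
package proxyhealth

import "net/http"

func newHealthMux(dataplaneHealthy, nodeGoingAway func() bool) *http.ServeMux {
	mux := http.NewServeMux()

	// LB health: fail when the data plane is stale OR the node is being
	// drained/terminated, so cloud LBs stop sending new connections here.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if dataplaneHealthy() && !nodeGoingAway() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// Liveness: keep the historical semantic (data-plane staleness only), so
	// kube-proxy is not restarted just because its node was cordoned.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		if dataplaneHealthy() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	return mux
}

With a split like this, a livenessProbe can target /livez while existing cloud LB health checks keep pointing at /healthz and pick up the new behaviour without any cloud-side change.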

Member

Some people run kube-proxy as a Pod, and presumably point the livenessProbe at /healthz (though it looks like kubeadm doesn't set up liveness for kube-proxy).

we pushed back hard on that in kubernetes/kubernetes#75323 ;), but it is a possibility indeed

Member

IMO that argues for a calculated risk - using healthz for LBs is better all around and the risk of it breaking seems low.

How does this sound:

  • refine /healthz as described in this KEP
  • add /livez with the "historical" semantic
  • (maybe) add /lb-healthz as an alias to /healthz with guidance to prefer this URL for clarity

?

Member Author

I am not sure it's worthwhile for us to go through all the hoops of doing these points:

  • add /livez with the "historical" semantic
  • (maybe) add /lb-healthz as an alias to /healthz with guidance to prefer this URL for clarity

As you said: cloud providers need to change their HC to benefit from this improvement (which, many could argue, is in itself small / an optimization). And cloud providers won't change their HC per service proxy (i.e. one URL for Kube-proxy, another for Cilium, etc). So for any of that to bear fruit we need to also recommend and convince other service proxies to do the same....that feels so far away that it's practically impossible.

Given that I feel we just do:

  • refine /healthz as described in this KEP

and allow the outcome of calculated risks to surface as the KEP progresses through the different stages. If someone is configuring their Kube-proxy with a livenessProbe targeting /healthz, then we can decide to drop the feature at that point. That should, at worst, happen no later than Beta, when the feature is on by default for everyone.

Member Author

If we think the risk is so significant, we can even extend the Beta phase?

Member Author

On top of that: this issue should surface quite quickly. Nodes go unschedulable/tainted all the time, and if the effect of that is CrashLooping Kube-proxies, someone is likely to report it....I hope :/

Member

The reason I like /livez is because it's low-effort and gives a clear path for setting up livenessProbe for kube-proxy (which we really SHOULD have). And it doesn't need clouds to update anything except DaemonSet configs.

It's sort of a "while we're here, let's fix this" opportunity.

Member Author

OK, that's a better argument I guess :)

I will try to add that tomorrow morning.

Member

+1 for adding livez/ as an additional thing in this KEP



@alexanderConstantinescu (Member Author)

alexanderConstantinescu commented Feb 8, 2023

@ all: I've updated the KEP:

main changes include:

#### e2e tests

- Add E2E tests for connectivity to services on terminating nodes and
validate graceful termination of TCP connections.
Member

@aojea to elaborate on approach

@thockin (Member) left a comment

All my comments are minor now

Setting that field is usually followed by eviction and Node termination, but
it doesn't have as strong of a direct link to the termination/deletion of the
Node, as the taint does. Users can for example decide to cordon all nodes
at a given moment in time. Using `.spec.unschedulable` as this signal
Member

We can debate this in more detail later (doesn't need to block KEP, I think), but ONLY handling CA's taint means we don't get any benefit if not using CA. I just don't think that is sufficient.

As of now we do not have a clear answer to the question we are asking, so we will have to approximate.

Member

+1

it would be good to add figuring this out as a GA criterion, maybe?

Member Author

OK, we should figure out the field before going to Beta, right? We don't want to enable things by default and not be sure that it's right, right?

@thockin (Member)

thockin commented Feb 8, 2023

/approve

SIG-approved - @wojtek-t if you have time for PRR, if not, mea culpa.

@wojtek-t (Member) left a comment

Just minor comments from me.

I will wait for the updates to be done (especially the livez one) and approve later today.


@alexanderConstantinescu (Member Author)

Updated again, changes:

  • addition of /livez path for Kube-proxy

What I didn't address:

  • A more elaborate approach w.r.t. how we'll do E2E testing for this - I think we can work out the specifics once we start that effort?

@wojtek-t (Member) left a comment

Two minor nits - but I'm approving so that you're not blocked on me, given the deadline later today.

@wojtek-t (Member)

wojtek-t commented Feb 9, 2023

/approve PRR

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2023
@wojtek-t (Member)

wojtek-t commented Feb 9, 2023

/lgtm
