
DWD not working as expected on some infrastructures where traffic to LB stays in-cluster (gets rerouted internally) #85

Closed
vlerenc opened this issue May 22, 2023 · 15 comments · Fixed by #94
Labels: kind/bug Bug · priority/1 Priority (lower number equals higher priority)

@vlerenc
Member

vlerenc commented May 22, 2023

Reminder for a recent finding; details are still missing here.
cc @unmarshall @ashwani2k

If memory serves, @ScheererJ and colleagues observed that DWD does not actually reach out to/through the LB on any infrastructure but AWS, which makes the test (API server reachable externally) not work as expected: it is reduced to an internal check, which cuts out exactly the problems for which we implemented the feature in the first place, e.g. a broken Istio ingress-gateway.

@vlerenc vlerenc added the kind/bug Bug label May 22, 2023
@ScheererJ
Member

To provide a little more context:

Dependency watchdog tries to check the API server via its external domain name. Depending on the infrastructure, the corresponding kubernetes service contains either a domain name (AWS) or an IP address (other infrastructures). If the kubernetes service contains an IP address, kube-proxy (or whichever component is responsible for service routing) may short-cut the connection directly to the internal endpoint, i.e. the istio-ingress-gateway. (As service routing is usually done based on IP addresses, there is no short-cut if the kubernetes service uses a domain name.)
This results in a different path being taken than what was expected.

On AWS it looks like this:

DWD -> AWS Loadbalancer -> Istio Ingress Gateway -> Kube API Server

On other infrastructures:

DWD -> Istio Ingress Gateway -> Kube API Server

Either way, it is different from the internal dependency watchdog probe:

DWD -> Kube API Server

Depending on what is desired, this might or might not be sufficient. Loadbalancer issues can only be found on infrastructures with domain names in the kubernetes service (AWS). Istio issues should be visible everywhere. Problems in the shoot network, e.g. firewall rules, or between shoot and seed are never detectable in this approach.
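
A quick way to see which of the two external paths applies on a given infrastructure is to check whether the load balancer Service exposes a hostname or an IP. Below is a minimal sketch in Go with client-go; the Service name and namespace (istio-ingressgateway in istio-ingress) are assumptions for illustration, not necessarily the ones used in a real seed:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client against the seed cluster from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespace and name are illustrative; adjust to the actual ingress gateway Service in the seed.
	svc, err := client.CoreV1().Services("istio-ingress").Get(context.TODO(), "istio-ingressgateway", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	for _, ing := range svc.Status.LoadBalancer.Ingress {
		switch {
		case ing.Hostname != "":
			// AWS-style LB: a DNS name that kube-proxy cannot short-cut, so traffic really leaves the cluster.
			fmt.Println("LB exposes a hostname:", ing.Hostname)
		case ing.IP != "":
			// IP-style LB: kube-proxy may rewrite traffic to this IP directly to the gateway pods.
			fmt.Println("LB exposes an IP:", ing.IP)
		}
	}
}
```

If the Service status only contains an IP, kube-proxy can program rules for that address and the short-cut described above becomes possible.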

@unmarshall
Contributor

unmarshall commented May 31, 2023

MOM

Attendees: @ashwani2k, @ScheererJ, @unmarshall and @rishabh-11

We discussed the issue of kube-proxy path optimization in the context of DWD, and @ScheererJ has already left a comment above describing the problem that we see today. It seems that this issue has been there for a while now, since kube-proxy path optimization is not a new feature.

Possible solutions that we discussed:

NOTE: These are only brain-storming ideas and it's possible that some of them are just not practical or do not make sense. The purpose is only to capture what was discussed.

Option-1
We already have a blackbox-exporter pod running in the kube-system namespace in the shoot cluster. Thanks to @istvanballok I got a better understanding of how it communicates with the Kube-ApiServer running in the control plane of the shoot cluster. Istvan was kind enough to explain it with the following diagram (it might not be complete but gives an overall picture quite nicely):
[diagram: how a pod in the shoot reaches the Kube-ApiServer in its control plane]

There are essentially two ways a pod running in the shoot can reach its KAPI:

  1. Use the default kubernetes service which runs in the default namespace. You can also get more info about this service here.
  2. Use the external DNS name for the KAPI.

blackbox-exporter currently makes use of (1). As shown in the diagram, Prometheus running in the control plane namespace of the shoot (in the seed) makes a call to the KAPI, which then redirects the call via a VPN tunnel to eventually reach the blackbox-exporter pods, which in turn call the KAPI via the default k8s service (via PROXY).

NOTE: Along the same lines, we also have another component, network-problem-detector, which also runs in the kube-system namespace of the shoot.
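
For illustration only, a probe over path (1) from inside a shoot pod could look roughly like the sketch below; the use of in-cluster credentials and the /healthz endpoint are assumptions for this sketch, not a description of what blackbox-exporter actually does:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// probeKAPI checks whether the kube-apiserver answers via the default
// kubernetes service (kubernetes.default.svc), i.e. path (1) above.
func probeKAPI(ctx context.Context) error {
	// The in-cluster config resolves KUBERNETES_SERVICE_HOST/PORT, i.e. the default service.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// A plain GET on /healthz is enough to tell whether the KAPI answers at all.
	return client.Discovery().RESTClient().Get().AbsPath("/healthz").Do(ctx).Error()
}

func main() {
	if err := probeKAPI(context.Background()); err != nil {
		fmt.Println("KAPI not reachable via the default service:", err)
		return
	}
	fmt.Println("KAPI reachable via the default service")
}
```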

Pros:

  • We already have a component running in the shoot cluster that can be leveraged to know whether the KAPI is reachable from the shoot.

Cons:

  • There are several points of failure: the VPN (which today is not HA); since the default k8s service uses a different port, it is possible that customers block that port; and of course the blackbox-exporter itself can be down (which would happen in a non-HA cluster setup). It is also possible that there are network issues at the node level (local to a node), but that would result in no response from the blackbox-exporter.
  • The impact of depending on blackbox-exporter needs to be well understood: even in case of false positives, DWD will scale down KCM, MCM and CA. Today no such impact is seen because the findings from blackbox-exporter are only consumed by Prometheus.
  • If the VPN pod is healthy but the network is broken, then there is no automated replacement/restart of the VPN pod. This would lead to a longer connectivity downtime, with DWD concluding that the KAPI is no longer reachable, and it will trigger a scale down.

Option-2

@ScheererJ suggested as an alternative to do something similar to what a BTP availability service does: essentially, try to have a probe running in another seed instead of in the shoot. The issue is that this would not check the network path from the shoot to its control plane KAPI. However, even today (prior to knowing that the external probe does not work), DWD was only checking the route DWD -> LB -> Istio Ingress Gateway -> KAPI, which is also not an accurate representation of the network path from the shoot. Since I do not have much information on what the BTP availability service does exactly, I will not list pros and cons. More details are sought.

Option-3

Have a brand new component that is set up in an HA manner and runs, possibly, in the kube-system namespace of the shoot cluster. It periodically updates a lease, but at a much higher frequency than the kubelet does (refer to nodeLeaseDurationSeconds in https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/). DWD watches the leases, and if it finds that many leases (defined via a threshold) have expired and no renewal happens within a timeout, it starts to scale down KCM, CA and MCM.

This option was only mentioned briefly but not completely thought through.
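
For the record, a rough sketch of what such a lease renewer could look like with client-go; the lease name, namespace, duration and renew interval below are made up for illustration and are not a proposal:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const (
	leaseNamespace = "kube-node-lease" // assumption: reuse the node-lease namespace
	renewInterval  = 2 * time.Second   // much more frequent than the kubelet's 10s default
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := os.Getenv("NODE_NAME") // assumption: injected via the downward API
	leaseName := "dwd-" + nodeName
	duration := int32(10)
	leases := client.CoordinationV1().Leases(leaseNamespace)

	for {
		now := metav1.NewMicroTime(time.Now())
		lease, err := leases.Get(context.TODO(), leaseName, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			// First run on this node: create the lease.
			lease = &coordinationv1.Lease{
				ObjectMeta: metav1.ObjectMeta{Name: leaseName, Namespace: leaseNamespace},
				Spec: coordinationv1.LeaseSpec{
					HolderIdentity:       &nodeName,
					LeaseDurationSeconds: &duration,
					RenewTime:            &now,
				},
			}
			_, err = leases.Create(context.TODO(), lease, metav1.CreateOptions{})
		case err == nil:
			// Renew: only the renew time changes.
			lease.Spec.RenewTime = &now
			_, err = leases.Update(context.TODO(), lease, metav1.UpdateOptions{})
		}
		if err != nil {
			// A failed renewal is exactly the signal DWD would later react to; just log and retry.
			log.Printf("could not renew lease %s: %v", leaseName, err)
		}
		time.Sleep(renewInterval)
	}
}
```

The point of a separate lease (rather than the node lease) is only that renewals can be much more frequent and independent of the kubelet.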


Overall it is clear that we currently do not have a foolproof way to know if the connectivity to the KAPI from the shoot cluster is broken. Therefore we need to discuss and evaluate the next best thing, considering that what we have had until now was also far from perfect.

@ScheererJ, @ashwani2k, @rishabh-11 - feel free to add anything that I might have missed to capture, or make corrections if something has been stated incorrectly or inaccurately.

@rishabh-11 rishabh-11 added the priority/1 Priority (lower number equals higher priority) label Jun 8, 2023
@unmarshall
Contributor

unmarshall commented Jun 13, 2023

MOM

Attendees: @ashwani2k, @elankath, @rishabh-11, @unmarshall

All attendees were briefed about the problem, and the pros and cons of all the options documented above were discussed.

Conclusion:

  • All of the above solution options (which involve depending on an agent/component deployed in the shoot) are affected by a variety of failures that can happen in the data plane (shoots), e.g. failure of the VPN, node-local network failures, zonal network failures etc. It is therefore hard for DWD to distinguish the reasons for non-reachability of the KAPI from shoot components - be it network-problem-detector or blackbox-exporter. These failures are out of our control as they happen on the shoot cluster.
  • We evaluated the option to do the external DNS lookup from another DWD in a buddy seed. Essentially, the internal probe continues to be done locally, and if that is successful then the external query status is retrieved/pushed/updated from/by another DWD running in another seed. This ensures that the external probe always hits the LB. However, we also discussed the cons:
    • Needs at least 2 seeds for this to work.
    • Complexity involved in selecting the buddy seed.
    • When shoot control planes are migrated from one seed to another using CPM, this would also impact the buddy seed selection for the migrated shoot's KAPI.
    • It is possible that the buddy seed is in another region, which adds latency and additional chances of failure to even reach it or hear from it.
  • Look at the networking stack and see if kube-proxy can be prevented from path optimisations selectively for DWD. This will take some research, and it would also be an issue once proxy-less routing is done via Cilium, where kube-proxy is not involved; in that case this solution would no longer work.

We concluded that we will look further into checking if kube-proxy path optimisations can be applied conditionally.

@elankath

/assign @elankath

@vlerenc
Member Author

vlerenc commented Jun 28, 2023

And if not, we can accept @ScheererJ's summary:

Depending on what is desired, this might or might not be sufficient. Loadbalancer issues can only be found on infrastructures with domain names in the kubernetes service (AWS). Istio issues should be visible everywhere. Problems in the shoot network, e.g. firewall rules, or between shoot and seed are never detectable in this approach.

At least Istio issues are detected, and that was a, if not the, main motivation we had/have.

@vlerenc
Member Author

vlerenc commented Jun 28, 2023

If there is no simple way to avoid the path optimisations and we do not want to accept the above, I actually like option 3 the most (new component in the shoot, leases checked by DWD) as it most closely mimics what the kubelet itself does.

@MartinWeindel
Member

With respect to option 3 there was also an enhancement KEP for that purpose: kubernetes/enhancements#589
With Kubernetes 1.17 node leases have been introduced (GA). Are they not sufficient for that purpose?

  • The kubelet creates and then updates its Lease object every 10 seconds
    (the default update interval). Lease updates occur independently from the
    NodeStatus updates.

Example:

k -n kube-node-lease get leases

@unmarshall
Contributor

unmarshall commented Aug 31, 2023

With Kubernetes 1.17 node leases have been introduced (GA). Are they not sufficient for that purpose?

As discussed during our meeting, we considered using node leases, but the actor (kubelet) that updates the node lease is not very reliable; the kubelet can go down for a lot of other reasons. We wanted a very lightweight component which registers a lease. Semantically it is similar to a node lease and is in fact a duplicate object per node that we propose to create. If we wish to revisit that decision, we can.

@ScheererJ
Member

Is dependency watchdog's purpose not to prevent actions, e.g. node deletion, that are based on exactly this? Nodes are marked as NotReady based on the lease objects. Therefore, I find it quite logical for dependency watchdog to monitor this resource as well. It needs to act earlier than the other components, though.

@unmarshall
Contributor

unmarshall commented Oct 20, 2023

We finalised on leveraging network-problem-detector as it already has functionality that probes using both the pod and the host network, and we could enhance it by adding a Job (NWPD terminology, not a K8s Job) that periodically renews a lease. DWD listens for Node events, and upon node registration it creates a dwd-node-lease. This is the same lease that will then be renewed by a specific goroutine (Job) inside NWPD using the host network of the node. Thanks to @MartinWeindel for the help and for the discussions.

However, we learnt today that NWPD is deployed as an extension and that extensions can be disabled. Live issue#3891 suggests that one customer already did that. This makes it very hard to rely on NWPD as a component: for any cluster where this extension is disabled, renewals will never happen. This would be a false positive for DWD, which would then scale down the MCM, CA and KCM deployments to 0.

I checked with @timuthy and he confirmed that today there is no way to enforce an extension.

@timuthy
Member

timuthy commented Oct 24, 2023

As discussed with @unmarshall and @ScheererJ, even if there were an option to enforce NWPD to be enabled for all shoots, we would have an unwanted dependency of a Gardener core feature (DWD) on an extension (NWPD) that is not guaranteed to be installed in a landscape and is theoretically exchangeable.

We mainly see two options at the moment:

  1. Make NWPD a main feature of gardener/gardener, so that DWD can piggy-back on the probes. There will probably never be an alternative implementation of NWPD and one can argue it's reasonable to offer this as an opt-out core feature for all shoots. If opted out, DWD proper will be disabled as well.
  2. DWD deploys its own DaemonSet which executes the probes and thus is independent from NWPD.

I personally prefer the second option; it comes with a certain overhead, but offers a clear segregation. Maybe NWPD can even piggy-back on the DWD probes, once implemented?! (cc @MartinWeindel, @vlerenc)

@unmarshall
Contributor

MOM
Attendees: @MartinWeindel, @timuthy, @ScheererJ, @elankath, @unmarshall, @rishabh-11
Summary:

  • The reason to consider using another lease (other than the node lease) was to decouple the discovery of network connectivity issues from the node lease expiry period (allowing a higher check frequency) and also to not depend on the kubelet, which can fail to update the lease for various reasons. One case pointed out was a heavily loaded Node with the kubelet being overloaded as a consequence. There have also been cases of kubelet crashes, which have been reported upstream as well.
  • It was suggested that in the first iteration we keep it simple and only look at node leases. Node leases do get renewed well before their expiry. So DWD could set a threshold and watch whether the majority of the nodes fail to renew their lease within that threshold (but before the node lease expiry). If that is the case, it scales down KCM, MCM and CA. This will prevent multiple nodes going into the Unknown state. Before MCM and CA react, DWD has sufficient time to act: their health-timeout is today set to 10 minutes by default.

Conclusion:
For the first iteration we go with watching node leases in DWD and bringing down KCM, MCM and CA if the majority of the nodes are delayed beyond a threshold in renewing their lease. The caveat is that there is still a chance that the majority of the nodes cross the threshold but eventually do update their lease before its expiry; this case we will perhaps not be able to handle. But it is expected that the delay beyond the threshold would not last, and eventually DWD will be able to scale KCM, MCM and CA back up.
We will have to make the threshold configurable, as we do not know yet what a good value would be. Based on experience we will have to make adjustments.
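
As a rough sketch of this first iteration: the check over the node leases could look like the snippet below, where the threshold and the simple majority rule are placeholders for the configurable values mentioned above, and the actual scale-down of KCM, MCM and CA is left out:

```go
// Package nodelease sketches the check DWD could run over the node leases.
package nodelease

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// renewalDelayed reports whether a node lease has not been renewed for longer
// than the given threshold (which may still be before its actual expiry).
func renewalDelayed(renewTime metav1.MicroTime, threshold time.Duration, now time.Time) bool {
	return now.Sub(renewTime.Time) > threshold
}

// MajorityOfLeasesDelayed lists all node leases and reports whether more than
// half of them are delayed beyond the threshold. In that case DWD would scale
// down KCM, MCM and CA (the scale-down itself is not shown here).
func MajorityOfLeasesDelayed(ctx context.Context, client kubernetes.Interface, threshold time.Duration) (bool, error) {
	leases, err := client.CoordinationV1().Leases("kube-node-lease").List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	if len(leases.Items) == 0 {
		return false, nil
	}
	now := time.Now()
	delayed := 0
	for _, l := range leases.Items {
		// A lease without a renew time is treated as delayed as well.
		if l.Spec.RenewTime == nil || renewalDelayed(*l.Spec.RenewTime, threshold, now) {
			delayed++
		}
	}
	fmt.Printf("%d of %d node leases delayed beyond %s\n", delayed, len(leases.Items), threshold)
	return delayed*2 > len(leases.Items), nil
}
```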

@rfranzke
Member

With gardener/gardener#8023, there is also the option to add a new controller to gardener-node-agent which can perform some network connectivity checks and report them somehow.

@gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all issues.
This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Mark this issue as rotten with /lifecycle rotten
  • Close this issue with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@unmarshall
Contributor

/remove-lifecycle stale

@gardener-prow gardener-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024