
tainting and untainting logic implemented via configuration #565

Closed
wants to merge 18 commits

Conversation

bilalcaliskan

Which issue(s) this PR fixes:

This PR fixes issue #457.

What this PR does / why we need it:

This PR adds the ability to conditionally taint and untaint a node under specific circumstances. With this improvement, node-problem-detector can be used in conjunction with descheduler. Users should specify taintEnabled, taintKey, taintValue, and taintEffect in config/kernel-monitor.json. If not specified, taintEnabled defaults to false, so NPD will not taint any node. With this improvement, node-problem-detector also removes the taint once the problem is resolved.
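
For illustration only, here is a minimal sketch of what such a configuration might look like in config/kernel-monitor.json. The field names are the ones listed above; their placement and the example values are assumptions, not necessarily what this PR implements:

```json
{
  "source": "kernel-monitor",
  "taintEnabled": true,
  "taintKey": "node-problem-detector/kernel-issue",
  "taintValue": "true",
  "taintEffect": "NoSchedule"
}
```

(The taint key, value, and effect shown are made-up examples; only the four field names come from this PR.)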

Special notes for your reviewer:

This improvement requires updating the ClusterRole for node-problem-detector. If this PR gets merged to master, the update verb must be added to the node-problem-detector ClusterRole.
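
As a rough sketch (assuming taints are applied by updating or patching the Node object; this is not the exact manifest from the deployment), the ClusterRole's node rule would need something like:

```yaml
# Sketch only: the verbs beyond get/list/watch are what tainting via the
# Node object would require; the real manifest lives in the deployment files.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-problem-detector
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "update", "patch"]
```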

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 22, 2021
@k8s-ci-robot
Contributor

Hi @bilalcaliskan. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 22, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 22, 2021
@bilalcaliskan bilalcaliskan force-pushed the master branch 2 times, most recently from 8d48e33 to e204891 on May 22, 2021 20:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 22, 2021
@bilalcaliskan bilalcaliskan reopened this May 22, 2021
@bilalcaliskan
Author

I have modified the commit message, so there has been a lot of activity since the PR was opened; sorry for that.

@bilalcaliskan
Author

/retest

@k8s-ci-robot
Contributor

@bilalcaliskan: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bilalcaliskan
Author

/retest

@k8s-ci-robot
Contributor

@bilalcaliskan: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bilalcaliskan
Author

/assign @xueweiz @andyxning

@Random-Liu
Member

Random-Liu commented Jun 25, 2021

We discussed this a long time ago when we first started NPD.
The taint can be applied by NPD itself, or by another controller that takes NPD-generated conditions and events as its signal.
Having a separate controller is the pattern most people use right now, because a cluster-level controller has a global view and a better understanding of whether it is OK to taint a node or not.
For example, if 5 out of 10 nodes have an issue, you may not want NPD on each node to taint its node; that may leave a lot of pods with nowhere to run. Instead, a controller could drain the bad nodes one by one and recreate/repair them.

@azman0101

According to the Remedy Systems section, NPD is able to add a condition to nodes that will eventually get them tainted, and then descheduler will evict pods that don't tolerate the taint.

Or we could take advantage of Taint-Based Eviction instead of descheduler.
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Currently, descheduler supports RemovePodsViolatingNodeTaints and NPD can add conditions; nevertheless, there is no way to rely on only those two components to drain nodes automatically. Draino is still required, am I right?
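
For reference, a minimal descheduler policy enabling that strategy might look like the sketch below (v1alpha1 policy format; the taint itself still has to be applied by something else, e.g. Draino or NPD with this PR):

```yaml
# Minimal sketch of a descheduler policy using RemovePodsViolatingNodeTaints.
# It only evicts pods that do not tolerate existing NoSchedule taints;
# it does not add any taints itself.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true
```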

@Random-Liu
Member

Conditions are mostly informative.

Taints will actually affect scheduling decisions, e.g. not scheduling any new pods to a node, or evicting running pods from a node. That kind of decision should be made by a cluster-level controller; otherwise, in the extreme case where half of the nodes decide to taint themselves and evict pods, the cluster may not have enough resources to run those pods.

@azman0101

azman0101 commented Jun 30, 2021

Conditions are mostly informative.

Taints will actually affect scheduling decisions, e.g. not scheduling any new pods to a node, or evicting running pods from a node. That kind of decision should be made by a cluster-level controller; otherwise, in the extreme case where half of the nodes decide to taint themselves and evict pods, the cluster may not have enough resources to run those pods.

I agree 👍
I'm just saying that the Remedy Systems section is misleading, because it says that NPD + descheduler can do the job together.

I want to avoid using Draino, so I wonder how to taint by custom condition 🤔

@alexispires

NPD can work with the descheduler only via the predefined conditions (Ready, MemoryPressure, ...). So the current design is, in my opinion, definitely limited. I understand the decision to split responsibility, but the documentation is misleading.

@bilalcaliskan
Author

I agree with @azman0101: descheduler does not do the tainting by itself. According to the Remedy Systems section we can use descheduler for that purpose, but sadly I guess there is only one option for the tainting part, and it's Draino.

@Random-Liu As far as I know, descheduler does the job only if the expected taints are already on the node. So we would have to use three different components to taint nodes on a specific NodeCondition (NPD, Draino, and descheduler in that scenario)? I guess that's too much effort, especially if you run more than 10 Kubernetes clusters in production.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2023
@vfiftyfive

What is the status on this?

@bilalcaliskan
Author

There is no more work left on the development side, I guess, but it is waiting for a maintainer review.

@sennerholm

@btiernay Could you help find a maintainer who has the time to push this over the finish line?

@btiernay
Contributor

btiernay commented Nov 1, 2023

@andyxning @wangzhen127 @xueweiz @vteratipally @mmiranda96 Are you available to help review this great addition and help to push it over the finish line? 🙏

@btiernay
Contributor

btiernay commented Nov 8, 2023

Kindly requesting your review, one more time 🙏. Thank you 🙇.

@sebastiangaiser

Hey, what is the current status of this PR? Would be great to see this merged 🙏🏻

@btiernay
Contributor

btiernay commented Apr 1, 2024

/retest-required

@k8s-ci-robot
Contributor

@bilalcaliskan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-npd-e2e-kubernetes-gce-ubuntu-custom-flags | c3bea3b | link | true | /test pull-npd-e2e-kubernetes-gce-ubuntu-custom-flags |
| pull-npd-e2e-kubernetes-gce-ubuntu | c3bea3b | link | true | /test pull-npd-e2e-kubernetes-gce-ubuntu |
| pull-npd-e2e-kubernetes-gce-gci | c3bea3b | link | true | /test pull-npd-e2e-kubernetes-gce-gci |
| pull-npd-e2e-kubernetes-gce-gci-custom-flags | c3bea3b | link | true | /test pull-npd-e2e-kubernetes-gce-gci-custom-flags |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@btiernay
Contributor

btiernay commented Apr 1, 2024

@bilalcaliskan Seems like the tests are failing due to build issues. Maybe it needs a rebase?

@jonasbadstuebner

@bilalcaliskan Would be really nice to have this!

@wangzhen127
Member

Should we close this PR without merging it?

I agree with @Random-Liu's #565 (comment). NPD does not necessarily want to affect scheduling decisions. We already have mechanisms like Taint Nodes by Condition, in which the node controller taints nodes based on node conditions. This should be the recommended model.
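
To illustrate that model: for the built-in conditions, the node controller itself adds well-known taints, which workloads can then tolerate or not. A node under memory pressure, for example, ends up with a taint roughly like this (node name is illustrative):

```yaml
# Taint added by the node controller for the built-in MemoryPressure condition;
# pods without a matching toleration will not be scheduled onto the node.
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  taints:
    - key: node.kubernetes.io/memory-pressure
      effect: NoSchedule
```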

@nvermande

That makes NPD pretty much useless IMO. Thank you for confirming. Plus, given that this PR has been open for 4 years, that raises even more concerns about using it in production.

@btiernay
Contributor

btiernay commented May 4, 2024

Agree with @nvermande's assessment. The lack of this capability makes NPD practically much harder to use, integrate, and be successful with. I can understand why actuation isn't in scope, but imo not having (un)tainting support means poorer ecosystem interop and more work for users.

@dogzzdogzz

Can we introduce an "enabled" flag to allow users to determine whether the node-problem-detector should taint a node and affect scheduling decisions? By default, this feature can be set to false. Additionally, we can provide the considerations in the README to help users evaluate whether to enable this feature.

@sebastiangaiser

sebastiangaiser commented May 11, 2024

@wangzhen127 the README suggests using Draino (no commits on the master branch for 4 years; IMO it should be removed from the README), Medik8s (brings its own ecosystem), and MachineHealthCheck (related to Cluster API) as remedy systems next to descheduler. The outdated Draino makes descheduler "useless" IMO. So to me, rejecting this PR (for ~3 years now) feels like this project is EOL.

in the extreme case half nodes decide to taint and evict pods, the cluster may not have enough resource to run those pods

True, but I think that is up to the user who enables this feature, so it should be documented. An additional idea would be to add a label to the nodes, like tainted-by: node-problem-detector, and verify that only a percentage or a fixed number of nodes can be tainted by node-problem-detector at the same time (see the sketch below).
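
A minimal sketch of that guard idea, assuming a hypothetical tainted-by=node-problem-detector label and a configurable budget (none of this is part of the PR; all names are illustrative):

```go
// Package taintbudget sketches the "cap how many nodes NPD may taint" idea
// from the comment above. Everything here is hypothetical and illustrative.
package taintbudget

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintedByLabel is a made-up label that NPD would set on nodes it taints.
const taintedByLabel = "tainted-by=node-problem-detector"

// CanTaintAnotherNode reports whether tainting one more node would keep the
// number of NPD-tainted nodes within maxTainted.
func CanTaintAnotherNode(ctx context.Context, client kubernetes.Interface, maxTainted int) (bool, error) {
	// Count nodes that already carry the (hypothetical) NPD taint label.
	tainted, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: taintedByLabel})
	if err != nil {
		return false, fmt.Errorf("listing NPD-tainted nodes: %w", err)
	}
	return len(tainted.Items) < maxTainted, nil
}
```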

...we already have examples like Taint Nodes by Condition...

True, but these are very basic IMO. Getting this PR merged would give users the possibility to define their own.

A few examples that quickly come to mind:

  • the CNI has problems
  • the BGP router gets misconfigured
  • broken node hardware when running on-premises

There are several use cases where this feature makes absolute sense without creating a new controller, because otherwise we could simply fork this project...

@wangzhen127
Member

wangzhen127 commented May 13, 2024

As far as I know, NPD has been used by several cloud providers and products in production for many years. The reasons why this approach is not recommended have been clearly stated previously. Adding this as an optional feature could work in some cases, but it could also be abused and eventually harm us. Given that several people feel strongly about this feature, I suggest bringing this issue to the wider community in the sig-node weekly meeting for feedback. Please let me know when you plan to discuss this. Thanks!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

  • cncf-cla: yes (the PR's author has signed the CNCF CLA)
  • lifecycle/rotten (the PR has aged beyond stale and will be auto-closed)
  • ok-to-test (a non-member PR verified by an org member as safe to test)
  • size/L (changes 100-499 lines, ignoring generated files)