
Add a section about the risks of priority and preemption #201

Closed · wants to merge 2 commits

Conversation

bsalamat
Member

Add a section about the risks of priority and preemption.

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 23, 2018
@bsalamat
Member Author

cc @davidopp

@justaugustus
Member

/cc @davidopp

is enabled by default.
Note that **it will be possible for users of the cluster to create pods that block some system daemons from running, and/or evict system daemons that are already running, by creating pods at the `system-cluster-critical` and `system-node-critical` priority classes, which are present in all clusters by default.** Please read the following information to understand the details. This is particularly important for those who have untrusted users in their Kubernetes clusters.

There are two kinds of critical system daemons in Kubernetes -- ones that run per-node as DaemonSets (e.g. fluentd, XXX list the rest of them here) and ones that run per-cluster (possibly more than one instance per cluster, but not one per node) (e.g. DNS, heapster, XXX list the rest of them here).
Member

I guess this should say "There are two kinds of critical system pods" (not daemons)

Member Author

Done.

Member

Fix the first XXX to list the other node-level critical system pods and the second XXX to list the other cluster-level critical system pods.

Member Author

Done.


In Kubernetes 1.11, priority/preemption is enabled by default and
* per-node daemons continue to be scheduled directly by the DaemonSet controller, bypassing the default scheduler. As in Kubernetes versions before 1.11, the DaemonSet controller does not preempt pods, so we continue to rely on the ["rescheduler"](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/) to guarantee that per-node daemons are able to schedule in a cluster that is full of regular user pods, by evicting regular user pods to make room for them. Per-node daemons are given a priority class of `system-node-critical`.
* cluster-level system pods continue to be scheduled by the default scheduler. The cluster-level daemons are given a priority class of `system-cluster-critical`. Because the default scheduler can preempt pods, the rescheduler in Kubernetes 1.11 is modified to *not* preempt pods to ensure the cluster-level system pods can schedule; instead we rely on the scheduler preemption mechanism to do this.
Member

s/cluster-level daemons/cluster-level system pods/

Member Author

done


The only way to prevent this vulnerability is:
* Step 1: Configure the ResourceQuota admission controller (via a config file) to use the ["limitedResources"](https://kubernetes.io/docs/concepts/policy/resource-quotas/) feature to require quota for pods in PriorityClass `system-node-critical` and `system-cluster-critical`.
* Step 2: Enable the [`ResourceQuotaScopeSelectors`](https://kubernetes.io/docs/concepts/policy/resource-quotas/) feature gate (this is in alpha feature in Kubernetes 1.11)
Member

s/is in/is an/

Member Author

done
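For reference, a sketch of what Step 1's admission controller config file might look like, based on the `limitedResources` format in the resource quota documentation. The `apiserver.config.k8s.io/v1` version shown here is from later releases; in Kubernetes 1.11 this configuration was still alpha and used an earlier `apiVersion`, so check the docs for the release you run:

```yaml
# Passed to the API server via --admission-control-config-file.
# Requires a covering ResourceQuota for any pod created in the
# system-node-critical or system-cluster-critical PriorityClass.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["system-node-critical", "system-cluster-critical"]
```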

* Step 3: Create infinite ResourceQuota in the `kube-system` namespace at PriorityClass `system-node-critical` and `system-cluster-critical` using the [scopeSelector feature of ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
Member

I guess this should say "infinite ResourceQuota for pods"

Member Author

done


This will prevent anyone who does not have access to the `kube-system` namespace from creating pods with the `system-node-critical` or `system-cluster-critical` priority class, by only allowing pods with those priority classes to be created in the `kube-system` namespace.
Member

the "by only allowing..." part could be a bit clearer: "by restricting pods with those priority classes to only be allowed in the kube-system namespace."

Member Author

done
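The "infinite ResourceQuota" of Step 3 might be sketched as follows (the object name `system-critical-pods` is a placeholder, not from the PR). Because no hard limits are set, matching pods in `kube-system` are always admitted, while the `limitedResources` rule from Step 1 rejects them in every other namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: system-critical-pods   # hypothetical name
  namespace: kube-system
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["system-node-critical", "system-cluster-critical"]
```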


@bsalamat bsalamat left a comment


Thanks, @davidopp! PTAL.


@davidopp
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bsalamat, davidopp
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: dchen1107

Assign the PR to them by writing /assign @dchen1107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AishSundar
Contributor

/cc @nickchase

@bsalamat
Member Author

ping @nickchase @calebamiles for approval

@nickchase
Contributor

An abbreviated version of this has been added to the current doc, here: https://docs.google.com/document/d/1MoHdmqSpWT4dJ3AcONwPwquNa2NIBa1dhpb0g8xyyoI/edit with a link to this PR for the full story. If someone's got a better idea, I'm all ears.

@bsalamat
Member Author

@davidopp FYI

@davidopp
Member

The part you extracted seems fine, but please link to this PR rather than the one you are currently linking to.

@bsalamat bsalamat closed this Jun 27, 2018
@davidopp
Member

We'll need a new release note in 1.11.1 that explains the new admission controller that eliminates (for all practical purposes) the vulnerability.

@bsalamat
Member Author

@davidopp Sure. I will take care of that.
