Surface resource constraint problems in TaskRun Status #876
Conversation
```go
func isPodExceedingNodeResources(pod *corev1.Pod) bool {
	for _, podStatus := range pod.Status.Conditions {
		if podStatus.Reason == corev1.PodReasonUnschedulable && strings.Contains(podStatus.Message, "Insufficient") {
			return true
		}
	}
	return false
}
```
The `strings.Contains()` here feels brittle to me, but I wasn't sure how else to narrow down an Unschedulable status to one of insufficient resources.
i agree! seems a bit unfortunate but i guess that happens sometimes with errors :D
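For illustration only, here is one way the match could be tightened a little: instead of a bare substring check, pull out the resource names from the scheduler's "Insufficient <resource>" phrases. This is a sketch, not the PR's code — it still assumes the scheduler keeps that message format, and it uses minimal local stand-in types (`Pod`, `PodCondition`) in place of the real `corev1` types.

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-ins for the corev1 fields the PR touches; the real code
// uses k8s.io/api/core/v1.
const PodReasonUnschedulable = "Unschedulable"

type PodCondition struct {
	Reason  string
	Message string
}

type Pod struct {
	Conditions []PodCondition
}

// insufficientResources returns the resource names the scheduler
// reported as insufficient. Only slightly less brittle than
// strings.Contains: it assumes messages like
// "0/3 nodes are available: 3 Insufficient cpu."
func insufficientResources(pod *Pod) []string {
	var resources []string
	for _, cond := range pod.Conditions {
		if cond.Reason != PodReasonUnschedulable {
			continue
		}
		fields := strings.Fields(cond.Message)
		for i, f := range fields {
			if f == "Insufficient" && i+1 < len(fields) {
				// Trim trailing punctuation off the resource name.
				resources = append(resources, strings.Trim(fields[i+1], ".,"))
			}
		}
	}
	return resources
}

func main() {
	pod := &Pod{Conditions: []PodCondition{{
		Reason:  PodReasonUnschedulable,
		Message: "0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.",
	}}}
	fmt.Println(insufficientResources(pod)) // prints [cpu memory]
}
```

A non-empty result would then imply "unschedulable due to resources", and the names could even be surfaced in the TaskRun message.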
When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous. Prior to this commit, when resources are limited on a Node the TaskRun will be held in a Succeeded/Unknown state with Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun will fail with a "CouldntGetTask" reason.

This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:

1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it.
2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's `kubectl describe` output.
3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with ResourceQuota rejects the TR's Pod outright.
Thanks for this @sbwsg ! Just some rambling thoughts about code organization but I think we should go ahead anyway.
p.s. excellent commit message :D
/lgtm
/approve
/meow space
```go
if tr.Spec.TaskRef != nil {
	msg = fmt.Sprintf("References a Task %s/%s that doesn't exist", tr.Namespace, tr.Spec.TaskRef.Name)
} else {
	msg = fmt.Sprintf("References a TaskSpec with missing information")
}
```
this is outside the scope of your change, but in spite of what the previous logic indicated (which I may have added... 😇 ) I think there are a variety of things that could have gone wrong here (if i remember right, even templating problems) - so one option would be to make this super generic, e.g. something like "invalid task"?
```go
} else {
	reason = "Pending"
	msg = getWaitingMessage(pod)
}
```
it feels like we're starting to build up a lot of logic in this file/package around looking at the pod and from that determining a reason and a msg - I wonder if we could move some of this out into its own package, with tests?
anyway i dont feel strongly enough about this to block merging tho, maybe it's something we can revisit
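As a rough sketch of the refactor floated above — pulling the pod-inspection logic into one small, table-testable helper — something like the following could work. All names and types here are hypothetical stand-ins, not the actual Tekton code, and the messages are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

const PodReasonUnschedulable = "Unschedulable"

type PodCondition struct {
	Reason  string
	Message string
}

type Pod struct {
	Conditions []PodCondition
}

// statusInfo is the (reason, message) pair the reconciler would copy
// onto the TaskRun. Deriving it in one pure function keeps the
// reconciler thin and makes this logic easy to table-test.
type statusInfo struct {
	Reason  string
	Message string
}

// podStatusInfo inspects a pod and decides what the TaskRun's
// status reason/message should be (sketch of the suggested helper).
func podStatusInfo(pod *Pod) statusInfo {
	for _, cond := range pod.Conditions {
		if cond.Reason == PodReasonUnschedulable && strings.Contains(cond.Message, "Insufficient") {
			return statusInfo{
				Reason:  "ExceededNodeResources",
				Message: "TaskRun pod exceeded available node resources", // illustrative text
			}
		}
	}
	return statusInfo{Reason: "Pending", Message: "pod is still pending"} // illustrative text
}

func main() {
	constrained := &Pod{Conditions: []PodCondition{{
		Reason:  PodReasonUnschedulable,
		Message: "0/1 nodes are available: 1 Insufficient memory.",
	}}}
	fmt.Println(podStatusInfo(constrained).Reason) // ExceededNodeResources
	fmt.Println(podStatusInfo(&Pod{}).Reason)      // Pending
}
```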
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, sbwsg

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Changes
When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous.

Prior to this commit, when resources are limited on a Node the TaskRun will be held in a Succeeded/Unknown state with Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun will fail with a "CouldntGetTask" reason.

This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:

1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it.
2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's `kubectl describe` output.
3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with ResourceQuota rejects the TR's Pod outright.

This PR is intended to build towards #734 by first clearly indicating when Pods are running into resource constraints. A future PR will attempt to actually tackle the problem of ResourceQuotas flatly failing TaskRuns without any kind of rescheduling.
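The mapping described in points 1 and 3 can be sketched as follows. Note the `Condition` type and the helper name here are illustrative stand-ins for the real condition plumbing, not the PR's actual code:

```go
package main

import "fmt"

// Condition is a stand-in for the Succeeded condition set on a TaskRun.
type Condition struct {
	Type    string // always "Succeeded" in this sketch
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// schedulingFailureCondition maps the two failure modes onto TaskRun
// conditions: a ResourceQuota rejection is terminal (Succeeded=False),
// while node resource pressure leaves the run in Succeeded=Unknown so
// Kubernetes can retry scheduling the pod when space frees up.
func schedulingFailureCondition(quotaRejected bool, detail string) Condition {
	if quotaRejected {
		return Condition{Type: "Succeeded", Status: "False", Reason: "ExceededResourceQuota", Message: detail}
	}
	return Condition{Type: "Succeeded", Status: "Unknown", Reason: "ExceededNodeResources", Message: detail}
}

func main() {
	node := schedulingFailureCondition(false, "0/1 nodes are available: 1 Insufficient cpu.")
	fmt.Println(node.Status, node.Reason) // Unknown ExceededNodeResources

	quota := schedulingFailureCondition(true, "pods \"build-pod\" is forbidden: exceeded quota")
	fmt.Println(quota.Status, quota.Reason) // False ExceededResourceQuota
}
```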
Screenshots
Before, TaskRun Status doesn't reflect resource constraint issues for the pods:

After, TaskRun Status reflects problems scheduling pods due to resource constraints:
Here's the ExceededNodeResources event appearing in the TaskRun's `kubectl describe` output:

Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
Question for reviewers: Does this change warrant documentation? It doesn't look like we document Status types in the taskrun.md doc.
See the contribution guide
for more details.
Release Notes