Surface resource constraint problems in TaskRun Status #876
Conversation
```go
func isPodExceedingNodeResources(pod *corev1.Pod) bool {
	for _, podStatus := range pod.Status.Conditions {
		if podStatus.Reason == corev1.PodReasonUnschedulable && strings.Contains(podStatus.Message, "Insufficient") {
			return true
		}
	}
	return false
}
```
The `strings.Contains()` here feels brittle to me, but I wasn't sure how else to narrow down an Unschedulable status to one of insufficient resources.
i agree! seems a bit unfortunate but i guess that happens sometimes with errors :D
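For illustration only, here is one way the match could be tightened a little: instead of a bare substring check, pull out the resource names from the scheduler's "Insufficient <resource>" phrases. This is a sketch, not the PR's code — it still assumes the scheduler keeps that message format, and it uses minimal local stand-in types (`Pod`, `PodCondition`) in place of the real `corev1` types.

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-ins for the corev1 fields the PR touches; the real code
// uses k8s.io/api/core/v1.
const PodReasonUnschedulable = "Unschedulable"

type PodCondition struct {
	Reason  string
	Message string
}

type Pod struct {
	Conditions []PodCondition
}

// insufficientResources returns the resource names the scheduler
// reported as insufficient. Only slightly less brittle than
// strings.Contains: it assumes messages like
// "0/3 nodes are available: 3 Insufficient cpu."
func insufficientResources(pod *Pod) []string {
	var resources []string
	for _, cond := range pod.Conditions {
		if cond.Reason != PodReasonUnschedulable {
			continue
		}
		fields := strings.Fields(cond.Message)
		for i, f := range fields {
			if f == "Insufficient" && i+1 < len(fields) {
				// Trim trailing punctuation off the resource name.
				resources = append(resources, strings.Trim(fields[i+1], ".,"))
			}
		}
	}
	return resources
}

func main() {
	pod := &Pod{Conditions: []PodCondition{{
		Reason:  PodReasonUnschedulable,
		Message: "0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.",
	}}}
	fmt.Println(insufficientResources(pod)) // prints [cpu memory]
}
```

A non-empty result would then imply "unschedulable due to resources", and the names could even be surfaced in the TaskRun message.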
When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous. Prior to this commit, when resources are limited on a Node the TaskRun will be held in a Succeeded/Unknown state with Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun will fail with a "CouldntGetTask" reason.

This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:

1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it.
2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's `kubectl describe` output.
3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with ResourceQuota rejects the TR's Pod outright.
Thanks for this @sbwsg ! Just some rambling thoughts about code organization but I think we should go ahead anyway.
p.s. excellent commit message :D
/lgtm
/approve
/meow space
```go
if tr.Spec.TaskRef != nil {
	msg = fmt.Sprintf("References a Task %s/%s that doesn't exist", tr.Namespace, tr.Spec.TaskRef.Name)
} else {
	msg = fmt.Sprintf("References a TaskSpec with missing information")
}
```
this is outside the scope of your change, but in spite of what the previous logic indicated (which I may have added... 😇 ) I think there are a variety of things that could have gone wrong here (if i remember right, even templating problems) - so one option would be to make this super generic, e.g. something like "invalid task"?
```go
} else {
	reason = "Pending"
	msg = getWaitingMessage(pod)
}
```
it feels like we're starting to build up a lot of logic in this file/package around looking at the pod and from that determining a reason and a msg - I wonder if we could move some of this out into its own package, with tests?
anyway i dont feel strongly enough about this to block merging tho, maybe it's something we can revisit
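As a rough sketch of the refactor floated above — pulling the pod-inspection logic into one small, table-testable helper — something like the following could work. All names and types here are hypothetical stand-ins, not the actual Tekton code, and the messages are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

const PodReasonUnschedulable = "Unschedulable"

type PodCondition struct {
	Reason  string
	Message string
}

type Pod struct {
	Conditions []PodCondition
}

// statusInfo is the (reason, message) pair the reconciler would copy
// onto the TaskRun. Deriving it in one pure function keeps the
// reconciler thin and makes this logic easy to table-test.
type statusInfo struct {
	Reason  string
	Message string
}

// podStatusInfo inspects a pod and decides what the TaskRun's
// status reason/message should be (sketch of the suggested helper).
func podStatusInfo(pod *Pod) statusInfo {
	for _, cond := range pod.Conditions {
		if cond.Reason == PodReasonUnschedulable && strings.Contains(cond.Message, "Insufficient") {
			return statusInfo{
				Reason:  "ExceededNodeResources",
				Message: "TaskRun pod exceeded available node resources", // illustrative text
			}
		}
	}
	return statusInfo{Reason: "Pending", Message: "pod is still pending"} // illustrative text
}

func main() {
	constrained := &Pod{Conditions: []PodCondition{{
		Reason:  PodReasonUnschedulable,
		Message: "0/1 nodes are available: 1 Insufficient memory.",
	}}}
	fmt.Println(podStatusInfo(constrained).Reason) // ExceededNodeResources
	fmt.Println(podStatusInfo(&Pod{}).Reason)      // Pending
}
```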
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, sbwsg

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Changes
When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous.

Prior to this commit, when resources are limited on a Node the TaskRun will be held in a Succeeded/Unknown state with Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun will fail with a "CouldntGetTask" reason.

This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:

1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it.
2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's `kubectl describe` output.
3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with ResourceQuota rejects the TR's Pod outright.

This PR is intended to build towards #734 by first clearly indicating when Pods are running into resource constraints. A future PR will attempt to actually tackle the problem of ResourceQuotas flatly failing TaskRuns without any kind of rescheduling.
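The mapping described in points 1 and 3 can be sketched as follows. Note the `Condition` type and the helper name here are illustrative stand-ins for the real condition plumbing, not the PR's actual code:

```go
package main

import "fmt"

// Condition is a stand-in for the Succeeded condition set on a TaskRun.
type Condition struct {
	Type    string // always "Succeeded" in this sketch
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// schedulingFailureCondition maps the two failure modes onto TaskRun
// conditions: a ResourceQuota rejection is terminal (Succeeded=False),
// while node resource pressure leaves the run in Succeeded=Unknown so
// Kubernetes can retry scheduling the pod when space frees up.
func schedulingFailureCondition(quotaRejected bool, detail string) Condition {
	if quotaRejected {
		return Condition{Type: "Succeeded", Status: "False", Reason: "ExceededResourceQuota", Message: detail}
	}
	return Condition{Type: "Succeeded", Status: "Unknown", Reason: "ExceededNodeResources", Message: detail}
}

func main() {
	node := schedulingFailureCondition(false, "0/1 nodes are available: 1 Insufficient cpu.")
	fmt.Println(node.Status, node.Reason) // Unknown ExceededNodeResources

	quota := schedulingFailureCondition(true, "pods \"build-pod\" is forbidden: exceeded quota")
	fmt.Println(quota.Status, quota.Reason) // False ExceededResourceQuota
}
```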
Screenshots
Before, TaskRun Status doesn't reflect resource constraint issues for the pods:

After, TaskRun Status reflects problems scheduling pods due to resource constraints:
Here's the ExceededNodeResources event appearing in the TaskRun's `kubectl describe` output:

Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
Question for reviewers: Does this change warrant documentation? It doesn't look like we document Status types in the taskrun.md doc.
See the contribution guide
for more details.
Release Notes