
Surface resource constraint problems in TaskRun Status #876

Merged
merged 1 commit into tektoncd:master from surface-resource-constrained-taskruns on May 21, 2019

Conversation

@ghost commented May 17, 2019

Changes

When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous.

Prior to this commit, when resources are limited on a Node the TaskRun would be held in a Succeeded/Unknown state with a Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun would fail with a "CouldntGetTask" reason.

This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:

  1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it.
  2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's kubectl describe output.
  3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with a ResourceQuota rejects the TaskRun's Pod outright.

This PR is intended to build towards #734 by first clearly indicating when Pods are running into resource constraints. A future PR will attempt to actually tackle the problem of ResourceQuotas flatly failing TaskRuns without any kind of rescheduling.
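
To make the intended behaviour concrete, here's a rough sketch of the two outcomes. This is illustrative only: the package name, the scheduleFailure helper, and the quotaRejected flag are made up for the example and are not code from this PR.

package taskrunstatus

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical constants mirroring the reasons described above.
const (
	ReasonExceededNodeResources = "ExceededNodeResources"
	ReasonExceededResourceQuota = "ExceededResourceQuota"
)

// scheduleFailure sketches the two outcomes: a full node keeps the TaskRun's
// Succeeded condition at Unknown (the scheduler will retry the Pod), while a
// ResourceQuota rejection is treated as terminal and goes straight to False.
func scheduleFailure(pod *corev1.Pod, quotaRejected bool) (reason string, succeeded corev1.ConditionStatus) {
	if quotaRejected {
		// The namespace's ResourceQuota rejected the Pod outright.
		return ReasonExceededResourceQuota, corev1.ConditionFalse
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Reason == corev1.PodReasonUnschedulable && strings.Contains(cond.Message, "Insufficient") {
			// The node is out of resources; Kubernetes will reschedule the Pod later.
			return ReasonExceededNodeResources, corev1.ConditionUnknown
		}
	}
	return "Pending", corev1.ConditionUnknown
}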

Screenshots

Before, TaskRun Status doesn't reflect resource constraint issues for the pods:

[screenshot: TaskRun status before this change]

After, TaskRun Status reflects problems scheduling pods due to resource constraints:

[screenshot: TaskRun status after this change]

Here's the ExceededNodeResources event appearing in the TaskRun's kubectl describe output:

[screenshot: kubectl describe output showing the ExceededNodeResources event]

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

Question for reviewers: Does this change warrant documentation? It doesn't look like we document Status types in the taskrun.md doc.

See the contribution guide
for more details.

Release Notes

TaskRuns will now have their status updated when their pods fail to be scheduled due to resource constraints on the node or in the namespace.

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label May 17, 2019
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 17, 2019

func isPodExceedingNodeResources(pod *corev1.Pod) bool {
	for _, podStatus := range pod.Status.Conditions {
		if podStatus.Reason == corev1.PodReasonUnschedulable && strings.Contains(podStatus.Message, "Insufficient") {

Author:

The strings.Contains() here feels brittle to me but I wasn't sure how else to narrow down an Unschedulable status to one of insufficient resources.

Collaborator:

i agree! seems a bit unfortunate but i guess that happens sometimes with errors :D
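
For reference, the default scheduler surfaces this situation through a PodScheduled condition whose message names the missing resource (e.g. "0/3 nodes are available: 3 Insufficient cpu."), which is what the strings.Contains check keys on. A minimal test sketch along those lines (the taskrun package name is assumed, and it relies on the isPodExceedingNodeResources helper shown above):

package taskrun

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

func TestIsPodExceedingNodeResources(t *testing.T) {
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{{
				Type:   corev1.PodScheduled,
				Status: corev1.ConditionFalse,
				Reason: corev1.PodReasonUnschedulable,
				// Example of the message format the default scheduler emits when nodes are full.
				Message: "0/3 nodes are available: 3 Insufficient cpu.",
			}},
		},
	}
	if !isPodExceedingNodeResources(pod) {
		t.Error("expected pod to be flagged as exceeding node resources")
	}
}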

@bobcatfish (Collaborator) left a comment:

Thanks for this @sbwsg ! Just some rambling thoughts about code organization but I think we should go ahead anyway.

p.s. excellent commit message :D

/lgtm
/approve
/meow space

	if tr.Spec.TaskRef != nil {
		msg = fmt.Sprintf("References a Task %s/%s that doesn't exist", tr.Namespace, tr.Spec.TaskRef.Name)
	} else {
		msg = fmt.Sprintf("References a TaskSpec with missing information")

Collaborator:

this is outside the scope of your change, but in spite of what the previous logic indicated (which I may have added... 😇 ) I think there are a variety of things that could have gone wrong here (if I remember right, even templating problems) - so one option would be to make this super generic, e.g. something like "invalid task"?



	} else {
		reason = "Pending"
		msg = getWaitingMessage(pod)
	}

Collaborator:

it feels like we're starting to build up a lot of logic in this file/package around looking at the pod and from that determining a reason and a msg - I wonder if we could move some of this out into its own package, with tests?

anyway i don't feel strongly enough about this to block merging tho, maybe it's something we can revisit
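
Purely as an illustration of that idea (the package and function names below are made up, not part of this PR), the pod-inspection logic could sit behind a small, separately testable API:

// Package podconvert is a hypothetical home for the "inspect the Pod, derive
// a TaskRun reason and message" logic, so it can be unit tested on its own.
package podconvert

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Status is the distilled result the TaskRun reconciler would consume.
type Status struct {
	Reason  string
	Message string
}

// FromPod maps Pod conditions to a reason/message pair, e.g. returning
// ExceededNodeResources when the scheduler reports insufficient resources.
func FromPod(pod *corev1.Pod) Status {
	for _, cond := range pod.Status.Conditions {
		if cond.Reason == corev1.PodReasonUnschedulable && strings.Contains(cond.Message, "Insufficient") {
			return Status{Reason: "ExceededNodeResources", Message: cond.Message}
		}
	}
	// The real code derives a richer waiting message from the Pod; "Pending" stands in here.
	return Status{Reason: "Pending", Message: "Pending"}
}

The reconciler would then just call podconvert.FromPod(pod) and copy the result onto the TaskRun's condition, keeping all the string matching behind one tested function.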

@tekton-robot (Collaborator), in response to /meow space:

@bobcatfish: [cat image]

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2019
@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, sbwsg


@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2019
@tekton-robot tekton-robot merged commit a1119d0 into tektoncd:master May 21, 2019
@ghost ghost deleted the surface-resource-constrained-taskruns branch May 22, 2019 11:54