
Reattempt pod creation in the face of ResourceQuota errors #905

Merged
merged 1 commit into tektoncd:master from retry-resource-quota-errors on May 29, 2019

Conversation

@ghost commented May 24, 2019

Changes

Fixes #734

TLDR

TaskRuns used to fail immediately if their pods hit the ceiling of a ResourceQuota. Now they don't fail immediately and instead reattempt creating the pod until success or timeout.

Long Version:

TaskRuns create pods. Pods created in a namespace with ResourceQuota can be rejected on the basis of their resource requirements. This rejection manifests in the TaskRun reconciler as an error from the
createPod() function.

Prior to this commit all errors from createPod() were considered fatal and the associated TaskRun would be marked as failed. This commit introduces a process for reattempting pod creation in the face of ResourceQuota errors.

When a ResourceQuota error is encountered, the TR reconciler now performs the following work:

  • The TaskRun is marked as Succeeded/Unknown with reason ExceededResourceQuota and a message including the number of reattempts that have been made
  • A reconcile is scheduled for the TaskRun using a backoff w/ jitter strategy

Pod creation will be continually reattempted until the ResourceQuota errors stop or the TaskRun times out. A rough sketch of this flow follows.


Screenshots

Before

TaskRuns that hit a ResourceQuota ceiling immediately fail out:


After

TaskRuns that hit a ResourceQuota are placed into a Succeeded/Unknown state and retry their pod creation until success or timeout.


Release Notes

Task pods that hit resource limits imposed by ResourceQuota objects in a namespace will no longer fail their TaskRun immediately. Pod creation will be retried until the pod starts successfully or the TaskRun times out.

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label May 24, 2019
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 24, 2019
@abayer (Contributor) commented May 24, 2019

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 24, 2019
@dlorenc (Contributor) left a comment

A few minor nits on tests.

c, _ := test.SeedTestData(t, d)
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(c.Kube, c.Pipeline, stopCh, zap.New(observer).Sugar())
dur := 50 * time.Millisecond
@dlorenc (Contributor):

where does this duration come from? Is it just random?

Author:

Renamed variable to be more clear on its purpose here - it's the timer's duration. Added another variable beneath this one to document the timer's failure case more clearly.

}
testHandler.SetTaskRunCallbackFunc(callback)
testHandler.SetTaskRunTimer(taskRun, dur)
<-time.After(100 * time.Millisecond)
@dlorenc (Contributor):

Hmm, I always get a weird feeling when I see hardcoded sleeps in tests. What about another channel you wait on with a select (and a timeout like this) that you write to in the callback?

Author:

Nice, updated to a select {} that relies on a done channel and an expiration timer.

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label May 28, 2019
@vdemeester (Member) left a comment

/lgtm

@tekton-robot tekton-robot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels May 28, 2019
@vdemeester (Member)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2019
@bobcatfish (Collaborator) left a comment

Looking good! I left some minor feedback and then my most major piece of feedback: could waitRun and setTimer share more code (i.e. become one function) and then be able to share tests as well?

/meow boxes

b.count += 1
b.deadline = time.Now().Add(backoffDuration(b.count))
t.backoffs[runObj.GetRunKey()] = b
return b.count, b.deadline, true
@bobcatfish (Collaborator):

this is super minor, but I'm curious why you wanted to return b.count and b.deadline as two separate values here, vs. returning b directly? (i.e. return b, true)

Author:

The backoff struct felt to me like an implementation detail of the internal state of the backoff-tracking mechanism and I didn't sense a strong reason to expose a new public type in the package's interface. I don't feel strongly about this and am happy to change it but am curious what might be the benefit of making it public?

@bobcatfish (Collaborator):

but am curious what might be the benefit of making it public?

The only one I can think of is simplifying the signature of the function a bit - since you already have a structure!

It reminded me of a pattern I've often seen where a function takes too many parameters or returns too many things, and this is often a sign that another abstraction can be added that would simplify the interface (this website talks about it a bit: https://refactoring.guru/smells/long-parameter-list disclaimer I have never seen this website before haha) - and it seemed like what was happening here was doing the opposite

Just a super minor thing tho!

 Reason: reason,
 Message: fmt.Sprintf("%s: %v", msg, err),
 })
 c.Recorder.Eventf(tr, corev1.EventTypeWarning, "BuildCreationFailed", "Failed to create build pod %q: %v", tr.Name, err)
-c.Logger.Errorf("Failed to create build pod for task %q :%v", err, tr.Name)
+c.Logger.Errorf("Failed to create build pod for task %q: %v", tr.Name, err)
@bobcatfish (Collaborator):

what do you think about moving the logic around determining the status of the Run out of this function and into a separate function that could have its own tests?

@tekton-robot (Collaborator)

@bobcatfish: cat image

In response to this:

Looking good! I left some minor feedback and then my most major piece of feedback: could waitRun and setTimer share more code (i.e. become one function) and then be able to share tests as well?

/meow boxes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label May 28, 2019
@bobcatfish (Collaborator) left a comment

Looking great! My feedback is becoming more and more minor X'D

// NumAttempts reflects the number of times a given StatusKey has been delayed
NumAttempts uint
// NextAttempt is the point in time at which this backoff expires
NextAttempt time.Time
@bobcatfish (Collaborator):

ah nice, i like the new names!

t.Run(tc.description, func(t *testing.T) {
receivedDuration := GetTimeout(tc.inputDuration)
if receivedDuration != tc.expectedDuration {
t.Errorf("expected %q received %q", tc.expectedDuration.String(), receivedDuration.String())
@bobcatfish (Collaborator):

nice! i think the test is super clear this way :D

stopCh := make(chan struct{})
c, _ := test.SeedTestData(t, d)
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(c.Kube, c.Pipeline, stopCh, zap.New(observer).Sugar())
@bobcatfish (Collaborator):

i just realized something but this is totally just scope creep so I totally understand if you want to ignore me XD

I looked at this test and I was like "why does it need a kube client and a pipeline client"? After looking into it, it looks like those clients only exist in the timeout handler to support this one call to CheckTimeouts - which means that instead of passing the clients into the timeout handler, the CheckTimeouts function could take them as arguments :D which, if I'm right, would simplify the setup required for this test

anyway feel free to ignore this XD

timerFailDeadline := 100 * time.Millisecond
doneCh := make(chan struct{})
callback := func(_ interface{}) {
close(doneCh)
@bobcatfish (Collaborator):

you could also have this function set a boolean or something that you could use to assert that the function was called or not called as expected

Author:

I initially implemented it using a bool that the callback mutated but this felt a little icky due to the inline sleep:

done := false
testHandler.SetTaskRunCallbackFunc(func(_ interface{}) { done = true })
go testHandler.SetTaskRunTimer(taskRun, timerDuration)
<-time.After(timerFailDeadline) // ick?
if !done {
  t.Errorf("timer did not execute task run callback func within expected time")
}

I kind of like how the select approach races the two channels against each other. Full credit to @dlorenc (#905 (comment)) for pointing this out.

I may have completely misinterpreted either or both approaches though so lmk if there's something I'm missing.

packageJitterFunc := jitterFunc
for _, tc := range testcases {
t.Run(tc.description, func(t *testing.T) {
jitterFunc = tc.jitterFunc
@bobcatfish (Collaborator):

super minor: you could do this assignment before the for loop (i.e. it doesn't need to happen for every test)

@vdemeester (Member):

@bobcatfish I am guessing this is resolved as the jitterFunc comes from the test cases (so got to be packageJitterFunc)

@vdemeester (Member) left a comment

few minor comments 👼


for _, tc := range testcases {
t.Run(tc.description, func(t *testing.T) {
jitterFunc = tc.jitterFunc
result := backoffDuration(tc.inputCount)
@vdemeester (Member) May 29, 2019:

I wonder if we could pass the jitterFunc as an argument to backoffDuration instead. This would remove the need to set (or have) the jitterFunc global variable (just for the test) 👼

@vdemeester (Member):

// in code
backoffDuration(b.NumAttempts, rand.Intn)
// in tests
backoffDuration(tc.inputCount, tc.jitterFunc)

@ghost (Author) commented May 29, 2019

@bobcatfish @vdemeester thanks for the feedback so far. I've updated jitterFunc to be an arg passed to backoffDuration(), moved taskrun status-related code to its own method (w/ UTs), and modified CheckTimeouts() to take clientsets as arguments. That last change required some small refactoring in controller/main.go and the pipelinerun / taskrun reconcilers.

@vdemeester (Member) left a comment

Looks good to me.
I'll leave the final word to @bobcatfish.

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2019
@@ -122,22 +181,22 @@ func (t *TimeoutSet) checkPipelineRunTimeouts(namespace string) {

 // CheckTimeouts function iterates through all namespaces and calls corresponding
 // taskrun/pipelinerun timeout functions
-func (t *TimeoutSet) CheckTimeouts() {
-	namespaces, err := t.kubeclientset.CoreV1().Namespaces().List(metav1.ListOptions{})
+func (t *TimeoutSet) CheckTimeouts(kubeclientset kubernetes.Interface, pipelineclientset clientset.Interface) {
Collaborator:

niiiiice thanks for this cleanup with the clients!!! 😍 😍 😍

@bobcatfish (Collaborator)

Looks awesome!! Thanks @sbwsg :D and thanks for the clean up too ^.^


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2019
@vdemeester (Member)

ah @sbwsg, needs a rebase 🙇‍♂️

TaskRuns create pods. Pods created in a namespace with ResourceQuota
can be rejected on the basis of their resource requirements. This
rejection manifests in the TaskRun reconciler as an error from the
createPod() function.

Prior to this commit all errors from createPod() were considered
fatal and the associated TaskRun would be marked as failed. This commit
introduces a process for reattempting pod creation in the face of
ResourceQuota errors.

When a ResourceQuota error is encountered, the TR reconciler now
performs the following work:

- The TaskRun is marked as Succeeded/Unknown with reason
ExceededResourceQuota and a message including the number of reattempts
that have been made
- A reconcile is scheduled for the TaskRun using a backoff w/ jitter
strategy

Pod creation will be continually reattempted until the ResourceQuota
errors stop or the TaskRun times out.
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2019
@ghost (Author) commented May 29, 2019

🤞

@vdemeester (Member)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2019
@tekton-robot tekton-robot merged commit d1e178d into tektoncd:master May 29, 2019
@ghost ghost deleted the retry-resource-quota-errors branch May 30, 2019 12:48
bobcatfish added a commit to bobcatfish/pipeline that referenced this pull request Aug 19, 2019
Scott has been providing super detailed, helpful reviews for a while now
(58 reviews as of
https://tekton.devstats.cd.foundation/d/46/pr-reviews-by-contributor?orgId=1&var-period=d&var-repo_name=tektoncd%2Fpipeline&var-reviewers="sbwsg").

He has contributed useful and technically challenging features such as
tektoncd#905 (recreating pods in the face of ResourceQuota errors), drove
completion of tektoncd#936 (graceful sidecar support), and also
tektoncd#871 (enforcing default TaskRun timeouts).
@bobcatfish bobcatfish mentioned this pull request Aug 19, 2019
1 task
tekton-robot pushed a commit that referenced this pull request Aug 26, 2019
Labels
approved · cla: yes · lgtm · size/L
Development

Successfully merging this pull request may close these issues.

Schedule Pods in resource-constrained environments
6 participants