
Improve BuildRun Failure State Transitions #558

Closed
qu1queee opened this issue Jan 29, 2021 · 12 comments · Fixed by #593 or #641
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@qu1queee
Contributor

Idea:

This is coming from an internal bug we found in our continuous tests. At the moment a BuildRun can be marked as completed in specific scenarios, but it can happen that at some later point the BuildRun controller reconciles again and the pod runs to completion.

In the above scenario the BuildRun status will be in a failed state, while the container image was successfully built.

In order to address the above, we have been discussing how we should treat a BuildRun in general. Our idea is that a BuildRun is:

A one-try shot: either it starts successfully or it does not.

which makes sense because we are aiming to run something till completion. If the above is the case, we will need to ensure this expectation matches the code implementation, in order to avoid the above scenario.

@qu1queee qu1queee added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. community labels Jan 29, 2021
@qu1queee
Contributor Author

qu1queee commented Feb 1, 2021

Keeping this one open for a while. The general idea so far is to follow a

One-try shot approach

where a BuildRun cannot change its state after the CompletionTime is set.

The above is coming from #551
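The one-try-shot rule can be sketched as a guard on status updates. This is a stdlib-only illustration with stand-in types, not the actual Shipwright Build API: once CompletionTime is set, no later reconciliation may overwrite the recorded outcome.

```go
package main

import (
	"fmt"
	"time"
)

// BuildRunStatus is a minimal stand-in for the fields involved; the real
// types live in the Shipwright Build API.
type BuildRunStatus struct {
	CompletionTime *time.Time
	Reason         string
}

// applyStatus illustrates the "one-try shot" rule: the first terminal
// update sets CompletionTime; every later update is refused.
func applyStatus(status *BuildRunStatus, reason string) bool {
	if status.CompletionTime != nil {
		return false // terminal state already reached; refuse the update
	}
	now := time.Now()
	status.CompletionTime = &now
	status.Reason = reason
	return true
}

func main() {
	s := &BuildRunStatus{}
	fmt.Println(applyStatus(s, "Succeeded")) // first update is accepted
	fmt.Println(applyStatus(s, "Failed"))    // later update is rejected
	fmt.Println(s.Reason)
}
```

With such a guard, a late pod completion reconciled after a failure (or vice versa) cannot flip the recorded state.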

@SaschaSchwarze0
Member

SaschaSchwarze0 commented Feb 1, 2021

I would like to broaden the scope of this issue a little bit:

In the BuildRun controller we have the problematic updateBuildRunErrorStatus function which sets the completionTime and the error details. But we actually use it for two different kinds of errors:

  1. User-originated errors like a Build or ServiceAccount referenced in the BuildRun that does not exist.
  2. System-originated errors like a failure to create a TaskRun (I just had a network timeout from the Tekton validating webhook).

For user-originated errors, I propose to make the BuildRun a one-time shot: we set the BuildRun error details, we also set its completionTime, and then the controller MUST NOT return an error so that no further retries happen. What we should consider in this context is to prohibit end users from updating a BuildRun. We want them to create a new one; as such, I do not see any use case to allow them to do an update operation at all on a BuildRun. EDIT: once we want to do something like cancel, we will likely still need to allow updating a BuildRun.

For system-originated errors where we do intend to retry, we should IMO either stop updating the BuildRun status at all (and only return the error to trigger the retry) or set a temporary failure without a completionTime that we remove once the retry succeeds.

--

Also a little bit related to my comment in Fix missing BuildRun Conditions updates on errors #548. Fyi @xiujuan95

@gabemontero
Member

FYI whatever traction that might exist for retry in Tekton is centered around this TEP: tektoncd/community#239

Evidently there is some retry support at the PipelineTask level that I at least was previously unaware of. A quote from that TEP: PipelineTask can be rerun after failure when user specifies retries in PipelineTask

And found this: https://github.com/tektoncd/pipeline/blob/master/pkg/apis/pipeline/v1beta1/pipeline_types.go#L137

Granted, it seems orthogonal since Shipwright does not leverage PipelineTask.

But another "monitor upstream Tekton in this space" element for this one perhaps.

@gabemontero
Member

> I would like to broaden the scope of this issue a little bit:
>
> In the BuildRun controller we have the problematic updateBuildRunErrorStatus function which sets the completionTime and the error details. But we actually use it for two different kinds of errors:
>
> 1. User-originated errors like a Build or ServiceAccount referenced in the BuildRun that does not exist.
>
> 2. System-originated errors like a failure to create a TaskRun (I just had a network timeout from the Tekton validating webhook).
>
> For user-originated errors, I propose to make the BuildRun a one-time shot: we set the BuildRun error details, we also set its completionTime, and then the controller MUST NOT return an error so that no further retries happen. What we should consider in this context is to prohibit end users from updating a BuildRun. We want them to create a new one; as such, I do not see any use case to allow them to do an update operation at all on a BuildRun.
>
> For system-originated errors where we do intend to retry, we should either stop updating the BuildRun status at all (and only return the error to trigger the retry) or set a temporary failure without a completionTime that we remove once the retry succeeds.
>
> --
>
> Also a little bit related to my comment in Fix missing BuildRun Conditions updates on errors #548. Fyi @xiujuan95

I agree with the user vs. system error distinction and that we should allow the controller's behavior to vary based on that, @SaschaSchwarze0.

@adambkaplan
Member

> We want them to create a new one; as such, I do not see any use case to allow them to do an update operation at all on a BuildRun. EDIT: once we want to do something like cancel, we will likely still need to allow updating a BuildRun.

Agree - I envision cancel being implemented by setting the cancelled field on the BuildRun spec, and ignoring or rejecting other changes. A validating admission webhook could help here, but IMO is not necessary for minimum viability.
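The "reject every update except cancellation" rule could be sketched as a validation function like the one below. All field names here (`BuildRef`, `State`, the `"BuildRunStateCancel"` value) are illustrative assumptions, and this is a simplified stand-in for what a validating admission webhook would enforce, not the actual Shipwright implementation.

```go
package main

import "fmt"

// BuildRunSpec is a hypothetical, simplified spec; field names are
// illustrative only.
type BuildRunSpec struct {
	BuildRef string
	State    string // empty, or "BuildRunStateCancel" once cancelled
}

// validateUpdate sketches the admission rule discussed above: the only
// spec change allowed on an existing BuildRun is requesting cancellation.
func validateUpdate(oldSpec, newSpec BuildRunSpec) error {
	if oldSpec.BuildRef != newSpec.BuildRef {
		return fmt.Errorf("BuildRun spec is immutable except for cancellation")
	}
	if newSpec.State != oldSpec.State && newSpec.State != "BuildRunStateCancel" {
		return fmt.Errorf("only a cancellation may change the state")
	}
	return nil
}

func main() {
	old := BuildRunSpec{BuildRef: "my-build"}
	// Cancelling is allowed; changing the referenced Build is not.
	fmt.Println(validateUpdate(old, BuildRunSpec{BuildRef: "my-build", State: "BuildRunStateCancel"}))
	fmt.Println(validateUpdate(old, BuildRunSpec{BuildRef: "other-build"}) != nil)
}
```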

> For system-originated errors where we do intend to retry, we should either stop updating the BuildRun status at all (and only return the error to trigger the retry) or set a temporary failure without a completionTime that we remove once the retry succeeds.

My preference is to just not update the status and retry the reconciliation. The caveat being that if we reach a reconciliation retry limit, then we should update the status with an error message.

@SaschaSchwarze0
Member

Good comments @adambkaplan and @gabemontero. Had a call with @qu1queee just an hour ago on this as well. I think we are totally on the same page here.

Just one question on the following @adambkaplan

> The caveat being that if we reach a reconciliation retry limit, then we should update the status with an error message.

More than half a year ago we also discussed a scenario in a community meeting or issue (I think it was related to the Build-Secret relationship at that time, which has since been handled with a different implementation). How would one implement a way to detect that, for the same reconciliation request, one is in a retry loop that has already been ongoing for n1 retries or n2 minutes?

@adambkaplan
Member

> How would one implement a way to detect that for the same reconciliation request, one is in a retry loop that is already ongoing for n1 retries or n2 minutes?

Hrm, it seems there is no direct way to say "stop reconciling after X failed attempts". However, in reviewing other issues reported on controller-runtime it seems that if the reconciler returns Requeue: true or we return an error, an exponential backoff is used to ensure there is no hot looping [1]. This takes care of my concerns regarding perpetual retries - hopefully if we are doing this due to a bug we have sufficient logs to surface the issue, and the exponential backoff ensures that we don't swamp the controller with bad retries.

[1] kubernetes-sigs/controller-runtime#808

@SaschaSchwarze0
Member

> How would one implement a way to detect that for the same reconciliation request, one is in a retry loop that is already ongoing for n1 retries or n2 minutes?
>
> Hrm, it seems there is no direct way to say "stop reconciling after X failed attempts". However, in reviewing other issues reported on controller-runtime it seems that if the reconciler returns Requeue: true or we return an error, an exponential backoff is used to ensure there is no hot looping [1]. This takes care of my concerns regarding perpetual retries - hopefully if we are doing this due to a bug we have sufficient logs to surface the issue, and the exponential backoff ensures that we don't swamp the controller with bad retries.
>
> [1] kubernetes-sigs/controller-runtime#808

Okay, we are on the same page then. Thanks for clarifying.

@qu1queee
Contributor Author

qu1queee commented Feb 3, 2021

I will provide a PR to address this issue and also to address this enhancement #548 (comment). I'm assigning this issue to myself in the meantime.

@SaschaSchwarze0
Member

I think this should not yet have been closed.

@qu1queee qu1queee added this to the release-v0.4.0 milestone Feb 19, 2021
@qu1queee
Contributor Author

FYI, I'm already working on this one.

@adambkaplan adambkaplan added kind/feature Categorizes issue or PR as related to a new feature. and removed community help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 10, 2021
@adambkaplan adambkaplan changed the title How should we treat a BuildRun? Improve BuildRun Failure Reporting Mar 10, 2021
@adambkaplan adambkaplan changed the title Improve BuildRun Failure Reporting Improve BuildRun Failure State Transitions Mar 10, 2021
@qu1queee
Contributor Author

This was tackled via #641.
