
Improve BuildRun Failure State Transitions #558

Closed
qu1queee opened this issue Jan 29, 2021 · 12 comments · Fixed by #593 or #641
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@qu1queee
Contributor

Idea:

This is coming from an internal bug we found in our continuous tests. At the moment a BuildRun can be marked as completed in specific scenarios, but it can happen that at some later point the BuildRun controller reconciles again and the pod runs to completion.

In the above scenario the BuildRun status will be in a failed state, while the container image was successfully built.

In order to address the above, we have been discussing how we should treat a BuildRun in general. Our idea is that a BuildRun is:

A one-try shot: either it starts successfully or it does not.

which makes sense because we are aiming to run something till completion. If the above is the case, we will need to ensure this expectation matches the code implementation, in order to avoid the above scenario.

@qu1queee qu1queee added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. community labels Jan 29, 2021
@qu1queee
Contributor Author

qu1queee commented Feb 1, 2021

Keeping this one open for a while. The general idea so far is to follow a

One-try shot approach

where a BuildRun cannot change its state after the CompletionTime is set.

The above is coming from #551
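The one-try-shot rule can be sketched as a guard on status updates. This is a stdlib-only illustration with stand-in types, not the actual Shipwright Build API: once CompletionTime is set, no later reconciliation may overwrite the recorded outcome.

```go
package main

import (
	"fmt"
	"time"
)

// BuildRunStatus is a minimal stand-in for the fields involved; the real
// types live in the Shipwright Build API.
type BuildRunStatus struct {
	CompletionTime *time.Time
	Reason         string
}

// applyStatus illustrates the "one-try shot" rule: the first terminal
// update sets CompletionTime; every later update is refused.
func applyStatus(status *BuildRunStatus, reason string) bool {
	if status.CompletionTime != nil {
		return false // terminal state already reached; refuse the update
	}
	now := time.Now()
	status.CompletionTime = &now
	status.Reason = reason
	return true
}

func main() {
	s := &BuildRunStatus{}
	fmt.Println(applyStatus(s, "Succeeded")) // first update is accepted
	fmt.Println(applyStatus(s, "Failed"))    // later update is rejected
	fmt.Println(s.Reason)
}
```

With such a guard, a late pod completion reconciled after a failure (or vice versa) cannot flip the recorded state.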

@SaschaSchwarze0
Member

SaschaSchwarze0 commented Feb 1, 2021

I would like to broaden the scope of this issue a little bit:

In the BuildRun controller we have the problematic updateBuildRunErrorStatus function which sets the completionTime and the error details. But we actually use it for two different kinds of errors:

  1. User-originated errors like a Build or ServiceAccount referenced in the BuildRun that does not exist.
  2. System-originated errors like a failure to create a TaskRun (I just had a network timeout from the Tekton validating webhook).

For user-originated errors, I propose to make the BuildRun a one-time shot: we set the BuildRun error details, we also set its completionTime, and then the controller MUST NOT return an error so that no further retries happen. What we should consider in this context is to prohibit end users from updating a BuildRun. We want them to create a new one; as such, I do not see any use case to allow them to do an update operation at all on a BuildRun. EDIT: once we want to do something like cancel, we will likely still need to allow updating a BuildRun.

For system-originated errors where we do intend to retry, we should IMO either stop updating the BuildRun status at all (and only return the error to trigger the retry) or set a temporary failure without a completionTime that we remove once the retry succeeds.

--

Also a little bit related to my comment in Fix missing BuildRun Conditions updates on errors #548. Fyi @xiujuan95

@gabemontero
Member

FYI whatever traction that might exist for retry in Tekton is centered around this TEP: tektoncd/community#239

Evidently there is some retry support at the PipelineTask level that I at least was previously unaware of. A quote from that TEP: PipelineTask can be rerun after failure when user specifies retries in PipelineTask

And found this: https://github.com/tektoncd/pipeline/blob/master/pkg/apis/pipeline/v1beta1/pipeline_types.go#L137

Granted, it seems orthogonal since Shipwright does not leverage PipelineTask.

But another "monitor upstream Tekton in this space" element for this one perhaps.

@gabemontero
Member

> I would like to broaden the scope of this issue a little bit:
>
> In the BuildRun controller we have the problematic updateBuildRunErrorStatus function which sets the completionTime and the error details. But we actually use it for two different kinds of errors:
>
> 1. User-originated errors like a Build or ServiceAccount referenced in the BuildRun that does not exist.
>
> 2. System-originated errors like a failure to create a TaskRun (I just had a network timeout from the Tekton validating webhook).
>
> For user-originated errors, I propose to make the BuildRun a one-time shot: we set the BuildRun error details, we also set its completionTime, and then the controller MUST NOT return an error so that no further retries happen. What we should consider in this context is to prohibit end users from updating a BuildRun. We want them to create a new one; as such, I do not see any use case to allow them to do an update operation at all on a BuildRun.
>
> For system-originated errors where we do intend to retry, we should either stop updating the BuildRun status at all (and only return the error to trigger the retry) or set a temporary failure without a completionTime that we remove once the retry succeeds.
>
> --
>
> Also a little bit related to my comment in Fix missing BuildRun Conditions updates on errors #548. Fyi @xiujuan95

I agree with the user vs. system error distinction and that we should allow the controller's behavior to vary based on that, @SaschaSchwarze0.

@adambkaplan
Member

> We want them to create a new one; as such, I do not see any use case to allow them to do an update operation at all on a BuildRun. EDIT: once we want to do something like cancel, we will likely still need to allow updating a BuildRun.

Agree - I envision cancel being implemented by setting the cancelled field on the BuildRun spec, and ignoring or rejecting other changes. A validating admission webhook could help here, but IMO is not necessary for minimum viability.
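The "reject every update except cancellation" rule could be sketched as a validation function like the one below. All field names here (`BuildRef`, `State`, the `"BuildRunStateCancel"` value) are illustrative assumptions, and this is a simplified stand-in for what a validating admission webhook would enforce, not the actual Shipwright implementation.

```go
package main

import "fmt"

// BuildRunSpec is a hypothetical, simplified spec; field names are
// illustrative only.
type BuildRunSpec struct {
	BuildRef string
	State    string // empty, or "BuildRunStateCancel" once cancelled
}

// validateUpdate sketches the admission rule discussed above: the only
// spec change allowed on an existing BuildRun is requesting cancellation.
func validateUpdate(oldSpec, newSpec BuildRunSpec) error {
	if oldSpec.BuildRef != newSpec.BuildRef {
		return fmt.Errorf("BuildRun spec is immutable except for cancellation")
	}
	if newSpec.State != oldSpec.State && newSpec.State != "BuildRunStateCancel" {
		return fmt.Errorf("only a cancellation may change the state")
	}
	return nil
}

func main() {
	old := BuildRunSpec{BuildRef: "my-build"}
	// Cancelling is allowed; changing the referenced Build is not.
	fmt.Println(validateUpdate(old, BuildRunSpec{BuildRef: "my-build", State: "BuildRunStateCancel"}))
	fmt.Println(validateUpdate(old, BuildRunSpec{BuildRef: "other-build"}) != nil)
}
```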

> For system-originated errors where we do intend to retry, we should either stop updating the BuildRun status at all (and only return the error to trigger the retry) or set a temporary failure without a completionTime that we remove once the retry succeeds.

My preference is to just not update the status and retry the reconciliation. The caveat being that if we reach a reconciliation retry limit, then we should update the status with an error message.

@SaschaSchwarze0
Member

Good comments @adambkaplan and @gabemontero. Had a call with @qu1queee just an hour ago on this as well. I think we are totally on the same page here.

Just one question on the following @adambkaplan

> The caveat being that if we reach a reconciliation retry limit, then we should update the status with an error message.

More than half a year ago we also discussed a scenario in a community meeting or issue (I think it was related to the Build-Secret relationship at that time, which has since been handled with a different implementation). How would one implement a way to detect that, for the same reconciliation request, one is in a retry loop that has already been ongoing for n1 retries or n2 minutes?

@adambkaplan
Member

> How would one implement a way to detect that for the same reconciliation request, one is in a retry loop that is already ongoing for n1 retries or n2 minutes?

Hrm, it seems there is no direct way to say "stop reconciling after X failed attempts". However, in reviewing other issues reported on controller-runtime it seems that if the reconciler returns Requeue: true or we return an error, an exponential backoff is used to ensure there is no hot looping [1]. This takes care of my concerns regarding perpetual retries - hopefully if we are doing this due to a bug we have sufficient logs to surface the issue, and the exponential backoff ensures that we don't swamp the controller with bad retries.

[1] kubernetes-sigs/controller-runtime#808

@SaschaSchwarze0
Member

> How would one implement a way to detect that for the same reconciliation request, one is in a retry loop that is already ongoing for n1 retries or n2 minutes?
>
> Hrm, it seems there is no direct way to say "stop reconciling after X failed attempts". However, in reviewing other issues reported on controller-runtime it seems that if the reconciler returns Requeue: true or we return an error, an exponential backoff is used to ensure there is no hot looping [1]. This takes care of my concerns regarding perpetual retries - hopefully if we are doing this due to a bug we have sufficient logs to surface the issue, and the exponential backoff ensures that we don't swamp the controller with bad retries.
>
> [1] kubernetes-sigs/controller-runtime#808

Okay, we are on the same page then. Thanks for clarifying.

@qu1queee
Contributor Author

qu1queee commented Feb 3, 2021

I will provide a PR to address this issue and also to address this enhancement #548 (comment). I'm assigning this issue to myself in the meantime.

@SaschaSchwarze0
Member

I think this should not yet have been closed.

@qu1queee qu1queee added this to the release-v0.4.0 milestone Feb 19, 2021
@qu1queee
Contributor Author

FYI, I'm already working on this one.

@adambkaplan adambkaplan added kind/feature Categorizes issue or PR as related to a new feature. and removed community help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 10, 2021
@adambkaplan adambkaplan changed the title How should we treat a BuildRun? Improve BuildRun Failure Reporting Mar 10, 2021
@adambkaplan adambkaplan changed the title Improve BuildRun Failure Reporting Improve BuildRun Failure State Transitions Mar 10, 2021
@qu1queee
Contributor Author

This was tackled via #641.
