GitHub HTTP 503 response leads to a job not running #207

Open
alfred-stokespace opened this issue May 28, 2024 · 1 comment
Comments

@alfred-stokespace
Contributor

A user in my org contacted me about a job that never ran.

I found an error message like this ...

2024/05/23 21:16:34 failed to process job: failed to check to register runner (target ID: 10ed1fec-041c-4829-ab1c-b7de7ff9e673, job ID: 6bbe1e7e-9b8c-49c7-bbc3-623eee4ca54c): failed to check existing runner in GitHub: failed to get list of runners: failed to list runners: failed to list organization runners: GET https://REDACTED/actions/runners?per_page=100: 503 OrgTenant service unavailable []

I tracked that error down to the call that lists runners for the org:

runners, resp, err := listRunners(ctx, client, owner, repo, opts)

In this particular case, the trace starts in starter.go, in the ProcessJob function, where the "Strict" config is true and checkRegisteredRunner is called.

The result of this 503 is that deleteInstance is called in ProcessJob.

The overall impact of that error is that the runner is deleted, which led to the job never getting worked on.

I contacted GitHub Enterprise support and they responded with the following suggestion...

Encountering a 503 error may occur when the server is temporarily overwhelmed and requires a moment to stabilize. This situation could be attributed to high traffic, maintenance activities, or a brief interruption.

In your specific case, the appearance of the error message "OrgTenant service unavailable" indicates a temporary disruption with the service responsible for managing organization actions/runners.

When confronted with a 503 error, it is advisable to establish a retry mechanism. It is important not to attempt immediate retries but rather consider implementing an exponential backoff strategy. This approach involves increasing the wait time between each retry to allow the server sufficient time to recover and mitigate potential complications.

I'll add a comment showing how I mitigated this with a code change.

@alfred-stokespace
Contributor Author

I noticed that your project's dependencies already include github.com/cenkalti/backoff/v4.

I opted to use that existing dependency rather than add a new one.

I wrapped the listRunners call with something like this ...

// RetryAbleListRunners wraps listRunners so that transient failures are
// retried with backoff instead of immediately failing the job.
func RetryAbleListRunners(ctx context.Context, client *github.Client, owner, repo string, opts *github.ListOptions) (*github.Runners, *github.Response, error) {
	f := func() (*github.Runners, *github.Response, error) {
		return listRunners(ctx, client, owner, repo, opts)
	}

	return RetryingFunction(ctx, f, owner, repo)
}

A call to that function replaces this line:

runners, resp, err := listRunners(ctx, client, owner, repo, opts)
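
so the call site becomes

runners, resp, err := RetryAbleListRunners(ctx, client, owner, repo, opts)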

RetryingFunction establishes a backoff timer ...

// GetBackoffTimer returns an exponential backoff (1s initial interval,
// doubling each retry, capped at 10s elapsed) that is bounded by both
// the context and a maximum retry count.
func GetBackoffTimer(ctx context.Context, maxRetry uint64) backoff.BackOff {
	off := backoff.NewExponentialBackOff()
	off.InitialInterval = 1 * time.Second
	off.Multiplier = 2
	off.MaxElapsedTime = 10 * time.Second
	off.NextBackOff() // burn one, no matter what I do I can't get the initial to be one second!?
	b := backoff.WithMaxRetries(backoff.WithContext(off, ctx), maxRetry)
	return b
}

... so far so good, hope this helps.
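
A note on the "burn one" puzzle above: ExponentialBackOff has a default RandomizationFactor of 0.5, so each NextBackOff() returns a value randomized within ±50% of the current interval. Setting off.RandomizationFactor = 0 should make the first interval exactly one second, without burning a call.

The body of RetryingFunction is not shown in the comment. Below is a minimal sketch of what it could look like, built on backoff.RetryWithData (available in backoff/v4 since v4.2.0). The result struct, the retry-only-on-5xx policy, the retry budget of 5, and the go-github import version are all illustrative assumptions, not the author's actual code.

import (
	"context"

	"github.com/cenkalti/backoff/v4"
	"github.com/google/go-github/v50/github" // use whatever go-github version the project already imports
)

// RetryingFunction runs f under the timer from GetBackoffTimer, retrying
// only when GitHub answers with a 5xx status (such as the 503 above).
// Everything else is treated as a permanent failure.
// NOTE: hypothetical reconstruction; owner and repo are kept only to match
// the signature used in RetryAbleListRunners (e.g. for logging).
func RetryingFunction(ctx context.Context, f func() (*github.Runners, *github.Response, error), owner, repo string) (*github.Runners, *github.Response, error) {
	// RetryWithData returns a single data value, so bundle the two results.
	type result struct {
		runners *github.Runners
		resp    *github.Response
	}

	op := func() (result, error) {
		runners, resp, err := f()
		if err != nil {
			if resp != nil && resp.StatusCode >= 500 {
				// Transient server-side error: let backoff retry it.
				return result{runners, resp}, err
			}
			// Anything else (4xx, auth problems, ...) won't be fixed by
			// waiting, so stop retrying immediately.
			return result{runners, resp}, backoff.Permanent(err)
		}
		return result{runners, resp}, nil
	}

	r, err := backoff.RetryWithData(op, GetBackoffTimer(ctx, 5)) // 5 retries is an assumed budget
	return r.runners, r.resp, err
}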
