Transient RefreshError exceptions from compute_engine credentials are not retried #1562
Comments
I think this is a reasonable thing to add. This repo has an exponential backoff implementation that can be used for retries.
If it helps, I did a brief survey of implementations of this library in other languages. The Go implementation does not seem to retry on a 429 error. The Java implementation also does not seem to retry but, looking at this mapping, the authors may have intended for a 429 response to be retryable. I tried looking into the NodeJS implementation but did not finish; I think it would be retried.
It's not unheard of for requests to the GCE metadata server to fail with a transient error, like a 503, 500, or 429. These requests can and should be retried, and they are in certain code paths. However, one very important code path, the `credentials.refresh(...)` method for credentials taken from the GCE metadata server, does not.
Environment details
`google-auth` version: 2.32.0
Steps to reproduce
http://localhost:8080/ to inject transient 429 errors. Here is an example that will make every other request to a `/token` endpoint fail with status code 429.
`getcreds.py`:
`while true ; do http_proxy=http://localhost:8080/ python3 getcreds.py; sleep 1; done`
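For readers who want to reproduce the "every other request fails with 429" behavior without the proxy setup, a standard-library stub along these lines might be enough. This is only a sketch and not the configuration used in this report: `EveryOtherFailure`, `make_handler`, and `start_flaky_server` are hypothetical names, and the stub does not emulate the real metadata server's headers or response body.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EveryOtherFailure:
    """Hypothetical policy: alternate failure and success for matching paths."""

    def __init__(self, path_suffix="/token", failure_status=429):
        self.path_suffix = path_suffix
        self.failure_status = failure_status
        self._count = 0
        self._lock = threading.Lock()

    def status_for(self, path):
        # Ignore any query string when matching the path.
        if not path.split("?", 1)[0].endswith(self.path_suffix):
            return 200
        with self._lock:
            self._count += 1
            # Odd-numbered matching requests fail, even-numbered ones succeed.
            return self.failure_status if self._count % 2 == 1 else 200


def make_handler(policy):
    """Build a request handler class that answers according to the policy."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status = policy.status_for(self.path)
            body = b"{}" if status == 200 else b"Too Many Requests"
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def start_flaky_server(port=0):
    """Start the stub on localhost in a daemon thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), make_handler(EveryOtherFailure()))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_flaky_server()` and reading `server.server_address` gives a local port to send test requests to; a client that retries transient statuses should succeed on its second attempt against it.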
Note how only 1 attempt is logged for the `/token` endpoint.
We do retry on certain types of errors:
google-auth-library-python/google/auth/compute_engine/_metadata.py
Lines 199 to 212 in d2ab3af
In the scenario I'm complaining about, however, the HTTP request itself completes successfully and only the response code indicates a transient error, so we hit this code path:
google-auth-library-python/google/auth/compute_engine/_metadata.py
Lines 235 to 242 in d2ab3af
The following patch to `google/auth/compute_engine/_metadata.py` "fixes" the reproduction:
I put "fixes" in quotes because in a real failure, a transient error is likely caused by the GCE metadata server or one of its dependencies being overwhelmed, and some degree of exponential backoff should be used. The existing logic makes sense for a timeout, because some time has already been spent waiting.
A separate but related request would be for the `RefreshError` raised to have the `retryable` property set appropriately, so library users can decide what to do on transient failures.
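To illustrate both requests together, here is a rough sketch of what a library user could do if the raised error carried a trustworthy `retryable` flag: retry only flagged errors, with exponential backoff and jitter between attempts. It is not the patch discussed above and does not use the repo's own backoff implementation. `RefreshErrorStandIn` is a hypothetical stand-in for `RefreshError` so the example runs without `google-auth` installed, and `refresh_with_backoff` and its parameters are likewise assumptions.

```python
import random
import time


class RefreshErrorStandIn(Exception):
    """Hypothetical stand-in for RefreshError with a retryable flag."""

    def __init__(self, message, retryable=False):
        super().__init__(message)
        self.retryable = retryable


def refresh_with_backoff(refresh, max_attempts=4, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Call refresh(), retrying only errors flagged as retryable.

    refresh is any zero-argument callable; max_attempts should be at least 1.
    """
    for attempt in range(max_attempts):
        try:
            return refresh()
        except RefreshErrorStandIn as exc:
            last_attempt = attempt + 1 >= max_attempts
            # Re-raise permanent errors, or transient ones once we are out of tries.
            if not exc.retryable or last_attempt:
                raise
            # Exponential backoff with full jitter: wait a random time up to
            # base_delay * 2**attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

With real credentials, the callable would be something like `lambda: credentials.refresh(request)` and the `except` clause would catch the library's `RefreshError` instead of the stand-in. The jitter is a design choice to avoid many clients retrying against an already overwhelmed metadata server at the same instant.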