After manually failover, some write operations keep receiving 403.3 exception. #3580
You are using an old SDK version. Regarding the general behavior of the SDK (https://docs.microsoft.com/en-us/azure/cosmos-db/troubleshoot-sdk-availability): the write region used is the one informed by the service, and it gets refreshed every 5 minutes in the background. Having the SDK go to the same region for 4 hours after the failover was initiated might mean the failover for some reason took longer (maybe related to the outage).

The behavior you see in VS is correct: once the SDK gets the 403.3, we attempt an immediate refresh and retry on the next write region. If the write region is the same (the service did not yet update the account information because the failover did not complete), then the same error will be seen until the account information reflects the new write region.

For the case of your exception, were you logging the
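The 403.3 handling described above can be made visible client-side. A minimal C# sketch, assuming the `container`, `item`, and `logger` names (illustrative, not from the original thread) and assuming sub-status code 3 corresponds to the "403.3" WriteForbidden notation:

```csharp
using System.Net;
using Microsoft.Azure.Cosmos;

try
{
    await container.UpsertItemAsync(item);
}
catch (CosmosException ex) when (
    ex.StatusCode == HttpStatusCode.Forbidden && ex.SubStatusCode == 3)
{
    // 403.3 (WriteForbidden): the region this client wrote to is no longer
    // the write region. The SDK refreshes the account information and retries
    // against the next write region; if the service still reports the old
    // write region, the same error keeps surfacing until the failover completes.
    logger.LogWarning("403.3 on write. Diagnostics: {Diagnostics}", ex.Diagnostics);
    throw;
}
```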
Hi @ealsur. Thank you very much for your reply! We have already upgraded to the latest stable version to get all the improvements. Thank you! But I still have some questions. Sorry, I left out part of the Diagnostics because they are truncated due to the length restriction. You can see we are using SDK version

During the outage, as Cosmos DB declared, the outage should have basically ended during the last two hours of the four hours in which we kept receiving 403.3 after the manual failover. Also, right after we restarted our App Service, the 403.3 immediately disappeared (nothing changed on the Cosmos DB side except the earlier manual failover). So I wonder: could this be related to the original connection failing to retrieve the correct WriteRegion information due to some potential issue, so that restarting the web app led to the connections being re-established? Why did restarting the App Service help resolve this issue? And do you have any suggestions if this occurs again?

Looking forward to hearing from you. Thank you!
It's hard to understand what happened without the full diagnostics; the diagnostics normally include all the network interactions that led to a particular result (so we would see whether the 403.3 was retried or not, and whether we got the account information or not). The SDK version is

How the East US outage materialized was that HTTP requests would fail with timeouts. The HTTP requests to the Gateway (which is the service component that provides the account information) would be affected, so, in theory, the capability of the SDK to rediscover the new regions after you failed over would be affected, but without the full diagnostics it's hard to confirm. I do see the duration in milliseconds being pretty high.

For the next version we have made some improvements (see #3467) that will hopefully lower the impact if another outage of these characteristics materializes. Another thing to point out: there are/were (this SDK version does not tell if you are disposing clients) 7 clients:
For the current SDK, the request to get account information (like the correct read/write region) goes through HTTP, not TCP, right? So will these HTTP connections used to get the account information stay long-lived? If so, that could help explain why the service stopped getting 403.3 responses right after restarting the App Service: restarting the app re-establishes the connections, so the new connections used to get the account read/write region are not affected. And thanks for pointing out the client count; yes, we have 7 clients for various reasons. Mostly we use CosmosClient as a singleton. We will try to reduce the number of clients as a next step. Thank you!
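As a side note, the singleton usage mentioned above is typically wired up through dependency injection. A sketch for ASP.NET Core, where the configuration key name is an illustrative assumption:

```csharp
using Microsoft.Azure.Cosmos;

// Register one CosmosClient for the whole application lifetime so all
// requests share the same connection pools and cached account information.
services.AddSingleton(serviceProvider =>
    new CosmosClient(configuration["CosmosConnectionString"]));
```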
And by the way, we have upgraded to the latest SDK version and will also try to log the CosmosDiagnostics data for exceptions. Could you provide some suggestions on what we can do if this kind of outage occurs again? After a manual failover, should we also restart the app? Thanks!
@blankor1 HTTP connection lifetimes are handled by System.Net.HttpClient; the SDK does not have direct control. It could be that they are being kept long-lived (you can customize the HttpClient through the

Hopefully we will have the new improvements released that should help with such an outage in the future. As for recommendations on how to proceed, I cannot confirm anything about manual failover, because it would be hard to identify whether a future outage is exactly like the one that happened on that date; two outages might seem similar but the actual reason could be completely different. Certainly restarting the app would kill any established connections if
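A minimal sketch of supplying a custom HttpClient to the SDK, under the assumption that bounding the pooled connection lifetime is the goal so that stale connections are not reused indefinitely (the 5-minute value is an illustrative choice, not a recommendation from this thread):

```csharp
using System;
using System.Net.Http;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    // Supply our own HttpClient so we control connection pooling behavior.
    HttpClientFactory = () => new HttpClient(new SocketsHttpHandler
    {
        // Recycle pooled connections periodically instead of keeping
        // them alive indefinitely.
        PooledConnectionLifetime = TimeSpan.FromMinutes(5)
    })
};

CosmosClient client = new CosmosClient(connectionString, options);
```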
Recently I've tried logging the CosmosDiagnostics and found that its size is too large to be logged completely (about 11 KB for a complete log string). Even if we only log the CosmosDiagnostics for exceptions and slow requests (as mentioned in the "checklist for troubleshooting"), the log size will still be too large for us to do complete logging in our production environment. We can use this for local troubleshooting but can't apply it to the prod environment. Also, would it be possible for the Cosmos DB server side to trace the issue using the ActivityId in the CosmosDiagnostics?
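One way to bound the log volume is to emit the diagnostics only past a latency threshold, as the troubleshooting guidance suggests. A sketch where the threshold, `MyItem`, and `logger` are illustrative assumptions:

```csharp
using System;
using Microsoft.Azure.Cosmos;

ItemResponse<MyItem> response = await container.UpsertItemAsync(item);

// Only emit the (large) diagnostics string for slow requests.
if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromMilliseconds(500))
{
    logger.LogWarning("Slow upsert: {Diagnostics}", response.Diagnostics);
}
```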
@blankor1 Diagnostics are going to be smaller in the next version (#3519). Server-side tracing is not available for all operations. Failures are traced on the service, but there are cases (like a failover) where the protocol works as expected (a response of 403.3, then the client fails over) and there is nothing inherently wrong. In other cases you might see network timeouts, where there is nothing wrong with the service (in cases of send timeouts, the service might not even have received the request). Or there are cases of high latency due to network lag, which is something completely client-side that can only be seen through the Diagnostics.
Closing due to lack of activity. Answers were already provided to the relevant questions.
Describe the bug
On Oct 26, during the East US region outage, we performed the manual failover as suggested. But after the manual failover, some write operations kept receiving 403.3 exceptions. This issue stayed active for about 4 hours, and only after we restarted our web app did everything go back to normal.
To Reproduce
Tested, but it seems I can't reproduce this locally. I can see one or two retries in the VS debug output when I do a manual failover, but no exception is thrown.
Expected behavior
API works well.
Actual behavior
Some of the UpsertItemAsync requests received a 403.3 status code.
Environment summary
SDK Version: 3.20.1
OS Version: Windows
Additional context
One of the requests' stack traces and logs: