
After manual failover, some write operations keep receiving 403.3 exceptions #3580

Closed
blankor1 opened this issue Nov 22, 2022 · 9 comments

@blankor1

Describe the bug
On Oct 26, during the East US region outage, we performed a manual failover as suggested. But after the manual failover, some write operations kept receiving a 403.3 exception. The issue stayed active for about 4 hours, and only after we restarted our web app did everything go back to normal.

To Reproduce
Tested, but it seems I can't reproduce this locally. I can see one or two retries in the VS debug output when I do a manual failover, but no exception is thrown.

Expected behavior
The API keeps working normally after the manual failover.

Actual behavior
Some of the UpsertItemAsync requests received a 403.3 status code.

Environment summary
SDK Version: 3.20.1
OS Version: Windows

Additional context
One request's stack trace and logs:

Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: Forbidden (403); Substatus: 3; ActivityId: e65dda88-d137-47d6-bfc9-85acbb5171d1; Reason: (Message: {"Errors":["The requested operation cannot be performed at this region"]}
ActivityId: e65dda88-d137-47d6-bfc9-85acbb5171d1, Request URI: /apps/1a4eb068-d894-4dce-90e6-ec076af008b8/services/554cd209-b1d4-424c-ba09-ad412302af29/partitions/71fd39ab-3717-4d91-a2ae-73242ed0cdf0/replicas/132785429113879436p/, RequestStats: Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, SDK: Windows/10.0.14393 cosmos-netstandard-sdk/3.19.3);
   at Microsoft.Azure.Documents.StoreResult.ToResponse(RequestChargeTracker requestChargeTracker)
   at Microsoft.Azure.Documents.ConsistencyWriter.WritePrivateAsync(DocumentServiceRequest request, TimeoutHelper timeout, Boolean forceRefresh)
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
   at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
   at Microsoft.Azure.Documents.ConsistencyWriter.WriteAsync(DocumentServiceRequest entity, TimeoutHelper timeout, Boolean forceRefresh, CancellationToken cancellationToken)
   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass30_0.<<InvokeAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)
   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)
   at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func`2 prepareRequestAsyncDelegate)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessUpsertAsync(IStoreModel storeProxy, DocumentServiceRequest serviceRequest, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
@ealsur
Member

ealsur commented Nov 22, 2022

You are using an old SDK version (3.19.3); please refer to the changelog and look for "Availability" as the category of improvements made after that version.

Regarding the general behavior of the SDK (https://docs.microsoft.com/en-us/azure/cosmos-db/troubleshoot-sdk-availability): the write region used is the one reported by the service, and it gets refreshed every 5 minutes in the background. Having the SDK go to the same region for 4 hours after the failover was initiated might mean the failover took longer for some reason (maybe related to the outage).

The behavior you see in VS is correct: once the SDK gets the 403.3, we attempt an immediate account refresh and retry on the next write region. If the write region is still the same (the service has not yet updated the account information because the failover did not complete), then the same error will be seen until the account information reflects the new write region.

For your exception, were you logging the ToString of the exception? It looks like it has no Diagnostics. Did this exception bubble up to your user code, or was it part of traces?
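For reference, a minimal sketch of how the full diagnostics could be captured when the exception surfaces (the container, item type, and logger here are placeholders, not from the original report):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Logging;

public record MyItem(string id, string Pk);

public static class UpsertHelper
{
    // Sketch: capture the full client-side diagnostics when an upsert fails.
    // "container", "logger", and "MyItem" are illustrative names.
    public static async Task<ItemResponse<MyItem>> UpsertWithDiagnosticsAsync(
        Container container, ILogger logger, MyItem item)
    {
        try
        {
            return await container.UpsertItemAsync(item, new PartitionKey(item.Pk));
        }
        catch (CosmosException ex)
        {
            // ex.Diagnostics contains the network interactions (regions contacted,
            // retries, timings) that the exception message alone does not show.
            logger.LogError(ex,
                "Upsert failed with {Status}/{SubStatus}. Diagnostics: {Diagnostics}",
                ex.StatusCode, ex.SubStatusCode, ex.Diagnostics);
            throw;
        }
    }
}
```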

@blankor1
Author

Hi @ealsur. Thank you very much for your reply! We have already upgraded to the latest stable version to get all the improvements. But I still have some questions.

Sorry, I left out some parts because the Diagnostics are truncated due to the length restriction:
--- Cosmos Diagnostics ---{"name":"UpsertItemAsync","id":"fcf0bb27-c1d3-4f10-86c9-622871e6cd76","caller info":{"member":"OperationHelperWithRootTraceAsync","file":"ClientContextCore.cs","line":219},"start time":"04:50:10:673","duration in milliseconds":120889.6855,"data":{"Client Configuration":{"Client Created Time Utc":"2022-10-25T10:04:09.0211716Z","NumberOfClientsCreated":7,"User Agent":"cosmos-netstandard-sdk/3.20.1|3.19.3|2|X64|Microsoft Windows 10.0.14393|.NET 6.0.9|N|","ConnectionConfig":{"gw":"(cps:50, urto:10, p:False, httpf: False)","rntbd":"(cto: 5, icto: -1, m--TRUNCATED--

You can see we are using SDK version 3.20.1 in the Diagnostics. I'm sure we are referencing this version, but the log shows the SDK version as 3.19.3. Do you know why this happens? I do see improvements in the release notes, and the directly related one, "Availability: Fixes get account info retry logic to not go to secondary regions on 403 (Forbidden)", is in 3.20.0, which should already be contained in 3.20.1.

Also, according to the Cosmos DB status updates, the outage should have basically ended during the last two of the four hours in which we kept receiving 403.3 after the manual failover. And right after we restarted our App Service, the 403.3 errors immediately disappeared (nothing changed on the Cosmos DB side except the earlier manual failover). So I wonder whether this could be related to the original connections failing to retrieve the correct write region information due to some underlying issue, with restarting the web app causing the connections to be re-established. Why does restarting the App Service help resolve this issue? And do you have any suggestions in case this occurs again?

Looking forward to hearing from you. Thank you!

@ealsur
Member

ealsur commented Nov 23, 2022

It's hard to understand what happened without the full diagnostics; the diagnostics normally include all the network interactions that led to a particular result (so we would see whether the 403.3 was retried and whether we got the account information).

The SDK version is 3.20.1; this is in the user agent: cosmos-netstandard-sdk/3.20.1|3.19.3|2|X64|Microsoft Windows 10.0.14393|.NET 6.0.9|N|. The 3.19.3 is the version of another component inside the SDK and is not relevant (one is next to the other in the string).

The way the East US outage materialized was that HTTP requests would fail with timeouts. HTTP requests to the Gateway (which is the service component that provides the account information) would be affected, so, in theory, the SDK's ability to rediscover the new regions after you failed over would also be affected, but without the full diagnostics it's hard to confirm. I do see the duration in milliseconds being pretty high (120889.6855); that would at least indicate this ran for 2 minutes, which certainly could not have been a single request. But we don't see which regions were involved either (did it retry? In which region?).

For the next version we have made some improvements (see #3467) that will hopefully lower the impact if another outage with these characteristics materializes.

Another thing to point out: there are/were 7 clients ("NumberOfClientsCreated":7; this SDK version does not tell us whether you are disposing clients). Do you expect to work with 7 accounts during the application's lifetime? If yes, that's OK, but if not, please check that you are correctly following the singleton pattern: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-dotnet
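For illustration, a rough sketch of that singleton pattern, assuming an ASP.NET Core minimal-hosting app (the connection string name and regions are placeholders):

```csharp
using Microsoft.Azure.Cosmos;

var builder = WebApplication.CreateBuilder(args);

// Sketch of the recommended singleton pattern: one CosmosClient instance per account
// for the lifetime of the application. "CosmosDb" and the regions are illustrative.
builder.Services.AddSingleton(_ =>
    new CosmosClient(
        builder.Configuration.GetConnectionString("CosmosDb"),
        new CosmosClientOptions
        {
            // Optional: declare the regions the SDK should prefer for this client.
            ApplicationPreferredRegions = new List<string> { Regions.EastUS, Regions.WestUS }
        }));

var app = builder.Build();
app.Run();
```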

@blankor1
Author

For the current SDK, the request to get the account information (like the correct read/write region) goes over HTTP rather than TCP, right? So, will the HTTP connections used to get the account information stay long-lived?

If that is the case, it could help explain why the service stopped getting 403.3 responses right after we restarted the App Service. Restarting the app resolved the issue, which I guess should mean the new connections used to get the account read/write region were not affected.
Could you help confirm this? Thanks a lot!

And thanks for pointing the client number out. Yes, we have 7 clients for various reasons; mostly we use CosmosClient as a singleton. We will try to reduce the number of clients as a next step. Thank you!

@blankor1
Author

And by the way, we have upgraded to the latest SDK version and will also try to log the Cosmos Diagnostics data for exceptions. Could you provide some suggestions on what we can do if this kind of outage occurs again? After a manual failover, should we also restart the app? Thanks!

@ealsur
Member

ealsur commented Nov 25, 2022

@blankor1 HTTP connection lifetimes are handled by System.Net.HttpClient; the SDK does not have direct control. It could be that they are being kept long-lived (you can customize the HttpClient through CosmosClientOptions.HttpClientFactory).

Hopefully we will have the new improvements released that should help with such an outage in the future. As for recommendations on how to proceed, I cannot give a definitive answer about manual failover because it would be hard to identify whether a future outage is exactly like the one that happened on that date. Two outages might seem similar, but the actual reason could be completely different.

Certainly restarting the app would kill any established connections if HttpClient is keeping them alive. There might be some configuration option that can be set to avoid keeping those connections open forever, but I am not currently aware of which one (if there is one, CosmosClientOptions.HttpClientFactory would allow you to configure it).
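As an illustrative sketch (not an official recommendation), the connection lifetime could be bounded through CosmosClientOptions.HttpClientFactory, for example:

```csharp
using System;
using System.Net.Http;
using Microsoft.Azure.Cosmos;

// Sketch with illustrative values: recycle pooled HTTP connections periodically so
// Gateway requests eventually go out on fresh connections instead of living forever.
SocketsHttpHandler socketsHandler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(5)
};

CosmosClientOptions options = new CosmosClientOptions
{
    // The SDK may invoke the factory more than once; reuse the handler and keep
    // HttpClient from disposing it.
    HttpClientFactory = () => new HttpClient(socketsHandler, disposeHandler: false)
};

CosmosClient client = new CosmosClient("<account-endpoint>", "<account-key>", options);
```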

@blankor1
Author

Recently I've tried logging the CosmosDiagnostics and found that it is too large to be logged completely (about 11 KB for a complete log string). Even if we only log the CosmosDiagnostics for exceptions and slow requests (as mentioned in the "checklist for troubleshooting"), the log size will still be too large for us to do complete logging in our production environment. We can do this for local troubleshooting but can't apply it to the prod environment.
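For context, the conditional pattern referenced above looks roughly like this sketch (the 1-second threshold, container, item, and logger are placeholders, and GetClientElapsedTime() assumes a recent SDK version that exposes it):

```csharp
// Sketch: only log diagnostics for slow requests; failures are logged separately
// from the caught CosmosException.Diagnostics.
ItemResponse<MyItem> response = await container.UpsertItemAsync(item, new PartitionKey(item.Pk));

if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromSeconds(1))
{
    logger.LogWarning("Slow upsert took {Elapsed}. Diagnostics: {Diagnostics}",
        response.Diagnostics.GetClientElapsedTime(), response.Diagnostics);
}
```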

Also, would it be possible for the Cosmos DB server side to trace the issue using the ActivityId in the CosmosDiagnostics?

@ealsur
Member

ealsur commented Nov 30, 2022

@blankor1 Diagnostics are going to be smaller in the next version (#3519).

Server-side tracing is not available for all operations. The service traces failures, but there are cases (like a failover) where the protocol behaves as expected (a 403.3 response, the client fails over) and there is nothing inherently wrong.

In other cases you might see network timeouts, where there is nothing wrong with the service (in the case of send timeouts, the service might not even have received the request). Or there are cases of high latency due to network lag, which is something completely client-side that can only be seen through Diagnostics.

@ealsur
Member

ealsur commented Dec 16, 2022

Closing due to lack of activity. Answers were already provided to relevant questions.

@ealsur ealsur closed this as completed Dec 16, 2022