Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ringhash: fix bug where ring hash can be stuck in transient failure despite having available endpoints #7364

Merged
merged 4 commits into from
Aug 20, 2024

Conversation

atollena
Copy link
Collaborator

@atollena atollena commented Jun 28, 2024

See #7363 for a problem description. Added
test cases reproduce the error without the fix, but not reliably, since the ring
may be constructed in a way where we don't get stuck. However running the test
multiple times definitely end up triggering the problem.

The solution is to keep a separate list of available addresses, and when there are no
picks and we trigger connection attempts, try them one at a time.

Fixes #7363.

RELEASE NOTES:

  • ringhash: when used with multiple EDS priorities, fix bug that could prevent a higher priority from recovering from transient failure.

@atollena
Copy link
Collaborator Author

Depends on #7334.

@atollena atollena self-assigned this Jun 28, 2024
@atollena atollena added this to the 1.66 Release milestone Jun 28, 2024
Copy link

codecov bot commented Jun 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.98%. Comparing base (ced812e) to head (67181a1).
Report is 20 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7364      +/-   ##
==========================================
+ Coverage   81.58%   81.98%   +0.40%     
==========================================
  Files         358      360       +2     
  Lines       27388    27528     +140     
==========================================
+ Hits        22345    22570     +225     
+ Misses       3828     3773      -55     
+ Partials     1215     1185      -30     
Files Coverage Δ
internal/testutils/blocking_context_dialer.go 100.00% <100.00%> (ø)
xds/internal/balancer/ringhash/picker.go 86.79% <ø> (+3.21%) ⬆️
xds/internal/balancer/ringhash/ringhash.go 88.88% <100.00%> (+0.17%) ⬆️

... and 40 files with indirect coverage changes

@atollena atollena force-pushed the issue-7363 branch 2 times, most recently from fd418d4 to 4a32e08 Compare July 1, 2024 10:12
Fixes grpc#7363, which has a thorough problem description. The test cases added
reproduce the error without the fix, but not reliably, since the ring may be
constructed in a way where we don't end up getting stuck. Running the test
multiple times definitely end up triggering the problem.

The solution is to keep a separate list of available addresses, and when there
are no picks and we trigger connection attempts, try them one at a time.

RELEASE NOTES:

    ringhash: fix bug that could prevent the balancer to recover from transient failure.
@atollena atollena requested a review from easwars July 18, 2024 08:36
@atollena atollena marked this pull request as ready for review July 18, 2024 08:36
@easwars easwars assigned easwars and unassigned atollena Jul 18, 2024
Copy link
Member

@arvindbr8 arvindbr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, could you please take a look?

Overall looks good to me. One small bummer is that we are adding yet another field to store the subconn(s). I was thinking hard if making AddressMap an ordered Map (by morphing Keys() and Values()) would help. But then indexing the map with numbers becomes an issue.

xds/internal/balancer/ringhash/ringhash.go Show resolved Hide resolved
xds/internal/balancer/ringhash/ringhash.go Outdated Show resolved Hide resolved
@arvindbr8
Copy link
Member

When we publish relnotes, it would probably help users with some more additional context. for eg: this bug could have effected you if ___

@atollena
Copy link
Collaborator Author

Thanks for the review.

One comment, could you please take a look?

Overall looks good to me. One small bummer is that we are adding yet another field to store the subconn(s). I was thinking hard if making AddressMap an ordered Map (by morphing Keys() and Values()) would help. But then indexing the map with numbers becomes an issue.

I am not sure what you suggest exactly. Do you mean that we could use the existing AddressMap, and turn it into an array by sorting it by key, for example?

@arvindbr8
Copy link
Member

Do you mean that we could use the existing AddressMap, and turn it into an array by sorting it by key, for example?

yes, but we dont have to block merge on that. LGTM! Thanks @atollena for taking care of this

@arvindbr8 arvindbr8 assigned easwars and unassigned arvindbr8 Aug 13, 2024
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change is LGTM. A bunch of minor comments on the tests though.

@easwars easwars assigned atollena and unassigned easwars Aug 14, 2024
@arjan-bal arjan-bal modified the milestones: 1.66 Release, 1.67 Release Aug 16, 2024
@atollena atollena assigned easwars and unassigned atollena Aug 17, 2024
@easwars
Copy link
Contributor

easwars commented Aug 20, 2024

Thanks @atollena for handling all the comments and for your continued contributions.

@easwars easwars merged commit ee5cbce into grpc:master Aug 20, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ring_hash stuck in TransientFailure despite having available endpoints
4 participants