[backport] rpc: Add timeout for waiting on semaphore in reconnect #4184

Merged on Apr 5, 2022 (2 commits)

@VadimPlh VadimPlh commented Apr 4, 2022

Cover letter

Backport #4180

It fixes the following failure scenario:

  • The controller leader node tries to reconnect to one of its followers.
  • The follower is isolated, so the reconnect holds the semaphore for the duration of the timeout (which can be more than several seconds).
  • The heartbeat manager tries to send a heartbeat to the isolated node and blocks on the semaphore.
  • Heartbeats to the other nodes wait behind the blocked heartbeat from step 3.
  • The controller leader node concludes that the other nodes have failed as well.

Fixes #4183
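
The scenario above can be illustrated with a minimal sketch. This is not the Redpanda code (which is C++ on Seastar); it is a Python `asyncio` analogy, and the names `send_heartbeat`, `reconnect_gate`, and `wait_timeout` are hypothetical. It assumes one semaphore per peer that a reconnect holds while it is in progress, and shows how bounding the wait on that semaphore lets a heartbeat to an isolated peer fail fast instead of stalling the heartbeats queued behind it.

```python
import asyncio


async def send_heartbeat(reconnect_gate, wait_timeout):
    """Return True if the heartbeat could be sent, False if the gate wait timed out."""
    try:
        # Bound the wait: a reconnect holding the gate can no longer stall us forever.
        await asyncio.wait_for(reconnect_gate.acquire(), timeout=wait_timeout)
    except asyncio.TimeoutError:
        return False
    try:
        return True  # a real implementation would send the heartbeat here
    finally:
        reconnect_gate.release()


async def demo():
    isolated_gate = asyncio.Semaphore(1)  # gate for the isolated follower (hypothetical)
    healthy_gate = asyncio.Semaphore(1)   # gate for a healthy follower (hypothetical)
    await isolated_gate.acquire()         # simulate an in-progress reconnect holding the gate

    # Heartbeats are sent one after another, as in the scenario above.
    to_isolated = await send_heartbeat(isolated_gate, wait_timeout=0.05)
    to_healthy = await send_heartbeat(healthy_gate, wait_timeout=0.05)
    return to_isolated, to_healthy


result = asyncio.run(demo())
```

Without the timeout, the first call would block for as long as the reconnect holds the gate, and the heartbeat to the healthy follower would be delayed with it.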

Release notes

  • Fix stalled heartbeats during a long reconnect timeout

It fixes the following failure scenario:
1) The controller leader node tries to reconnect to one of its followers
2) The follower is isolated, so the reconnect holds the semaphore
for the duration of the timeout
3) The heartbeat manager tries to send a heartbeat to the isolated node
and blocks on the semaphore
4) Heartbeats to the other nodes wait behind the heartbeat from step 3
5) The controller leader node concludes that
the other nodes have failed

Fixes redpanda-data#4071

(cherry picked from commit 7f9fed4)
(cherry picked from commit 3fbdac5)
@VadimPlh VadimPlh requested a review from dotnwat April 4, 2022 19:05
@andrewhsu andrewhsu added this to the v21.11.12 milestone Apr 4, 2022
@andrewhsu andrewhsu linked an issue Apr 4, 2022 that may be closed by this pull request
@VadimPlh VadimPlh merged commit 7f69a6d into redpanda-data:v21.11.x Apr 5, 2022
Successfully merging this pull request may close these issues.

[v21.11.x] Failure in RaftAvailabilityTest.test_follower_isolation
3 participants