
rpc: Add timeout for waiting on semaphore in reconnect #4180

Merged: 2 commits, Apr 4, 2022

Conversation

VadimPlh
Contributor

@VadimPlh VadimPlh commented Apr 4, 2022

Cover letter

This PR fixes the following scenario:

  1. The controller leader node tries to reconnect to one of its followers.
  2. The follower is isolated, so the reconnect holds the semaphore until it times out (which can take more than several seconds).
  3. The heartbeat manager tries to send a heartbeat to the isolated node and blocks on the same semaphore.
  4. Heartbeats to the other nodes wait behind the heartbeat from step 3.
  5. The controller leader node concludes that the other nodes have failed too.
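
The scenario above can be illustrated with a minimal sketch. It is not the Redpanda or Seastar code touched by this PR: it is written in Python, and the `Transport` class, `heartbeat_round` function, node names, and timeout values are all hypothetical. It also assumes that heartbeats are sent one node after another. The only point it shows is that waiting on the semaphore with a timeout lets the heartbeat to the isolated node fail fast, so heartbeats to healthy nodes are not delayed.

```python
import threading
import time


class Transport:
    """Hypothetical per-node connection guarded by a reconnect semaphore."""

    def __init__(self):
        self._sem = threading.Semaphore(1)

    def reconnect(self, hold_seconds, acquired):
        # Simulate a reconnect to an isolated node: hold the semaphore
        # until the (long) connection timeout expires.
        self._sem.acquire()
        acquired.set()
        try:
            time.sleep(hold_seconds)
        finally:
            self._sem.release()

    def send_heartbeat(self, wait_timeout):
        # Wait for the semaphore for at most wait_timeout seconds
        # instead of blocking until the reconnect finishes.
        if not self._sem.acquire(timeout=wait_timeout):
            return False  # give up so the caller can move on
        try:
            return True  # a real implementation would send the RPC here
        finally:
            self._sem.release()


def heartbeat_round(transports, wait_timeout):
    # Assumption: heartbeats are sent sequentially, so one blocked
    # heartbeat would delay all the ones after it.
    return {node: t.send_heartbeat(wait_timeout) for node, t in transports.items()}


def demo():
    transports = {
        "isolated": Transport(),
        "healthy-1": Transport(),
        "healthy-2": Transport(),
    }
    acquired = threading.Event()
    reconnecting = threading.Thread(
        target=transports["isolated"].reconnect, args=(0.5, acquired)
    )
    reconnecting.start()
    acquired.wait()  # make sure the reconnect holds the semaphore first

    start = time.monotonic()
    results = heartbeat_round(transports, wait_timeout=0.05)
    elapsed = time.monotonic() - start

    reconnecting.join()
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = demo()
    print(results, f"round took {elapsed:.2f}s")
```

In this sketch the heartbeat round finishes well before the simulated reconnect releases the semaphore. Only the isolated node is reported as unreachable.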

Fixes #4071

Release notes

  • Fix heartbeats being blocked during a long reconnect timeout

This commit fixes the following scenario:
1) The controller leader node tries to reconnect to one of its followers
2) The follower is isolated, so the reconnect holds the semaphore
until it times out
3) The heartbeat manager tries to send a heartbeat to the isolated node
and blocks on the semaphore
4) Heartbeats to the other nodes wait behind the heartbeat from step 3
5) The controller leader node concludes that
the other nodes have failed

Fixes redpanda-data#4071
@VadimPlh VadimPlh changed the title from "rpc: Add timeout to waiting on semaphore in reconnect" to "rpc: Add timeout for waiting on semaphore in reconnect" Apr 4, 2022
@VadimPlh VadimPlh added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Apr 4, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Apr 4, 2022
@VadimPlh VadimPlh added this to the v21.11.12 milestone Apr 4, 2022
@VadimPlh VadimPlh merged commit e73850d into redpanda-data:dev Apr 4, 2022
@andrewhsu andrewhsu removed this from the v21.11.12 milestone Apr 4, 2022
Development

Successfully merging this pull request may close these issues.

Failure in RaftAvailabilityTest.test_follower_isolation
4 participants