Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

Merged

Conversation

cheniujh
Copy link
Collaborator

@cheniujh cheniujh commented May 22, 2024

此PR修复了Issue #2664

引发 Issue #2664 的原因

  • 在全量同步的最开始,SlaveDB如果10s内没有从master拉取到相关的metaInfo(要传输的文件列表),rsync init失败的分支会将SlaveDB转入Error状态

测试中还跑出过另外一个问题

  • 在主从TCP连接断开后,哪怕之前已经处于connected状态的DB也会直接开始重新全量同步(正确的行为应该是先尝试增量同步)。引发这个问题的原因是:rsync init失败的处理链路上,会调用一次g_pika_server->SetForceFullSync(true),这会导致在下一次从节点在收到MetaSync Resp时(如果主从TCP连接断开,Slave端会再次发出MetaSyn请求,Master会回复MetaSync Resp),直接强制让所有DB进入TryDBSync,而不是TryConnect状态,所有DB会直接强制发起全量同步,这明显也是不合适的。(由于同 SlaveDB will sink into Error state and never retry if the bgsave operation in master end takes more than 10 seconds at the begining of full sync stage #2664 的处理比较耦合,所以没有另提Issue和PR)

此PR是如何修复上述问题的

  • 1 如果SlaveDB在10s内没有拉取到metaInfo(或者说master端的bgsave任务在10s没有做完),就让SlaveDB处于WaitDBSync状态继续重试metaInfo的拉取,直到重试次数达到一个上限(默认64次,累计时间至少是640s),才进入Error状态。
  • 2 在rsync init失败的处理链路上,去掉对g_pika_server->SetForceFullSync(true)的调用。

This PR fixes Issue #2664

Causes of Issue #2664:

  • At the beginning of a full synchronization, if the SlaveDB does not retrieve the relevant metaInfo (the list of files to be transferred) from the master within 10 seconds, the branch where rsync init fails will cause the SlaveDB to enter the Error state.

Another issue observed during testing:

  • After the master-slave TCP connection is disconnected, even if the DB was previously in a connected state, it will immediately start a new full synchronization (the correct behavior should be to attempt incremental synchronization first). The root cause of this issue is that in the error handling path of rsync init failure, g_pika_server->SetForceFullSync(true) is called once. This causes all DBs to enter TryDBSync state instead of TryConnect when the SlaveDB receives a MetaSync Resp (if the master-slave TCP connection is disconnected, the Slave side will send a MetaSync request, and the Master will reply with MetaSync Resp), forcing all DBs to initiate a full synchronization, which is evidently inappropriate. (Due to the tight coupling with the handling of Issue SlaveDB will sink into Error state and never retry if the bgsave operation in master end takes more than 10 seconds at the begining of full sync stage #2664, no separate Issue and PR were created for this.)

How this PR fixes the issues:

  • 1 If the SlaveDB does not retrieve the metaInfo within 10 seconds (or if the master's bgsave task does not complete within 10 seconds), the SlaveDB will stay in WaitDBSync state and continue to retry fetching metaInfo, instead of entering the Error state.
  • 2 Removed the call to g_pika_server->SetForceFullSync(true) in the error handling path of rsync init failure.

… meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state
@github-actions github-actions bot added the ☢️ Bug Something isn't working label May 22, 2024
@cheniujh cheniujh requested a review from baixin01 May 31, 2024 02:47
@AlexStocks AlexStocks merged commit 44112d4 into OpenAtomFoundation:unstable Jun 3, 2024
12 checks passed
@cheniujh cheniujh deleted the new_full_sync_state branch June 24, 2024 03:21
@coderabbitai coderabbitai bot mentioned this pull request Jul 1, 2024
chejinge pushed a commit that referenced this pull request Jul 31, 2024
… State if rsync init failed (#2667)

* make pika Slave DB stay in WaitDBSync state if rsync init failed(Pull meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state

* add MaxRetryCount

---------

Co-authored-by: cjh <[email protected]>
cheniujh added a commit to cheniujh/pika that referenced this pull request Sep 24, 2024
… State if rsync init failed (OpenAtomFoundation#2667)

* make pika Slave DB stay in WaitDBSync state if rsync init failed(Pull meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state

* add MaxRetryCount

---------

Co-authored-by: cjh <[email protected]>
cheniujh added a commit to cheniujh/pika that referenced this pull request Sep 24, 2024
… State if rsync init failed (OpenAtomFoundation#2667)

* make pika Slave DB stay in WaitDBSync state if rsync init failed(Pull meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state

* add MaxRetryCount

---------

Co-authored-by: cjh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.5.5 4.0.0 ☢️ Bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants