fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

cheniujh · 2024-05-22T02:43:55Z

此PR修复了Issue #2664

引发 Issue #2664 的原因：

在全量同步的最开始，SlaveDB如果10s内没有从master拉取到相关的metaInfo(要传输的文件列表)，rsync init失败的分支会将SlaveDB转入Error状态

测试中还跑出过另外一个问题

在主从TCP连接断开后，哪怕之前已经处于connected状态的DB也会直接开始重新全量同步（正确的行为应该是先尝试增量同步）。引发这个问题的原因是：rsync init失败的处理链路上，会调用一次g_pika_server->SetForceFullSync(true)，这会导致在下一次从节点在收到MetaSync Resp时（如果主从TCP连接断开，Slave端会再次发出MetaSyn请求,Master会回复MetaSync Resp），直接强制让所有DB进入TryDBSync，而不是TryConnect状态，所有DB会直接强制发起全量同步，这明显也是不合适的。（由于同 SlaveDB will sink into Error state and never retry if the bgsave operation in master end takes more than 10 seconds at the begining of full sync stage #2664 的处理比较耦合，所以没有另提Issue和PR）

此PR是如何修复上述问题的：

1 如果SlaveDB在10s内没有拉取到metaInfo(或者说master端的bgsave任务在10s没有做完)，就让SlaveDB处于WaitDBSync状态继续重试metaInfo的拉取，直到重试次数达到一个上限（默认64次，累计时间至少是640s），才进入Error状态。
2 在rsync init失败的处理链路上，去掉对g_pika_server->SetForceFullSync(true)的调用。

This PR fixes Issue #2664

Causes of Issue #2664:

At the beginning of a full synchronization, if the SlaveDB does not retrieve the relevant metaInfo (the list of files to be transferred) from the master within 10 seconds, the branch where rsync init fails will cause the SlaveDB to enter the Error state.

Another issue observed during testing:

After the master-slave TCP connection is disconnected, even if the DB was previously in a connected state, it will immediately start a new full synchronization (the correct behavior should be to attempt incremental synchronization first). The root cause of this issue is that in the error handling path of rsync init failure, g_pika_server->SetForceFullSync(true) is called once. This causes all DBs to enter TryDBSync state instead of TryConnect when the SlaveDB receives a MetaSync Resp (if the master-slave TCP connection is disconnected, the Slave side will send a MetaSync request, and the Master will reply with MetaSync Resp), forcing all DBs to initiate a full synchronization, which is evidently inappropriate. (Due to the tight coupling with the handling of Issue SlaveDB will sink into Error state and never retry if the bgsave operation in master end takes more than 10 seconds at the begining of full sync stage #2664, no separate Issue and PR were created for this.)

How this PR fixes the issues:

1 If the SlaveDB does not retrieve the metaInfo within 10 seconds (or if the master's bgsave task does not complete within 10 seconds), the SlaveDB will stay in WaitDBSync state and continue to retry fetching metaInfo, instead of entering the Error state.
2 Removed the call to g_pika_server->SetForceFullSync(true) in the error handling path of rsync init failure.

… meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state

src/pika_rm.cc

… State if rsync init failed (#2667) * make pika Slave DB stay in WaitDBSync state if rsync init failed(Pull meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state * add MaxRetryCount --------- Co-authored-by: cjh <[email protected]>

… State if rsync init failed (OpenAtomFoundation#2667) * make pika Slave DB stay in WaitDBSync state if rsync init failed(Pull meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state * add MaxRetryCount --------- Co-authored-by: cjh <[email protected]>

make pika Slave DB stay in WaitDBSync state if rsync init failed(Pull…

97d41c4

… meta from master timeout) to ensure the slave DB will continue to retry,instead of sinking into Error state

github-actions bot added the ☢️ Bug Something isn't working label May 22, 2024

cheniujh requested review from AlexStocks, wangshao1, chejinge and baixin01 May 22, 2024 02:44

cheniujh added 4.0.0 3.5.5 labels May 22, 2024

baixin01 reviewed May 29, 2024

View reviewed changes

src/pika_rm.cc Show resolved Hide resolved

baixin01 reviewed May 29, 2024

View reviewed changes

src/pika_rm.cc Show resolved Hide resolved

add MaxRetryCount

22fb4ee

cheniujh requested a review from baixin01 May 31, 2024 02:47

Merge branch 'OpenAtomFoundation:unstable' into new_full_sync_state

c0cf7fc

wangshao1 approved these changes Jun 3, 2024

View reviewed changes

AlexStocks approved these changes Jun 3, 2024

View reviewed changes

baixin01 reviewed Jun 3, 2024

View reviewed changes

src/pika_rm.cc Show resolved Hide resolved

baixin01 approved these changes Jun 3, 2024

View reviewed changes

AlexStocks merged commit 44112d4 into OpenAtomFoundation:unstable Jun 3, 2024
12 checks passed

cheniujh deleted the new_full_sync_state branch June 24, 2024 03:21

coderabbitai bot mentioned this pull request Jul 1, 2024

docs: v4.0.0 changelog #2764

Merged

AlexStocks mentioned this pull request Aug 15, 2024

docs:add 355 changelog #2867

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

cheniujh commented May 22, 2024 •

edited

Loading

fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

Conversation

cheniujh commented May 22, 2024 • edited Loading

cheniujh commented May 22, 2024 •

edited

Loading