BUG: Inconsistent data across registry cluster nodes #7151

Closed
Cczzzz opened this issue Oct 30, 2021 · 9 comments
Labels
status/duplicate This issue or pull request already exists

Comments


Cczzzz commented Oct 30, 2021

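When one node in the cluster goes down, the remaining nodes end up with inconsistent registry data when new data is written.
We have identified the cause and can reproduce it reliably; the reproduced behavior matches our hypothesis.
Scenario:
The cluster originally has 3 nodes, and one of them goes down.
After some time, newly registered instances are inconsistent across the remaining nodes (ephemeral instances, using the Distro protocol).
The newly registered instances do still get synchronized correctly after a certain amount of time.
Cause:
The BlockingQueue queue reached through the following fields fills up:
com.alibaba.nacos.common.task.engine.NacosExecuteTaskExecuteEngine#executeWorkers
com.alibaba.nacos.common.task.engine.TaskExecuteWorker#queue
When a node goes down, the other nodes still run tasks that sync data to the down node. These tasks fail and are then resubmitted.

When the queue is drained more slowly than tasks are submitted, the sync tasks for new instances get stuck behind them.
The largest backlog we observed was more than 30,000 tasks blocked in the queue.

The problem is that a sync task targeting the down node takes too long: we measured about 300 ms for the method call shown in the stack trace below.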
2021-10-30 07:17:29,952 WARN [DISTRO] Sync data change failed.

com.alibaba.nacos.api.exception.NacosException: Client not connected.
at com.alibaba.nacos.common.remote.client.RpcClient.asyncRequest(RpcClient.java:727)
at com.alibaba.nacos.core.cluster.remote.ClusterRpcClientProxy.asyncRequest(ClusterRpcClientProxy.java:192)
at com.alibaba.nacos.naming.consistency.ephemeral.distro.v2.DistroClientTransportAgent.syncData(DistroClientTransportAgent.java:95)
at com.alibaba.nacos.core.distributed.distro.task.execute.DistroSyncDeleteTask.doExecuteWithCallback(DistroSyncDeleteTask.java:60)
at com.alibaba.nacos.core.distributed.distro.task.execute.AbstractDistroExecuteTask.run(AbstractDistroExecuteTask.java:64)
at com.alibaba.nacos.common.task.engine.TaskExecuteWorker$InnerWorker.run(TaskExecuteWorker.java:116)
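
To put rough numbers on the blocking described above, here is a minimal sketch of a single FIFO worker on a simulated clock. It is not Nacos code: the class SyncBacklogSketch, the Task type, and the task names are all hypothetical, and it assumes one worker drains the whole backlog and does not model resubmission of failed tasks. The 30,000 and 300 ms figures come from the observations above; the 1 ms figure is only a comparison point for a task that fails fast.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical single-worker FIFO model of the sync queue; not Nacos code. */
final class SyncBacklogSketch {

    /** A queued sync task with a simulated execution cost in milliseconds. */
    static final class Task {
        final String name;
        final long costMillis;

        Task(String name, long costMillis) {
            this.name = name;
            this.costMillis = costMillis;
        }
    }

    /** Builds a queue of failing tasks followed by one sync task for a new instance. */
    static Queue<Task> backlog(int failingTasks, long failingCostMillis) {
        Queue<Task> queue = new ArrayDeque<>();
        for (int i = 0; i < failingTasks; i++) {
            queue.add(new Task("sync-to-down-node-" + i, failingCostMillis));
        }
        queue.add(new Task("sync-new-instance", 1));
        return queue;
    }

    /**
     * Walks the queue head to tail on a simulated clock and returns how long
     * the target task waits before it starts, or -1 if it is not queued.
     */
    static long waitBeforeStartMillis(Queue<Task> queue, String target) {
        long clock = 0;
        for (Task task : queue) {
            if (task.name.equals(target)) {
                return clock;
            }
            clock += task.costMillis;
        }
        return -1;
    }

    public static void main(String[] args) {
        // 30,000 failing tasks at 300 ms each: prints 9000000 (150 minutes).
        System.out.println(waitBeforeStartMillis(backlog(30_000, 300), "sync-new-instance"));
        // The same backlog at 1 ms per failure: prints 30000 (30 seconds).
        System.out.println(waitBeforeStartMillis(backlog(30_000, 1), "sync-new-instance"));
    }
}
```

Under these assumptions, the per-task cost of a failed sync dominates how long a new instance waits, which matches the delayed but eventually correct synchronization seen in the scenario.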

Author

Cczzzz commented Oct 30, 2021

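Normally, a sync task targeting the down node should fail immediately with
throw new NacosException(CLIENT_INVALID_PARAM, "No rpc client related to member: " + member);
That path takes less than 1 ms, instead of attempting to create a connection and ending up taking 300 ms.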
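
The fail-fast behavior described here could look roughly like the sketch below. It only illustrates the idea and is not the Nacos implementation: FailFastSyncSketch, markConnected, markDisconnected, and syncData are hypothetical names, the set of connected members is assumed to be maintained by some other component, and IllegalStateException stands in for NacosException so the example stays self-contained.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical fail-fast guard; names are illustrative, not the Nacos API. */
final class FailFastSyncSketch {

    /** Members that currently have a live connection (assumed to be maintained elsewhere). */
    private final Set<String> connectedMembers = ConcurrentHashMap.newKeySet();

    void markConnected(String member) {
        connectedMembers.add(member);
    }

    void markDisconnected(String member) {
        connectedMembers.remove(member);
    }

    /**
     * Runs the send action only if the member is known to be connected;
     * otherwise rejects immediately without attempting a connection.
     */
    void syncData(String member, Runnable send) {
        if (!connectedMembers.contains(member)) {
            throw new IllegalStateException("No rpc client related to member: " + member);
        }
        send.run();
    }

    public static void main(String[] args) {
        FailFastSyncSketch sync = new FailFastSyncSketch();
        sync.markConnected("node-a");
        sync.syncData("node-a", () -> System.out.println("synced to node-a"));
        try {
            // node-c was never marked connected, so this is rejected up front.
            sync.syncData("node-c", () -> System.out.println("never printed"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the guard is that a rejected task costs a set lookup rather than a connection attempt, so failing tasks stop holding up the worker queue.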

Collaborator

MajorHe1 commented Nov 3, 2021

If you are sure this is a bug, could you submit a PR to fix it?


yb2020 commented Nov 10, 2021

Subscribing. We have been troubled by this cluster data inconsistency on k8s for a long time. Server version 1.4.1.

@linux0x5c

Same here on K8s: after restarting one node, the data becomes inconsistent.

@wangdongyun

This problem is caused by the queue being blocked. For now we rely on monitoring!

@qwertyuzzh

I am running into the same problem now. Is there a good way to solve it?
Nacos image version in use: nacos/nacos-server v2.1.0 b0a4aba28604 2 months ago


whl12345 commented Aug 8, 2022

Hitting the same problem, server version 2.1.0.

@KomachiSion
Collaborator

Refer to #8099

@KomachiSion closed this as not planned Aug 8, 2022
@KomachiSion added the status/duplicate label (This issue or pull request already exists) and removed the status/need feedback label Aug 8, 2022