Primary-replica resync can fail a busy primary #60359

Closed
ywelsch opened this issue Jul 29, 2020 · 3 comments · Fixed by #60634
Assignees
Labels
>bug · :Distributed/CRUD (A catch-all label for issues around indexing, updating and getting a doc by id. Not search.) · Team:Distributed (Meta label for distributed team)

Comments

ywelsch (Contributor) commented Jul 29, 2020

A primary-replica resync, which is triggered on a new primary after failover, causes the primary to fail itself if it encounters an issue. Because the resync is run by sending a primary action to the node itself, scheduled on the write thread pool, a rejection from the write thread pool will cause the resync, and therefore the shard, to be failed.
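To make the failure mode concrete, here is a minimal, self-contained sketch using plain java.util.concurrent (not the actual Elasticsearch classes): a bounded pool with an abort-style rejection policy throws once its single worker and its queue are saturated, which is essentially how EsThreadPoolExecutor/EsAbortPolicy behave for tasks that do not request force execution.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WritePoolRejectionDemo {
    public static void main(String[] args) {
        // One worker, queue capacity 2, abort on rejection -- a scaled-down analogue of the
        // write pool in the report below (pool size 12, queue capacity 200).
        ThreadPoolExecutor writePool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy());

        Runnable slowIndexing = () -> {
            try {
                Thread.sleep(1_000);
            } catch (InterruptedException ignored) {
            }
        };

        // Saturate the worker and fill the queue.
        for (int i = 0; i < 3; i++) {
            writePool.execute(slowIndexing);
        }

        try {
            // Models the resync's primary action arriving at an already busy node: without
            // force execution the task is rejected, and the new primary fails its own shard.
            writePool.execute(slowIndexing);
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e);
        } finally {
            writePool.shutdownNow();
        }
    }
}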

ywelsch added the >bug and :Distributed/CRUD labels Jul 29, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (:Distributed/CRUD)

elasticmachine added the Team:Distributed label Jul 29, 2020
Tim-Brooks (Contributor) commented

I thought resync actions use forceExecution?

ywelsch (Contributor, Author) commented Jul 30, 2020

Sorry for not providing enough context. Here's the stack trace:

[WARN ][o.e.i.e.Engine ] [node] [index][0] failed engine [exception during primary-replica resync]
org.elasticsearch.transport.RemoteTransportException: [node][XYZ:9300][internal:index/seq_no/resync[p]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener$1@6bfe8a7a on EsThreadPoolExecutor[name = XYZ/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@11e4d777[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 297897128]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_241]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_241]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener.onResponse(IndexShardOperationPermits.java:370) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener.onResponse(IndexShardOperationPermits.java:353) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:271) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits.lambda$releaseDelayedOperations$0(IndexShardOperationPermits.java:207) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]
[DEBUG][o.e.i.IndexService ] [node] [index] [0] closing... (reason: [shard failure, reason [exception during primary-replica resync]])

The problem seems to be that IndexShard.acquirePrimaryOperationPermit is not passing through the forceExecution flag to the IndexShardOperationPermits. This means that on a primary promotion, resync can run into rejections.
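For illustration only, here is a hedged, self-contained sketch of the shape of the bug and its fix. The names below (acquire, acquirePrimaryOperationPermitBuggy/Fixed) are hypothetical stand-ins for IndexShard.acquirePrimaryOperationPermit delegating to IndexShardOperationPermits.acquire, not the real method signatures; the point is only that the caller's forceExecution flag has to survive the intermediate call so that the dispatch onto the write pool can bypass rejection.

public class ForceExecutionPropagationSketch {

    // Hypothetical stand-in for the permit machinery, which already supports force
    // execution when dispatching the listener onto the write pool.
    static void acquire(Runnable onPermitAcquired, boolean forceExecution) {
        System.out.println("acquire(forceExecution=" + forceExecution + ")");
        onPermitAcquired.run();
    }

    // Buggy shape: the caller's flag is dropped, so a saturated write pool can still
    // reject the resync even though the caller asked for force execution.
    static void acquirePrimaryOperationPermitBuggy(Runnable onPermitAcquired, boolean forceExecution) {
        acquire(onPermitAcquired, false); // forceExecution silently lost
    }

    // Fixed shape: the flag is propagated, so the dispatch cannot be rejected.
    static void acquirePrimaryOperationPermitFixed(Runnable onPermitAcquired, boolean forceExecution) {
        acquire(onPermitAcquired, forceExecution);
    }

    public static void main(String[] args) {
        Runnable resync = () -> System.out.println("running primary-replica resync");
        acquirePrimaryOperationPermitBuggy(resync, true); // prints forceExecution=false
        acquirePrimaryOperationPermitFixed(resync, true); // prints forceExecution=true
    }
}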

Tim-Brooks added commits that referenced this issue Aug 5, 2020

Currently the transport replication action does not propagate the force
execution parameter when acquiring the indexing permit. The logic to
acquire the index permit supports force execution, so this parameter
should be propagated. Fixes #60359.