Primary-replica resync can fail a busy primary #60359

Closed
ywelsch opened this issue Jul 29, 2020 · 3 comments · Fixed by #60634
Assignees
Labels
>bug · :Distributed/CRUD (A catch-all label for issues around indexing, updating and getting a doc by id. Not search.) · Team:Distributed (Meta label for distributed team)

Comments

ywelsch (Contributor) commented Jul 29, 2020

A primary-replica resync, which is triggered on a new primary after failover, causes the primary to fail itself if it encounters an issue. Because the resync is run by sending a primary action to the node itself, scheduled on the write thread pool, a rejection from the write thread pool will cause the resync, and therefore the shard, to be failed.
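To make the failure mode concrete, here is a minimal, self-contained sketch using plain java.util.concurrent (not the actual Elasticsearch classes): a bounded pool with an abort-style rejection policy throws once its single worker and its queue are saturated, which is essentially how EsThreadPoolExecutor/EsAbortPolicy behave for tasks that do not request force execution.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WritePoolRejectionDemo {
    public static void main(String[] args) {
        // One worker, queue capacity 2, abort on rejection -- a scaled-down analogue of the
        // write pool in the report below (pool size 12, queue capacity 200).
        ThreadPoolExecutor writePool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy());

        Runnable slowIndexing = () -> {
            try {
                Thread.sleep(1_000);
            } catch (InterruptedException ignored) {
            }
        };

        // Saturate the worker and fill the queue.
        for (int i = 0; i < 3; i++) {
            writePool.execute(slowIndexing);
        }

        try {
            // Models the resync's primary action arriving at an already busy node: without
            // force execution the task is rejected, and the new primary fails its own shard.
            writePool.execute(slowIndexing);
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e);
        } finally {
            writePool.shutdownNow();
        }
    }
}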

ywelsch added the >bug and :Distributed/CRUD labels Jul 29, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (:Distributed/CRUD)

elasticmachine added the Team:Distributed label Jul 29, 2020
Tim-Brooks (Contributor) commented

I thought resync actions use forceExecution?

ywelsch (Contributor, Author) commented Jul 30, 2020

Sorry for not providing enough context. Here's the stack trace:

[WARN ][o.e.i.e.Engine ] [node] [index][0] failed engine [exception during primary-replica resync]
org.elasticsearch.transport.RemoteTransportException: [node][XYZ:9300][internal:index/seq_no/resync[p]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener$1@6bfe8a7a on EsThreadPoolExecutor[name = XYZ/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@11e4d777[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 297897128]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_241]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_241]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener.onResponse(IndexShardOperationPermits.java:370) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener.onResponse(IndexShardOperationPermits.java:353) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:271) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits.lambda$releaseDelayedOperations$0(IndexShardOperationPermits.java:207) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]
[DEBUG][o.e.i.IndexService ] [node] [index] [0] closing... (reason: [shard failure, reason [exception during primary-replica resync]])

The problem seems to be that IndexShard.acquirePrimaryOperationPermit is not passing through the forceExecution flag to the IndexShardOperationPermits. This means that on a primary promotion, resync can run into rejections.
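For illustration only, here is a hedged, self-contained sketch of the shape of the bug and its fix. The names below (acquire, acquirePrimaryOperationPermitBuggy/Fixed) are hypothetical stand-ins for IndexShard.acquirePrimaryOperationPermit delegating to IndexShardOperationPermits.acquire, not the real method signatures; the point is only that the caller's forceExecution flag has to survive the intermediate call so that the dispatch onto the write pool can bypass rejection.

public class ForceExecutionPropagationSketch {

    // Hypothetical stand-in for the permit machinery, which already supports force
    // execution when dispatching the listener onto the write pool.
    static void acquire(Runnable onPermitAcquired, boolean forceExecution) {
        System.out.println("acquire(forceExecution=" + forceExecution + ")");
        onPermitAcquired.run();
    }

    // Buggy shape: the caller's flag is dropped, so a saturated write pool can still
    // reject the resync even though the caller asked for force execution.
    static void acquirePrimaryOperationPermitBuggy(Runnable onPermitAcquired, boolean forceExecution) {
        acquire(onPermitAcquired, false); // forceExecution silently lost
    }

    // Fixed shape: the flag is propagated, so the dispatch cannot be rejected.
    static void acquirePrimaryOperationPermitFixed(Runnable onPermitAcquired, boolean forceExecution) {
        acquire(onPermitAcquired, forceExecution);
    }

    public static void main(String[] args) {
        Runnable resync = () -> System.out.println("running primary-replica resync");
        acquirePrimaryOperationPermitBuggy(resync, true); // prints forceExecution=false
        acquirePrimaryOperationPermitFixed(resync, true); // prints forceExecution=true
    }
}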

Tim-Brooks added commits that referenced this issue Aug 5, 2020

Currently the transport replication action does not propagate the force
execution parameter when acquiring the indexing permit. The logic to
acquire the index permit supports force execution, so this parameter
should be propagated. Fixes #60359.