Report more details of unobtainable ShardLock #61255

DaveCTurner · 2020-08-18T09:03:12Z

Today a common cause of a ShardLockObtainFailedException is a shard
being removed from a node and then assigned straight back to it before
the node has had a chance to shut the previous shard instance down. For
instance, this can happen if a node briefly leaves the cluster while
holding a primary with no in-sync replicas.

The message in this case is typically as follows:

obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]

This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"

With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:

obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]
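
As a rough illustration of the idea (not the actual NodeEnvironment code), the sketch below records who took the lock and when, as one immutable value, so that a waiter that times out can report both the holder's details and the age of the lock. The class name TimedLockSketch, the Holder type, describeFailure and the use of IllegalStateException are all assumptions made for this example; the real change stores a Tuple of System.nanoTime() and the details string, and throws ShardLockObtainFailedException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the shard lock internals; names are illustrative only.
public class TimedLockSketch {

    // Immutable pair of (acquisition time, holder details), published through a single
    // volatile field so a failed waiter never sees a time from one holder and details
    // from another.
    private static final class Holder {
        final long acquiredAtNanos;
        final String details;

        Holder(long acquiredAtNanos, String details) {
            this.acquiredAtNanos = acquiredAtNanos;
            this.details = details;
        }
    }

    private final Semaphore mutex = new Semaphore(1);
    private volatile Holder holder = new Holder(System.nanoTime(), "no holder recorded yet");

    public void acquire(long timeoutInMillis, String details) throws InterruptedException {
        if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) {
            setDetails(details);
            return;
        }
        // The real code throws ShardLockObtainFailedException; a JDK exception keeps
        // this sketch self-contained.
        throw new IllegalStateException(describeFailure(timeoutInMillis, details));
    }

    public void release() {
        mutex.release();
    }

    // Called only by the thread that currently holds the mutex.
    public void setDetails(String details) {
        holder = new Holder(System.nanoTime(), details);
    }

    String describeFailure(long timeoutInMillis, String details) {
        Holder current = holder; // read once so time and details belong together
        long ageMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - current.acquiredAtNanos);
        return "obtaining shard lock for [" + details + "] timed out after [" + timeoutInMillis
            + "ms], lock already held for [" + current.details + "] with age [" + ageMillis + "ms]";
    }
}
```

System.nanoTime() is used rather than wall-clock time because it is monotonic, so the reported age cannot go negative if the system clock is adjusted while the lock is held.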

Relates #38807


elasticmachine · 2020-08-18T09:03:14Z

Pinging @elastic/es-distributed (:Distributed/Store)

original-brownbear

LGTM

original-brownbear · 2020-08-18T09:55:00Z

server/src/main/java/org/elasticsearch/env/NodeEnvironment.java

@@ -854,17 +860,23 @@ private void decWaitCount() {
         void acquire(long timeoutInMillis, final String details) throws ShardLockObtainFailedException {
             try {
                 if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) {
-                    lockDetails = details;
+                    lockDetails = Tuple.tuple(System.nanoTime(), details);


NIT: setDetails(details);

dakrone

LGTM

DaveCTurner · 2020-08-19T05:36:02Z

Thanks both


DaveCTurner added the >enhancement, :Distributed/Store, v8.0.0 and v7.10.0 labels Aug 18, 2020

DaveCTurner requested review from dakrone and original-brownbear August 18, 2020 09:03

elasticmachine added the Team:Distributed label Aug 18, 2020

original-brownbear approved these changes Aug 18, 2020


DaveCTurner added 2 commits August 18, 2020 10:59

Merge branch 'master' into 2020-08-18-log-shard-lock-age

ab94d42

CR

48bf874

dakrone approved these changes Aug 18, 2020


DaveCTurner merged commit 98213df into elastic:master Aug 19, 2020

DaveCTurner deleted the 2020-08-18-log-shard-lock-age branch August 19, 2020 05:36

Mpdreamz mentioned this pull request Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

stevejgordon mentioned this pull request Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
