Report more details of unobtainable ShardLock #61255
Today a common reason for a `ShardLockObtainFailedException` is when a shard is removed from a node and then assigned straight back to it again before the node has had a chance to shut the previous shard instance down. For instance, this can happen if a node briefly leaves the cluster holding a primary with no in-sync replicas.

The message in this case is typically as follows:

    obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]

This is pretty hard to interpret, and doesn't raise the important question: "why didn't the shard shut down sooner?"

With this change we reword the message a bit, report the age of the shard lock, and adjust the details to report that the lock is held by a closing shard:

    obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]

Relates elastic#38807
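For readers who want to see the idea in isolation, here is a minimal sketch of a lock that records when it was taken and by whom, so that a timed-out waiter can report the holder's details and the lock's age. It is not the Elasticsearch implementation: `TimedShardLock`, `Holder` and `describeFailure` are hypothetical names, and `IllegalStateException` stands in for `ShardLockObtainFailedException`. The timestamp and details are kept together in one immutable object so that a waiter never reads a mismatched pair.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the internal shard lock; illustration only.
final class TimedShardLock {

    // Immutable pair of (acquisition time, details), published through one volatile field.
    private static final class Holder {
        final long acquiredAtNanos;
        final String details;

        Holder(long acquiredAtNanos, String details) {
            this.acquiredAtNanos = acquiredAtNanos;
            this.details = details;
        }
    }

    private final Semaphore mutex = new Semaphore(1);
    private volatile Holder holder = new Holder(System.nanoTime(), "no details recorded");

    // Takes the lock, or fails with a message naming the current holder and the lock's age.
    void acquire(long timeoutInMillis, String details) {
        try {
            if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) {
                holder = new Holder(System.nanoTime(), details);
                return;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while obtaining shard lock for [" + details + "]", e);
        }
        Holder current = holder; // single read, so time and details belong together
        long ageMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - current.acquiredAtNanos);
        throw new IllegalStateException(describeFailure(details, timeoutInMillis, current.details, ageMillis));
    }

    // Updates the holder's details, e.g. when the shard starts closing; resets the recorded time.
    void setDetails(String details) {
        holder = new Holder(System.nanoTime(), details);
    }

    void release() {
        mutex.release();
    }

    // Builds the reworded failure message quoted in the description above.
    static String describeFailure(String wanted, long timeoutInMillis, String held, long ageMillis) {
        return "obtaining shard lock for [" + wanted + "] timed out after [" + timeoutInMillis
            + "ms], lock already held for [" + held + "] with age [" + ageMillis + "ms]";
    }
}
```

Using `System.nanoTime()` rather than wall-clock time keeps the reported age unaffected by clock adjustments, which matches the choice visible in the diff under review.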
Pinging @elastic/es-distributed (:Distributed/Store)
LGTM
@@ -854,17 +860,23 @@ private void decWaitCount() {
 void acquire(long timeoutInMillis, final String details) throws ShardLockObtainFailedException {
     try {
         if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) {
-            lockDetails = details;
+            lockDetails = Tuple.tuple(System.nanoTime(), details);
NIT: setDetails(details);
LGTM
Thanks both