
fix cluster not able to spin up issue when disk usage exceeds threshold #15258

Merged: 9 commits into opensearch-project:main, Oct 16, 2024

Conversation

@zane-neo (Contributor) commented Aug 15, 2024

Description

Root cause:

  1. The Observability plugin creates a system index during cluster startup (link). If the cluster has an index block persisted in its state, that block is applied to the cluster state during node startup, the index creation fails, and this exception is thrown.
  2. OpenSearch starts the cluster in this method, and the error is thrown repeatedly until the OpenSearch startup command finishes executing.
  3. At the moment the startup command completes, the JVM has only daemon threads and no non-daemon threads left, so the JVM exits and the cluster shuts down (a minimal demonstration of this JVM behavior follows this list).
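
A minimal, self-contained demonstration of the JVM rule point 3 relies on (illustrative only; the class and thread names are made up for this example): once only daemon threads remain, the JVM exits even though those threads are still running.

    // When main() returns, the only remaining thread is a daemon thread,
    // so the JVM exits immediately instead of waiting for it to finish.
    public class DaemonExitDemo {
        public static void main(String[] args) {
            Thread worker = new Thread(() -> {
                try {
                    Thread.sleep(60_000); // stand-in for long-running cluster work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "daemon-worker");
            worker.setDaemon(true); // flip this to false and the JVM waits the full minute
            worker.start();
            System.out.println("main finished; JVM exits because no non-daemon threads remain");
        }
    }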

This PR changes the cluster startup code to start the keepAliveThread, which is a non-daemon thread, before the node starts, so that at least one non-daemon thread is always running and the JVM won't exit (a sketch of the reordering is shown below).
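
A minimal sketch of that reordering, under assumptions: the names KeepAliveOrderingSketch, keepAliveLatch, and startNode() are invented for this illustration, and this is not the actual OpenSearch Bootstrap code. It only shows why starting a non-daemon keep-alive thread before node startup keeps the JVM (and therefore the partially started node) alive even when startup throws.

    import java.util.concurrent.CountDownLatch;

    // Sketch only: start the non-daemon keep-alive thread first, then start the node.
    public class KeepAliveOrderingSketch {
        private static final CountDownLatch keepAliveLatch = new CountDownLatch(1);
        // Threads are non-daemon by default, so this thread keeps the JVM running
        // until the latch is counted down; in this sketch nothing counts it down,
        // so the process keeps running until it is killed.
        private static final Thread keepAliveThread = new Thread(() -> {
            try {
                keepAliveLatch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "keepAlive");

        public static void main(String[] args) {
            keepAliveThread.start(); // the fix: start this before node startup
            startNode();             // even if this throws, the JVM stays up,
                                     // so users can still adjust cluster settings
        }

        private static void startNode() {
            // placeholder for node.start(); simulates the startup failure from the issue
            throw new IllegalStateException("simulated startup failure (index creation blocked)");
        }
    }

With the old ordering (node startup first, keep-alive thread second), a startup failure left the JVM with only daemon threads, which is exactly the shutdown described in the root cause above.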

Related Issues

#14791

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

❌ Gradle check result for d9096b2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dblock (Member) commented Aug 15, 2024

Thanks for opening this! Would like someone who understands the code, maybe @andrross or @mch2 to take a look please?

In the meantime @zane-neo you should get this PR to green, with a CHANGELOG, etc.

@zane-neo (Contributor, Author) replied:

> Thanks for opening this! Would like someone who understands the code, maybe @andrross or @mch2 to take a look please?
>
> In the meantime @zane-neo you should get this PR to green, with a CHANGELOG, etc.

Sure, this is a draft PR to prove that this can fix the issue, but the fix has drawbacks, e.g. it changes a field modifier in ClusterBlocks.java. I'll figure out a better approach to fix this and make the checks green.

❌ Gradle check result for 7ce9886: FAILURE

@zane-neo changed the title from "draft PR to fix cluster not able to spin up issue when disk usage exceeds threshold" to "fix cluster not able to spin up issue when disk usage exceeds threshold" on Aug 16, 2024
❌ Gradle check result for 3e8df68: FAILURE

@zane-neo (Contributor, Author) commented:

> ❌ Gradle check result for c81fbc1: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

java.lang.AssertionError: expected:<11001+ hits> but was:<11000 hits>
	at __randomizedtesting.SeedInfo.seed([10BFA3E8E0AFB2EA:5976454878AC632A]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:120)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at org.opensearch.search.approximate.ApproximatePointRangeQueryTests.testApproximateRangeWithSizeOverDefault(ApproximatePointRangeQueryTests.java:191)

This seems to be a flaky test unrelated to the change.

✅ Gradle check result for c81fbc1: SUCCESS

Review thread on the changed lines:

    keepAliveThread.start();
    node.start();
A member commented:

Naively speaking, if node.start() doesn't succeed, then there is nothing to keep alive, and the JVM shutting down seems like the right thing to do. It seems like this change could lead to a partially-started or not-at-all-started node running indefinitely, because the keepAliveThread will just keep it alive in some sort of zombie state after node.start() failed. What am I missing?

Another member replied:

Naively responding, I've no objection to the JVM shutting down, but having spent way too many hours trying to figure out why the JVM shut down in various situations, I'd be really happy to have at least one thread faithfully logging whatever issues caused it to shut down, before shutting itself down.

@zane-neo (Contributor, Author) replied:

Not every node.start() failure means the JVM should quit. In the case from the corresponding issue, if we're able to keep the JVM running, the user can fix the problem simply by changing the cluster settings, so the fix gives the user the ability to interact with the running (even partially started) cluster.
For cases where node.start() fails and the cluster is not available at all, this fix has no real impact, because the user ultimately has to check the error/fatal logs to fix the root cause. This change doesn't block users in any way, but without it, users in the first case are blocked.
So overall I think this is a positive enhancement for users.

@andrross (Member) commented Oct 16, 2024:

> In the case from the corresponding issue, if we're able to keep the JVM running, the user can fix the problem simply by changing the cluster settings.

This seems like a really blunt change for an extremely specific case. In general, quitting the JVM and letting it restart is preferable to continuing to run in a partially-started state. Prior to this change, a transient failure in node.start() might automatically recover because the JVM would quit and whatever is monitoring the process would restart it. Now there is a risk (in my opinion) that the process will continue running in a non-functional partially-started state that will require human intervention to resolve. How do we know that is not a risk here?

The assumption that anything useful can be done with a partially started node was true in the specific case mentioned but is not true in general.

A member replied:

> but having spent way too many hours trying to figure out why the JVM shut down in various situations, I'd be really happy to have at least one thread faithfully logging whatever issues caused it to shut down, before shutting itself down.

@dbwiddis Not logging why the JVM shut down is clearly a problem. However, I'd argue this change might make things considerably worse. Instead of knowing there's a problem (because the process restarted), we'll instead have a partially-started node running in a crippled state with no indication that something went wrong during startup.

@zane-neo (Contributor, Author) commented Oct 18, 2024:

I agree that in some real edge cases a half-started node might look healthy to monitoring, but it would be more convincing if you could provide examples of this. Let's discuss other possible solutions first:

  1. Confirm with the Observability team whether it's possible to avoid throwing an exception that blocks cluster startup. I have created an issue: [BUG] Observability plugin's system index creation is impacting OpenSearch node starts (observability#1872)
  2. Change the node start code to add a disk usage check and update the state when the cluster manager is elected. This needs much more effort, since the state transitions during startup are complicated and the disk usage check code might need refactoring. It also adds a lot of code review effort; I hope you can help with this if it works out.

I believe the second one would be the long-term solution, and the first one could be the short-term fix. If either works out, I have no objection to rolling back this change. WDYT?

A member replied:

Solution (1) definitely needs to be fixed, particularly in light of #16340, and not just in the short term. (And I just realized now that bug is more serious than I thought it was because of this.)

I agree (2) is the right direction to go generally, but don't see much detail there.

A member replied:

> If either works out, I have no objection to rolling back this change.

Can we roll it back sooner rather than later? The unmerged backport/lack of changelog will make every backport PR creation fail until the main and 2.x changelogs sync up.

@zane-neo (Contributor, Author) replied:

I can create the revert PR now, and I'll look into solution 2 as the long-term solution.

A member replied:

Thanks @zane-neo. Please also consider solution 1 and/or the bug I reported; one of those two issues should be fixed ASAP.

❌ Gradle check result for af9d70d: FAILURE

❌ Gradle check result for 7f24452: FAILURE

✅ Gradle check result for 7f24452: SUCCESS

@dblock (Member) left a comment:

I'm going to merge this as it has been sitting here for a while and is a visible bug fix, but @andrross do let @zane-neo know if you want more changes on top of it.

@dblock merged commit 62081f2 into opensearch-project:main on Oct 16, 2024
38 checks passed
opensearch-trigger-bot pushed a commit that referenced this pull request on Oct 16, 2024:

…ld (#15258)

* fix cluster not able to spin up issue when disk usage exceeds threshold
* Add comment to changes
* Add UT to ensure the keepAliveThread starts before node starts
* remove unused imports
* Fix forbidden API calls check failed issue
* format code
* format code
* change setInstance method to static
* Add countdownlatch in test to coordinate the thread to avoid concureency issue caused test failure

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 62081f2)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request on Oct 16, 2024 (same commit list as above).
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request on Oct 17, 2024 (same commit list as above).
zane-neo added commits to zane-neo/OpenSearch that referenced this pull request on Oct 18, 2024
dbwiddis pushed a commit that referenced this pull request on Oct 18, 2024
@dblock (Member) commented Oct 18, 2024

Thanks @zane-neo for hanging in here with us and @dbwiddis and @andrross for your thoughtful comments and help with moving forward!

Labels: backport 2.x (Backport to 2.x branch)