
Feature(TopologyOpsMonitor): Monitor decommission topology operation #8843

Open

wants to merge 1 commit into base: master
Conversation

@aleksbykov (Contributor) commented Sep 25, 2024

A decommission operation can be terminated for different reasons while the decommission process keeps running on the node, and the node may still be decommissioned successfully. But because the nemesis is terminated by an exception, the cluster health validator can abort the whole test run, since the node status is still "Decommissioning". This started happening with the DecommissionStreamingErr nemesis, where the expected log message could not be found and the operation was aborted by timeout while the decommission continued to run.

To catch such cases, FailedDecommissionOperationMonitoring is introduced. It is a context manager that can be used to safely run a decommission operation, check the node status, and wait for the decommission to finish if the command is aborted or terminated.
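
A minimal sketch of the idea, not the PR's exact code: the status check and the verification step are passed in as callables, and the timeout, poll interval, and all helper names are assumptions made for illustration.

import logging
import time
import traceback
from typing import Callable

LOGGER = logging.getLogger(__name__)


class FailedDecommissionOperationMonitoring:
    """Keep watching a decommission even if the command that started it
    is aborted or terminated, so the node is not left in 'Decommissioning'."""

    def __init__(self, is_decommissioning: Callable[[], bool],
                 verify_decommission: Callable[[], None],
                 timeout: float = 7200.0, poll_interval: float = 30.0):
        self.is_decommissioning = is_decommissioning
        self.verify_decommission = verify_decommission
        self.timeout = timeout
        self.poll_interval = poll_interval

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            return False  # the decommission command finished normally
        LOGGER.warning("Decommission failed with error: %s",
                       traceback.format_exception(exc_type, exc_val, exc_tb))
        # The command was aborted, but the node may still be decommissioning:
        # wait for the operation to finish instead of letting the cluster
        # health validator fail the whole run on the node status.
        deadline = time.monotonic() + self.timeout
        while self.is_decommissioning() and time.monotonic() < deadline:
            time.sleep(self.poll_interval)
        self.verify_decommission()
        return False  # never swallow the original exception

Returning False from __exit__ keeps the original nemesis exception visible to the caller while still leaving the node in a consistent state once the block exits.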

Fix: #8144

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@fruch (Contributor) previously approved these changes Oct 6, 2024 and left a comment


LGTM

@aleksbykov (Contributor, Author) commented

@soyacz, can you take a look?

self.target_node = target_node
self.db_cluster: "BaseScyllaCluster" = target_node.parent_cluster
self.target_node_ip = target_node.ip_address
expected_exception = expected_exception or set()
Review comment (Contributor):

do we need that? From my understanding it is used to skip waiting for the node decommission to finish if the test has ended (a KillNemesis exception was raised).
At least rename it to something that reflects that (a FailedDecommissionOperationMonitoring that expects KillNemesis is weird).
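
A sketch of the rename being suggested here, using a hypothetical parameter name (skip_wait_exceptions) and a stand-in KillNemesis class; the real SCT exception class and the PR's parameter may differ:

import logging

LOGGER = logging.getLogger(__name__)


class KillNemesis(Exception):
    """Stand-in for the exception SCT raises to stop a nemesis at test end."""


class DecommissionMonitoringSketch:
    def __init__(self, skip_wait_exceptions=(KillNemesis,)):
        # Hypothetical, more descriptive name for what the PR calls
        # `expected_exception`: exceptions on which we should NOT wait
        # for the decommission to finish.
        self.skip_wait_exceptions = tuple(skip_wait_exceptions)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None and issubclass(exc_type, self.skip_wait_exceptions):
            LOGGER.info("%s raised: test is ending, skipping the decommission wait",
                        exc_type.__name__)
            return False  # propagate the exception and leave the node as-is
        # ... otherwise fall through to the wait/verify logic ...
        return False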

LOGGER.warning("Decommission failed with error: %s", traceback.format_exception(exc_type, exc_val, exc_tb))
decommission_in_progress = self.is_node_decommissioning()
if not decommission_in_progress:
    self.db_cluster.verify_decommission(self.target_node)
Review comment (Contributor):

verify_decommission terminates the node and does a bunch of stuff - how long does it take? Maybe we could also skip that in case of a KillNemesis exception?
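
To illustrate this second suggestion, a small sketch with the same caveats (KillNemesis is a stand-in, the helper name is hypothetical) that skips the potentially slow verify_decommission during test teardown:

import logging

LOGGER = logging.getLogger(__name__)


class KillNemesis(Exception):
    """Stand-in for SCT's end-of-test nemesis-kill exception."""


def maybe_verify_decommission(exc_type, db_cluster, target_node,
                              skip_exceptions=(KillNemesis,)):
    """Run the expensive verify_decommission (it terminates the node and
    re-validates the cluster) only when the test is not being torn down."""
    if exc_type is not None and issubclass(exc_type, skip_exceptions):
        LOGGER.info("Skipping verify_decommission: %s raised", exc_type.__name__)
        return
    db_cluster.verify_decommission(target_node)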

Labels
backport/6.1 (Need backport to 6.1), backport/6.2, backport/2024.2 (Need backport to 2024.2), Ready for review
Projects
None yet
3 participants