
Feature(TopologyOpsMonitor): Monitor decommission topology operation #8843

Open

wants to merge 1 commit into base: master
Conversation

@aleksbykov (Contributor) commented Sep 25, 2024

A decommission operation can be terminated for different reasons while the decommission process keeps running on the node, and the node may still be decommissioned successfully. But because the nemesis is terminated by an exception, the cluster health validator can abort the whole test run, since the node status is still "Decommissioning". This started happening with the DecommissionStreamingErr nemesis, where the expected log message could not be found and the operation was aborted by timeout while the decommission continued to run.

To catch such cases, FailedDecommissionOperationMonitoring is introduced. It is a context manager that can be used to safely run a decommission operation, check the node status, and wait for the decommission to finish if the command is aborted or terminated.
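
A minimal sketch of the idea, not the PR's exact code: the status check and the verification step are passed in as callables, and the timeout, poll interval, and all helper names are assumptions made for illustration.

import logging
import time
import traceback
from typing import Callable

LOGGER = logging.getLogger(__name__)


class FailedDecommissionOperationMonitoring:
    """Keep watching a decommission even if the command that started it
    is aborted or terminated, so the node is not left in 'Decommissioning'."""

    def __init__(self, is_decommissioning: Callable[[], bool],
                 verify_decommission: Callable[[], None],
                 timeout: float = 7200.0, poll_interval: float = 30.0):
        self.is_decommissioning = is_decommissioning
        self.verify_decommission = verify_decommission
        self.timeout = timeout
        self.poll_interval = poll_interval

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            return False  # the decommission command finished normally
        LOGGER.warning("Decommission failed with error: %s",
                       traceback.format_exception(exc_type, exc_val, exc_tb))
        # The command was aborted, but the node may still be decommissioning:
        # wait for the operation to finish instead of letting the cluster
        # health validator fail the whole run on the node status.
        deadline = time.monotonic() + self.timeout
        while self.is_decommissioning() and time.monotonic() < deadline:
            time.sleep(self.poll_interval)
        self.verify_decommission()
        return False  # never swallow the original exception

Returning False from __exit__ keeps the original nemesis exception visible to the caller while still leaving the node in a consistent state once the block exits.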

Fix: #8144

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@fruch (Contributor) previously approved these changes Oct 6, 2024 and left a comment


LGTM

@aleksbykov (Contributor, Author) commented

@soyacz, can you take a look?

self.target_node = target_node
self.db_cluster: "BaseScyllaCluster" = target_node.parent_cluster
self.target_node_ip = target_node.ip_address
expected_exception = expected_exception or set()
Review comment (Contributor):

do we need that? From my understanding it is used to skip waiting for the node decommission to finish if the test has ended (a KillNemesis exception was raised).
At least rename it to something that reflects that (a FailedDecommissionOperationMonitoring that expects KillNemesis is weird).
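
A sketch of the rename being suggested here, using a hypothetical parameter name (skip_wait_exceptions) and a stand-in KillNemesis class; the real SCT exception class and the PR's parameter may differ:

import logging

LOGGER = logging.getLogger(__name__)


class KillNemesis(Exception):
    """Stand-in for the exception SCT raises to stop a nemesis at test end."""


class DecommissionMonitoringSketch:
    def __init__(self, skip_wait_exceptions=(KillNemesis,)):
        # Hypothetical, more descriptive name for what the PR calls
        # `expected_exception`: exceptions on which we should NOT wait
        # for the decommission to finish.
        self.skip_wait_exceptions = tuple(skip_wait_exceptions)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None and issubclass(exc_type, self.skip_wait_exceptions):
            LOGGER.info("%s raised: test is ending, skipping the decommission wait",
                        exc_type.__name__)
            return False  # propagate the exception and leave the node as-is
        # ... otherwise fall through to the wait/verify logic ...
        return False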

LOGGER.warning("Decommission failed with error: %s", traceback.format_exception(exc_type, exc_val, exc_tb))
decommission_in_progress = self.is_node_decommissioning()
if not decommission_in_progress:
    self.db_cluster.verify_decommission(self.target_node)
Review comment (Contributor):

verify_decommission terminates the node and does a bunch of stuff - how long does it take? Maybe we could also skip that in case of a KillNemesis exception?
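
To illustrate this second suggestion, a small sketch with the same caveats (KillNemesis is a stand-in, the helper name is hypothetical) that skips the potentially slow verify_decommission during test teardown:

import logging

LOGGER = logging.getLogger(__name__)


class KillNemesis(Exception):
    """Stand-in for SCT's end-of-test nemesis-kill exception."""


def maybe_verify_decommission(exc_type, db_cluster, target_node,
                              skip_exceptions=(KillNemesis,)):
    """Run the expensive verify_decommission (it terminates the node and
    re-validates the cluster) only when the test is not being torn down."""
    if exc_type is not None and issubclass(exc_type, skip_exceptions):
        LOGGER.info("Skipping verify_decommission: %s raised", exc_type.__name__)
        return
    db_cluster.verify_decommission(target_node)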

Labels
backport/6.1 (Need backport to 6.1), backport/6.2, backport/2024.2 (Need backport to 2024.2), Ready for review
Projects
None yet
3 participants