
Issues while scaling down the nodes #317

Closed

anrajme opened this issue Jun 20, 2022 · 3 comments

Comments
@anrajme

anrajme commented Jun 20, 2022

Hi there -

We had a few issues lately while the underlying K8s nodes were scaled down. During this event, the pods are evicted (killed and recreated on another node), which is expected. However, stackstorm-ha reported a few issues. Initially it was with the RabbitMQ stateful set, where node failures caused executions to be stuck in a "Scheduled" status forever. I'm trying to get rid of this trouble by shifting the RabbitMQ service to a managed cloud provider.
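For the RabbitMQ part, the plan is roughly the values override below. This is just a sketch, assuming the chart's rabbitmq.enabled toggle and the st2.config passthrough into st2.conf (worth double-checking against your chart version); the hostname and credentials are placeholders:

rabbitmq:
  # skip deploying the in-cluster RabbitMQ StatefulSet entirely
  enabled: false

st2:
  config: |
    # point all st2 services at the external/managed RabbitMQ
    # (placeholder host and credentials)
    [messaging]
    url = amqp://st2admin:CHANGEME@rabbitmq.example.com:5672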

Now, the recent problem is with st2actionrunner, where a pod got evicted while executing a workflow. The execution was marked as "abandoned" and the workflow failed.


# st2 execution get 62b019ba420e073fb8f432c3
id: 62b019ba420e073fb8f432c3
action.ref: jira.update_field_value
context.user: xxxxx
parameters:
  field: customfield_14297
  issue_key: xx-96233
  value: Closing Jira 
status: abandoned
start_timestamp: Mon, 20 Jun 2022 06:54:50 UTC
end_timestamp:
log:
  - status: requested
    timestamp: '2022-06-20T06:54:50.171000Z'
  - status: scheduled
    timestamp: '2022-06-20T06:54:50.348000Z'
  - status: running
    timestamp: '2022-06-20T06:54:50.408000Z'
  - status: abandoned
    timestamp: '2022-06-20T06:54:50.535000Z'
result: None

In this case we still had another four healthy actionrunners running, even though the one executing the workflow failed.

Wondering whether this is expected behaviour and acceptable for the stackstorm-ha architecture?

cheers!

@arm4b
Member

arm4b commented Jun 20, 2022

Somewhat similar: StackStorm/st2#4716
It's an issue with how the StackStorm engine itself handles a sudden stop of the actionrunners that were running tasks in the workflow.

@anrajme
Author

anrajme commented Jun 20, 2022

Thanks @armab. I have updated the original issue StackStorm/st2#4716. Looks like this is going to be a game-changing requirement, especially in the K8s HA environment, where node/pod kills and restarts are comparatively more frequent than in the traditional deployment model.

@cognifloyd
Member

Closing as a duplicate of StackStorm/st2#4716

cognifloyd closed this as not planned (duplicate) on Jan 28, 2023