
Issues while scaling down the nodes #317

Closed

anrajme opened this issue Jun 20, 2022 · 3 comments

Comments
@anrajme

anrajme commented Jun 20, 2022

Hi there -

We had a few issues lately while the underlying K8s nodes were scaled down. During this event, the pods are evicted (killed and recreated on another node), which is expected. However, stackstorm-ha reported a few issues. Initially it was with the RabbitMQ stateful set, where node failures caused executions to be stuck in a "Scheduled" status forever. I'm trying to get rid of this trouble by shifting the RabbitMQ service to a managed cloud provider.
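For the RabbitMQ part, the plan is roughly the values override below. This is just a sketch, assuming the chart's rabbitmq.enabled toggle and the st2.config passthrough into st2.conf (worth double-checking against your chart version); the hostname and credentials are placeholders:

rabbitmq:
  # skip deploying the in-cluster RabbitMQ StatefulSet entirely
  enabled: false

st2:
  config: |
    # point all st2 services at the external/managed RabbitMQ
    # (placeholder host and credentials)
    [messaging]
    url = amqp://st2admin:CHANGEME@rabbitmq.example.com:5672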

Now, the recent problem is with st2actionrunner, where a pod got evicted while executing a workflow. The execution was marked as "abandoned" and the workflow failed.


# st2 execution get 62b019ba420e073fb8f432c3
id: 62b019ba420e073fb8f432c3
action.ref: jira.update_field_value
context.user: xxxxx
parameters:
  field: customfield_14297
  issue_key: xx-96233
  value: Closing Jira 
status: abandoned
start_timestamp: Mon, 20 Jun 2022 06:54:50 UTC
end_timestamp:
log:
  - status: requested
    timestamp: '2022-06-20T06:54:50.171000Z'
  - status: scheduled
    timestamp: '2022-06-20T06:54:50.348000Z'
  - status: running
    timestamp: '2022-06-20T06:54:50.408000Z'
  - status: abandoned
    timestamp: '2022-06-20T06:54:50.535000Z'
result: None

In this case we still had another four healthy actionrunners running, even though the one executing the workflow failed.

Wondering whether this is expected behaviour and acceptable for the stackstorm-ha architecture?

cheers!

@arm4b
Member

arm4b commented Jun 20, 2022

Somewhat similar: StackStorm/st2#4716
It's an issue with how the StackStorm engine itself handles a sudden stop of the actionrunners that were running tasks in the workflow.

@anrajme
Author

anrajme commented Jun 20, 2022

Thanks @armab. I have updated the original issue StackStorm/st2#4716. Looks like this is going to be a game-changing requirement, especially in the K8s HA environment, where node/pod kills and restarts are comparatively more frequent than in the traditional deployment model.

@cognifloyd
Member

Closing as a duplicate of StackStorm/st2#4716

cognifloyd closed this as not planned (duplicate) on Jan 28, 2023