Issues while scaling down the nodes #317
Comments
Somewhat similar: StackStorm/st2#4716
Thanks @armab. I have updated the original issue StackStorm/st2#4716. It looks like this is going to be a game-changing requirement, especially in a K8s HA environment where node/pod kills and restarts are considerably more frequent than in the traditional deployment model.
Closing as a duplicate of StackStorm/st2#4716
Hi there -
We had a few issues lately while the underlying K8s nodes scaled down. During this event, the pods are evicted (killed and recreated) on another node, which is expected. However, stackstorm-ha reported a few issues. Initially it was with the stateful set, where RabbitMQ node failures caused events to get stuck in a "Scheduled" status forever. I'm trying to get rid of this trouble by shifting the RabbitMQ service to a managed cloud service provider.
Now, the recent problem is with st2actionrunner, where a pod got evicted while executing a workflow. The execution was marked as "abandoned" and the workflow failed.
In this case we still had four other healthy actionrunners running; only the one that was executing the workflow failed.
Wondering whether this is expected and acceptable behaviour for the stackstorm-ha architecture?
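As a possible partial mitigation for voluntary disruptions like node scale-down, a PodDisruptionBudget can cap how many actionrunner pods are evicted at once. A minimal sketch only; the label selector and namespace are assumptions (check the labels your chart release actually applies), and note that a PDB limits voluntary evictions but does not recover an execution that has already been abandoned:

```yaml
# Sketch: limit concurrent voluntary evictions of actionrunner pods.
# The selector label below is an assumption; verify with:
#   kubectl get pods -n <namespace> --show-labels
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: st2actionrunner-pdb
spec:
  # Keep at least this many actionrunner pods running during drains/scale-down.
  minAvailable: 3
  selector:
    matchLabels:
      app: st2actionrunner   # hypothetical label, adjust to your release
```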
cheers!