Death of st2actionrunner process causes action to remain running forever #4716
If the action runner process dies unexpectedly while an action execution is still in progress, the action execution will be stuck in a running state because the action runner process didn't get a chance to update the database. We've recently added service discovery capability to the action runner. We will be adding garbage collection shortly to clean up these orphaned action executions and set them to something like an abandoned status. This will trigger the workflow execution to fail. When implemented, the service discovery feature will require users to configure a coordination backend such as a redis server to work alongside StackStorm. |
That sounds like a plan. Thanks. Meanwhile, how can I get abandoned actions to restart? That is crucial because most of our workflows run for a month or more, so if a box gets slammed, we see our workflows either fail or become stuck. It would be okay in our case to restart the specific failed action from the top, because our actions are all idempotent. |
For the orquesta workflow engine, rerunning/restarting a WF from a failed task is not supported yet. It is currently WIP and planned for a future release. |
Triggering the workflow to fail after one of the child tasks is abandoned seems like a sane default, but in many cases I'd like to be given the option to "retry" the abandoned task since many of our actionrunner failures are due to transient issues. Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this. |
This would be adequate for our use cases. Otherwise Orquesta basically makes it impossible to put the machines running the workflow engine in maintenance mode. We would prefer the workflow's complete state (including published variables and current threads) be captured in a persistent manner within the database, such that the workflow can restart if the workflow engine is moved to a different box. This would be essentially what Jenkins does w.r.t. pipelines when the master restarts -- it persists the state of the pipelines, then when it reconnects to slaves, it catches up with what the slaves were doing. |
I think there are different things being communicated here. As I understand it: 1) there is the case where the action runner dies while executing an action for a task, which leaves the task and workflow stuck in a running state. This is the original issue here. 2) You want to be able to rerun the task when the task and workflow execution failed as a result of 1. 3) You want to be able to pause the workflow execution, bring up another workflow engine, and resume the execution on the new server. Item 3 already works today. You can pause the workflow execution. The state of the execution is saved to MongoDB. Then you can bring up a new workflow engine using the same st2.conf and shut down the old workflow engine. Resume the workflow execution and the workflow engine will pick up where it left off. If you are running different versions of st2, be careful that there are no breaking changes between versions. For item 1, per the solution described above, we plan to implement garbage collection that will abandon action executions where the action runner hosting them dies while the execution is stuck in running. For item 2, we have a WIP feature to rerun a workflow execution from one or more failed tasks. We will make sure this supports item 1 where the action execution is abandoned. |
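For item 3, the pause/resume flow described above looks roughly like this with the st2 CLI (the execution ID is illustrative):

```
# Pause the running workflow execution; its state is persisted to MongoDB.
st2 execution pause 5d1489d502ebd81bfe7b4733

# ...swap in the new workflow engine node, then resume on it:
st2 execution resume 5d1489d502ebd81bfe7b4733
```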
Awesome. Now if the mistral language supported retries, it'd be even better. Ultimately the big concern is that stopping st2 components sometimes causes actions to get stuck in the running state and never end, becoming unpauseable or uncancellable. Hope this gets ironed out.
|
@m4dcoder can you explain the recently added GC process when a workflow task is stuck? i.e. if an actionrunner machine reboots. I'm reading the code and it appears that if a task gets stuck, the GC will kill the whole workflow execution. If this is the case, I don't think that is desired behavior. I think the stuck task should fail, but the workflow should be able to handle the failure, with a retry or other workflow path. |
Also, I think the action runner and workflow engine need to support a "warm shutdown" TERM signal to the process. The idea being that they should finish their work before they exit, minimizing orphaned actions or lost workflow state. For the workflow engine, this may mean initiating a pausing/paused transition before shutting down the process. For the action runner, this may mean that it stops accepting any new work and completes its currently running work before exiting (with a hard timeout value). We use this type of behavior for Celery workers today. See: http://docs.celeryproject.org/en/master/userguide/workers.html#stopping-the-worker |
+1, in a Kubernetes environment all services die, respawn, and get rescheduled to other nodes on a regular basis. It's a real-world requirement that StackStorm handle these cases as a normal situation, especially thinking about it in an HA context. |
Currently, GC will cancel the whole workflow execution when a task is stuck.
Per the solution described here for this issue, which we haven't implemented yet, when GC abandons the action execution, it has the same effect as failing the action execution and the task execution, which will trigger whatever cleanup is defined in the workflow definition. |
@m4dcoder ok, is anyone working on GC for the action execution / actionrunner restart scenario? |
This is not currently prioritized for the next v3.2 release, and work on v3.2 has already started. If this is something the community needs, st2 is open source and we welcome contributions. We will dedicate time to help with and review code contributions. |
Hi there - checking whether this is still on the road map for any upcoming release? This requirement has real significance in the stackstorm-ha world, especially since nodes/pods get killed/restarted often in the k8s world compared to the traditional deployment model. |
Yeah. Kubernetes rollouts of new packs (using the st2packs sidecar containers built for the purpose) restart action runners, which means the restarted action runners usually leave actions behind, "running" as ghosts, which obviously torpedoes our long-running workflows. The garbage collector also does not collect tasks in "running" state by default. And tasks whose executors have gone AWOL simply cannot be canceled from the UI (it directs the user to look at the developer console, see screenshot). To add to that complication, orquesta still has no default retry behavior for when a task is abandoned by a transient failure of this type -- we end up having to code retries on every task in each workflow, which is relatively easy for us workflow developers to screw up. These deficiencies make StackStorm usage in a modern production environment a very difficult pitch. Truly great in theory -- in practice very painful to deploy and maintain. |
Hi, is there any updates on this ticket? My team is trying to deploy StackStorm HA but we are running into this issue which isn't acceptable for our use case. :( |
Looks like you also have to increase the terminationGracePeriodSeconds in your chart. The default is 30 seconds. |
Looks like there are similar settings for the workflowengine. Again, you will also have to set the terminationGracePeriodSeconds in your chart to a sane value. |
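In the Helm chart values, the grace-period override looks something like the fragment below. The key paths are illustrative and should be checked against your stackstorm-ha chart version:

```yaml
# values.yaml (illustrative): give runner/engine pods time to drain work
st2actionrunner:
  terminationGracePeriodSeconds: 660   # chart default is 30
st2workflowengine:
  terminationGracePeriodSeconds: 660
```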
Make sure that your timeouts are all set correctly relative to your action timeouts. Most of our actions time out after 10 minutes. So we set the following to at least allow the action timeouts to trigger before the graceful shutdowns.
With these settings we are hoping that, worst case, actions actually get abandoned, because the actionrunner shutdown method will have enough time to abandon them before the k8s pod timeout hits 20 seconds later. We also set the st2workflowengine timeouts:
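The st2.conf side of this might look like the fragment below. The exact option names are an assumption from memory and have changed across st2 releases, so verify them against your version's configuration reference before use:

```ini
; st2.conf (illustrative; option names vary by st2 version)
[actionrunner]
graceful_shutdown = True
; seconds to wait for running executions to finish before exiting
exit_still_active_check = 660

[workflowengine]
graceful_shutdown = True
exit_still_active_check = 660
```

The idea is that the service-level timeout fires comfortably before the pod's terminationGracePeriodSeconds, so the runner gets to mark its work as abandoned rather than being SIGKILLed mid-update.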
Also, when you run helm upgrade be sure to extend the timeout using |
Why the sensible options discussed here are not defaults is a mystery to me. |
SUMMARY
Using StackStorm 3.0.1, if something kills an st2actionrunner process supervising a python-script action runner, and this action execution is part of a workflow execution, the action execution remains forever in the running state, regardless of the parameters: timeout setting in the workflow. What I'd like to see is the action being rescheduled to another st2actionrunner, or at the very least timed out, so that a retry in the workflow can deal with the problem.
(It is also not clear how StackStorm deals with the death of an st2actionrunner supervising an orquesta action runner.)
This is not an HA setup, but nothing in the code or documentation leads me to believe that the expected behavior is to just hang a workflow execution when the underlying action runner supervisor process is gone. Imagine a machine in an HA setup crashing while ongoing workflows are executing actions on it: all workflows whose actions were running there just hang, never even timing out.
We expect to be able to run StackStorm for weeks on end, with long-running workflows that survive the death or reboot of a machine that is part of the StackStorm cluster.
OS / ENVIRONMENT / INSTALL METHOD
Standard non-HA recommended setup in Ubuntu 16.04
STEPS TO REPRODUCE
Create workflow with one Python action that runs sleep 60 via subprocess.
Start workflow with st2 run.
Kill st2actionrunner supervising the Python action.
Wait forever.
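A minimal python-runner action for step 1 of the repro might look like the sketch below. The pack layout and action name are hypothetical, and a stand-in for the st2 Action base class is included so the sketch runs outside a StackStorm install:

```python
# actions/long_sleep.py -- hypothetical python-script action for the repro
import subprocess

try:
    from st2common.runners.base_action import Action
except ImportError:
    # Stand-in base class so this sketch is self-contained without st2.
    class Action(object):
        def __init__(self, config=None):
            self.config = config

class LongSleepAction(Action):
    def run(self, seconds=60):
        # The child process spawned here keeps running even if the
        # supervising st2actionrunner process is killed mid-execution,
        # which is exactly the situation this issue describes.
        subprocess.run(["sleep", str(seconds)], check=True)
        return {"slept": seconds}
```

Killing the st2actionrunner while this action runs (step 3) leaves the execution in the running state, since no surviving process updates its status in the database.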