Death of st2actionrunner process causes action to remain running forever #4716

Closed
Rudd-O opened this issue Jun 21, 2019 · 22 comments
Labels
bug HA StackStorm in High Availability K8s runners

Comments

@Rudd-O

Rudd-O commented Jun 21, 2019

SUMMARY

Using StackStorm 3.0.1, if something kills an st2actionrunner process supervising a python-script action, and that action execution is part of a workflow execution, the action execution remains in the running state forever, regardless of the timeout set in the workflow's parameters.

What I'd like to see is the action being rescheduled to another st2actionrunner, or at the very least timed out, so that a retry in the workflow can deal with the problem.

(It is also not clear how StackStorm deals with the death of an st2actionrunner supervising an orquesta action runner.)

This is not an HA setup, but nothing in the code or documentation leads me to believe that the expected behavior is to simply hang a workflow execution when the underlying action runner supervisor process is gone. Imagine a machine in an HA setup crashing while workflows are executing actions on it: all workflows whose actions were running there just hang, never even timing out.

We expect to be able to run StackStorm for weeks on end, with long-running workflows that survive the death or reboot of a machine that is part of the StackStorm cluster.

OS / ENVIRONMENT / INSTALL METHOD

Standard recommended non-HA setup on Ubuntu 16.04

STEPS TO REPRODUCE

Create a workflow with one Python action that runs sleep 60 via subprocess (a sketch follows below).
Start the workflow with st2 run.
Kill the st2actionrunner process supervising the Python action.
Wait forever.
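
For reference, a minimal sketch of such a python-script action, assuming a standard pack layout (the file, action, and class names here are illustrative, not from the original report):

  # actions/sleep_forever.yaml (illustrative metadata)
  name: sleep_forever
  runner_type: python-script
  entry_point: sleep_forever.py
  description: Sleeps long enough to kill the supervising st2actionrunner by hand.

  # actions/sleep_forever.py (illustrative entry point)
  import subprocess

  from st2common.runners.base_action import Action

  class SleepForever(Action):
      def run(self):
          # Block for 60 seconds so the supervising st2actionrunner
          # process can be killed while the action is still running.
          subprocess.check_call(["sleep", "60"])
          return True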

@m4dcoder
Contributor

If the action runner process dies unexpectedly while an action execution is still executing, the action execution will be stuck in a running state because the action runner process didn't get a chance to update the database. We've recently added a service discovery capability to the action runner. We will be adding garbage collection shortly to clean up these orphaned action executions and set them to something like an abandoned status. This will trigger the workflow execution to fail. When implemented, the service discovery feature will require users to configure a coordination backend, such as a Redis server, to work alongside StackStorm.
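
For context, a minimal sketch of what that coordination backend looks like in st2.conf, assuming a local Redis server (the URL below is an example, not taken from this thread):

  [coordination]
  url = redis://localhost:6379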

@Rudd-O
Author

Rudd-O commented Jun 24, 2019

That sounds like a plan. Thanks.

Meanwhile, how can I get abandoned actions to restart? That is crucial because most of our workflows run for a month or more, so if a box gets slammed, we see our workflows either fail or become stuck. It would be okay in our case to restart the specific failed action from the top, because our actions are all idempotent.

@m4dcoder
Contributor

For the orquesta workflow engine, rerunning/restarting a workflow from a failed task is not supported yet. It is currently a WIP and planned for a future release.

@trstruth
Member

Triggering the workflow to fail after one of the child tasks is abandoned seems like a sane default, but in many cases I'd like to be given the option to "retry" the abandoned task since many of our actionrunner failures are due to transient issues. Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this.

@Rudd-O
Author

Rudd-O commented Jun 26, 2019

Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this.

This would be adequate for our use cases. Otherwise Orquesta basically makes it impossible to put the machines running the workflow engine in maintenance mode.

We would prefer the workflow's complete state (including published variables and current threads) be captured in a persistent manner within the database, such that the workflow can restart if the workflow engine is moved to a different box. This would be essentially what Jenkins does w.r.t. pipelines when the master restarts -- it persists the state of the pipelines, then when it reconnects to slaves, it catches up with what the slaves were doing.

@m4dcoder
Contributor

m4dcoder commented Jun 27, 2019

I think there are different things being communicated here. As I understand it: 1) there is the case where the action runner dies while executing an action for a task, which leaves the task and workflow stuck in a running state. This is the original issue here. 2) You want to be able to rerun the task when the task and workflow execution failed as a result of 1. 3) You want to be able to pause the workflow execution, bring up another workflow engine, and resume the execution on the new server.

Item 3 already works today. You can pause the workflow execution. The state of the execution is saved to MongoDB. Then you can bring up a new workflow engine using the same st2.conf and shut down the old workflow engine. Resume the workflow execution and the workflow engine will pick up where it left off. If you are running different versions of st2, be careful that there are no breaking changes between versions.
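
For reference, the pause/resume flow described above maps roughly onto the st2 CLI as follows (the execution ID is a placeholder):

  # Pause the running workflow execution; its state persists to MongoDB.
  st2 execution pause <execution-id>

  # Bring up the new workflow engine with the same st2.conf, shut down the
  # old one, then resume; the workflow picks up where it left off.
  st2 execution resume <execution-id>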

For item 1, per the solution described above, we plan to implement garbage collection that will abandon action executions whose hosting action runner died while the execution was stuck in running.

For item 2, we have a WIP feature to rerun a workflow execution from one or more failed tasks. We will make sure this supports item 1 where the action execution is abandoned.

@Rudd-O
Author

Rudd-O commented Jul 25, 2019 via email

@johnarnold

@m4dcoder can you explain the recently added GC process when a workflow task is stuck? i.e. if an actionrunner machine reboots.

I'm reading the code and it appears that if a task gets stuck, the GC will kill the whole workflow execution. If this is the case, I don't think that is desired behavior. I think the stuck task should fail, but the workflow should be able to handle the failure, with a retry or other workflow path.

@johnarnold

Also, I think the action runner and workflow engine need to support a "warm shutdown" TERM signal to the process. The idea being that they should finish their work before they exit, minimizing orphaned actions or lost workflow state.

For the workflow engine, this may mean initiating a pausing/paused transition before shutting down the process.

For the action runner, this may mean that it stops accepting any new work and completes its currently running work before exiting (with a hard timeout value).

We use this type of behavior for Celery workers today. See: http://docs.celeryproject.org/en/master/userguide/workers.html#stopping-the-worker
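
For illustration only, a generic sketch of the warm-shutdown pattern being described (this is not StackStorm code; all names here are invented): on SIGTERM, stop accepting new work, drain the in-flight work, and enforce a hard timeout.

  import queue
  import signal
  import threading
  import time

  work_queue = queue.Queue()
  shutting_down = threading.Event()
  HARD_TIMEOUT = 30  # seconds to wait for in-flight work before exiting anyway

  def handle_sigterm(signum, frame):
      # Stop accepting new work; the loop below drains what is already queued.
      shutting_down.set()

  signal.signal(signal.SIGTERM, handle_sigterm)

  def worker_loop():
      deadline = None
      while True:
          if shutting_down.is_set():
              if deadline is None:
                  deadline = time.monotonic() + HARD_TIMEOUT
              if work_queue.empty() or time.monotonic() > deadline:
                  break  # drained cleanly, or hard timeout reached
          try:
              job = work_queue.get(timeout=1)
          except queue.Empty:
              continue
          job()  # run the in-flight job to completion

  if __name__ == "__main__":
      worker_loop()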

@arm4b
Member

arm4b commented Aug 9, 2019

+1, in a Kubernetes environment all services die, respawn, and get rescheduled to other nodes on a regular basis.

It's a real requirement that StackStorm handle these cases as a normal situation, especially in an HA context.

@m4dcoder
Contributor

m4dcoder commented Aug 9, 2019

Currently, GC will cancel the workflow execution if it has been past the max idle time without any activity (i.e. active == a task execution still executing). The current GC does not cover the case where the action runner died or rebooted in the middle of executing an action. When this happens, the action execution is stuck in a running state, and so is the corresponding task execution record. That will not trigger the current GC to clean up the workflow execution. Note that this GC functionality is disabled by default in v3.1.

per the solution described above, we plan to implement garbage collection that will abandon action executions whose hosting action runner died while the execution was stuck in running.

Per the solution for this issue, which we haven't implemented yet, when GC abandons the action execution, it has the same effect as failing the action execution and the task execution, which will trigger whatever cleanup is defined in the workflow definition.

@johnarnold

@m4dcoder ok, is anyone working on GC for the action execution / actionrunner restart scenario?

@m4dcoder
Contributor

m4dcoder commented Aug 9, 2019

This is not currently prioritized for the next v3.2 release, and we have already started on v3.2. If this is something the community needs, st2 is open source and we welcome contributions. We will dedicate time to help with and review code contributions.

@anrajme

anrajme commented Jun 20, 2022

Hi there - checking whether this is still on the roadmap for any upcoming release? This requirement has real significance in the stackstorm-ha world, especially since nodes/pods get killed and restarted far more often in the k8s world than in the traditional deployment model.

@DFINITYManu

Yeah. Kubernetes rollouts of new packs (using the st2packs sidecar containers built for that purpose) restart action runners, and the restarted action runners usually leave actions behind, "running" as ghosts, which obviously torpedoes our long-running workflows. The garbage collector does not collect tasks in the "running" state by default either. And tasks whose executors have gone AWOL simply cannot be canceled from the UI (it directs the user to look at the developer console; see screenshot).

To add to that complication, a default retry behavior for when a task is abandoned is still not implemented in orquesta, so when a transient failure of this type happens we end up having to code retries on every task in each workflow, which is relatively easy for us workflow developers to screw up.

These deficiencies make StackStorm usage in a modern production environment a very difficult pitch. Truly great in theory -- in practice very painful to deploy and maintain.

(screenshot omitted)

@bell-manz

Hi, are there any updates on this ticket? My team is trying to deploy StackStorm HA, but we are running into this issue, which isn't acceptable for our use case. :(

@guzzijones
Contributor

If the kill signal is sent to an actionrunner, it should wait until the action finishes if you have graceful shutdown on. Do you have graceful shutdown enabled in the config? There are also exit timeout and sleep delay settings.

@guzzijones
Contributor

Looks like you also have to increase terminationGracePeriodSeconds in your chart. The default is 30 seconds.

@guzzijones
Contributor

Looks like there are similar settings for the workflowengine. Again, you will also have to set terminationGracePeriodSeconds in your chart to a sane value.

@guzzijones
Contributor

guzzijones commented Oct 31, 2023

Make sure that your shutdown timeouts are all set correctly relative to your action timeouts.

Most of our actions time out after 10 minutes, so we set the following to allow the action timeouts to trigger before the graceful shutdowns do.

  1. Action timeouts: 600 seconds for most of our actions. We have a couple set to 900, but we want to cover most.
  2. actionrunner: graceful shutdown settings under the config section of values.yaml. Set the exit check a bit longer than the action timeout.

       [actionrunner]
       graceful_shutdown = True
       exit_still_active_check = 610
       still_active_check_interval = 10

  3. terminationGracePeriodSeconds in values.yaml for the action runner:

       st2actionrunner:
         terminationGracePeriodSeconds: 630

With these settings we are hoping that, worst case, actions actually get abandoned, because the actionrunner shutdown method will have enough time to abandon them before the k8s pod termination kicks in 20 seconds later.

Also we set the st2workflowengine timeouts:

  st2workflowengine:
    terminationGracePeriodSeconds: 630

Also, when you run helm upgrade, be sure to extend the timeout using --timeout 20m.
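
For example, a hedged invocation along those lines (the release and chart names are placeholders for your own deployment):

  helm upgrade <release-name> <chart> -f values.yaml --timeout 20m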

@guzzijones
Contributor

2 more notes:

  1. do NOT put inline comments in your config file
  2. you must enable service_registry = True under the [coordination] section of the config for graceful shutdown to wait for actions to finish (see the sketch below).
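
For clarity, a minimal st2.conf sketch of that setting (the Redis URL is an assumed example, matching the coordination backend discussed earlier in this thread):

  [coordination]
  service_registry = True
  url = redis://localhost:6379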

@Rudd-O
Author

Rudd-O commented Dec 20, 2023

Why the sensible options discussed here are not the defaults is a mystery to me.
