Death of st2actionrunner process causes action to remain running forever #4716

Closed
Rudd-O opened this issue Jun 21, 2019 · 22 comments
Labels
bug HA StackStorm in High Availability K8s runners

Comments

@Rudd-O

Rudd-O commented Jun 21, 2019

SUMMARY

Using StackStorm 3.0.1, if something kills an st2actionrunner process supervising a python-script action, and that action execution is part of a workflow execution, the action execution remains in the running state forever, regardless of the timeout set in the workflow's parameters.

What I'd like to see is the action being rescheduled to another st2actionrunner, or at the very least timed out, so that a retry in the workflow can deal with the problem.

(It is also not clear how StackStorm deals with the death of an st2actionrunner supervising an orquesta action runner.)

This is not an HA setup, but nothing in the code or documentation leads me to believe that the expected behavior is to simply hang a workflow execution when the underlying action runner supervisor process is gone. Imagine a machine in an HA setup crashing while workflows are executing actions on it: all workflows whose actions were running there just hang, never even timing out.

We expect to be able to run StackStorm for weeks on end, with long-running workflows that survive the death or reboot of a machine that is part of the StackStorm cluster.

OS / ENVIRONMENT / INSTALL METHOD

Standard recommended non-HA setup on Ubuntu 16.04

STEPS TO REPRODUCE

Create a workflow with one Python action that runs sleep 60 via subprocess (a sketch follows below).
Start the workflow with st2 run.
Kill the st2actionrunner process supervising the Python action.
Wait forever.
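
For reference, a minimal sketch of such a python-script action, assuming a standard pack layout (the file, action, and class names here are illustrative, not from the original report):

  # actions/sleep_forever.yaml (illustrative metadata)
  name: sleep_forever
  runner_type: python-script
  entry_point: sleep_forever.py
  description: Sleeps long enough to kill the supervising st2actionrunner by hand.

  # actions/sleep_forever.py (illustrative entry point)
  import subprocess

  from st2common.runners.base_action import Action

  class SleepForever(Action):
      def run(self):
          # Block for 60 seconds so the supervising st2actionrunner
          # process can be killed while the action is still running.
          subprocess.check_call(["sleep", "60"])
          return True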

@m4dcoder
Contributor

If the action runner process dies unexpectedly while an action execution is still executing, the action execution will be stuck in a running state because the action runner process didn't get a chance to update the database. We've recently added a service discovery capability to the action runner. We will be adding garbage collection shortly to clean up these orphaned action executions and set them to something like an abandoned status. This will trigger the workflow execution to fail. When implemented, the service discovery feature will require users to configure a coordination backend, such as a Redis server, to work alongside StackStorm.
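
For context, a minimal sketch of what that coordination backend looks like in st2.conf, assuming a local Redis server (the URL below is an example, not taken from this thread):

  [coordination]
  url = redis://localhost:6379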

@Rudd-O
Author

Rudd-O commented Jun 24, 2019

That sounds like a plan. Thanks.

Meanwhile, how can I get abandoned actions to restart? That is crucial because most of our workflows run for a month or more, so if a box gets slammed, we see our workflows either fail or become stuck. It would be okay in our case to restart the specific failed action from the top, because our actions are all idempotent.

@m4dcoder
Contributor

For the orquesta workflow engine, rerunning/restarting a workflow from a failed task is not supported yet. It is currently a WIP and planned for a future release.

@trstruth
Member

Triggering the workflow to fail after one of the child tasks is abandoned seems like a sane default, but in many cases I'd like to be given the option to "retry" the abandoned task since many of our actionrunner failures are due to transient issues. Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this.

@Rudd-O
Author

Rudd-O commented Jun 26, 2019

Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this.

This would be adequate for our use cases. Otherwise Orquesta basically makes it impossible to put the machines running the workflow engine in maintenance mode.

We would prefer the workflow's complete state (including published variables and current threads) be captured in a persistent manner within the database, such that the workflow can restart if the workflow engine is moved to a different box. This would be essentially what Jenkins does w.r.t. pipelines when the master restarts -- it persists the state of the pipelines, then when it reconnects to slaves, it catches up with what the slaves were doing.

@m4dcoder
Contributor

m4dcoder commented Jun 27, 2019

I think there are different things being communicated here. As I understand it: 1) there is the case where the action runner dies while executing an action for a task, which leaves the task and workflow stuck in a running state. This is the original issue here. 2) You want to be able to rerun the task when the task and workflow execution failed as a result of 1. 3) You want to be able to pause the workflow execution, bring up another workflow engine, and resume the execution on the new server.

Item 3 already works today. You can pause the workflow execution. The state of the execution is saved to MongoDB. Then you can bring up a new workflow engine using the same st2.conf and shut down the old workflow engine. Resume the workflow execution and the workflow engine will pick up where it left off. If you are running different versions of st2, be careful that there are no breaking changes between versions.
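
For reference, the pause/resume flow described above maps roughly onto the st2 CLI as follows (the execution ID is a placeholder):

  # Pause the running workflow execution; its state persists to MongoDB.
  st2 execution pause <execution-id>

  # Bring up the new workflow engine with the same st2.conf, shut down the
  # old one, then resume; the workflow picks up where it left off.
  st2 execution resume <execution-id>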

For item 1, per the solution described above, we plan to implement garbage collection that will abandon action executions whose hosting action runner died while the execution was stuck in running.

For item 2, we have a WIP feature to rerun a workflow execution from one or more failed tasks. We will make sure this supports item 1 where the action execution is abandoned.

@Rudd-O
Author

Rudd-O commented Jul 25, 2019 via email

@johnarnold

@m4dcoder can you explain the recently added GC process when a workflow task is stuck? i.e. if an actionrunner machine reboots.

I'm reading the code and it appears that if a task gets stuck, the GC will kill the whole workflow execution. If this is the case, I don't think that is desired behavior. I think the stuck task should fail, but the workflow should be able to handle the failure, with a retry or other workflow path.

@johnarnold

Also, I think the action runner and workflow engine need to support a "warm shutdown" TERM signal to the process. The idea being that they should finish their work before they exit, minimizing orphaned actions or lost workflow state.

For the workflow engine, this may mean initiating a pausing/paused transition before shutting down the process.

For the action runner, this may mean that it stops accepting any new work and completes its currently running work before exiting (with a hard timeout value).

We use this type of behavior for Celery workers today. See: http://docs.celeryproject.org/en/master/userguide/workers.html#stopping-the-worker
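
For illustration only, a generic sketch of the warm-shutdown pattern being described (this is not StackStorm code; all names here are invented): on SIGTERM, stop accepting new work, drain the in-flight work, and enforce a hard timeout.

  import queue
  import signal
  import threading
  import time

  work_queue = queue.Queue()
  shutting_down = threading.Event()
  HARD_TIMEOUT = 30  # seconds to wait for in-flight work before exiting anyway

  def handle_sigterm(signum, frame):
      # Stop accepting new work; the loop below drains what is already queued.
      shutting_down.set()

  signal.signal(signal.SIGTERM, handle_sigterm)

  def worker_loop():
      deadline = None
      while True:
          if shutting_down.is_set():
              if deadline is None:
                  deadline = time.monotonic() + HARD_TIMEOUT
              if work_queue.empty() or time.monotonic() > deadline:
                  break  # drained cleanly, or hard timeout reached
          try:
              job = work_queue.get(timeout=1)
          except queue.Empty:
              continue
          job()  # run the in-flight job to completion

  if __name__ == "__main__":
      worker_loop()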

@arm4b
Member

arm4b commented Aug 9, 2019

+1, in a Kubernetes environment all services die, respawn, and get rescheduled to other nodes on a regular basis.

It's a real requirement that StackStorm handle these cases as a normal situation, especially in an HA context.

@m4dcoder
Contributor

m4dcoder commented Aug 9, 2019

Currently, GC will cancel the workflow execution if it has been past the max idle time without any activity (i.e. active == a task execution still executing). The current GC does not cover the case where the action runner died or rebooted in the middle of executing an action. When this happens, the action execution is stuck in a running state, and so is the corresponding task execution record. That will not trigger the current GC to clean up the workflow execution. Note that this GC functionality is disabled by default in v3.1.

per the solution described above, we plan to implement garbage collection that will abandon action executions whose hosting action runner died while the execution was stuck in running.

Per the solution for this issue, which we haven't implemented yet, when GC abandons the action execution, it has the same effect as failing the action execution and the task execution, which will trigger whatever cleanup is defined in the workflow definition.

@johnarnold

@m4dcoder ok, is anyone working on GC for the action execution / actionrunner restart scenario?

@m4dcoder
Contributor

m4dcoder commented Aug 9, 2019

This is not currently prioritized for the next v3.2 release, and we have already started on v3.2. If this is something the community needs, st2 is open source and we welcome contributions. We will dedicate time to help with and review code contributions.

@anrajme

anrajme commented Jun 20, 2022

Hi there - checking whether this is still on the roadmap for any upcoming release? This requirement has real significance in the stackstorm-ha world, especially since nodes/pods get killed and restarted far more often in the k8s world than in the traditional deployment model.

@DFINITYManu

Yeah. Kubernetes rollouts of new packs (using the st2packs sidecar containers built for that purpose) restart action runners, and the restarted action runners usually leave actions behind, "running" as ghosts, which obviously torpedoes our long-running workflows. The garbage collector does not collect tasks in the "running" state by default either. And tasks whose executors have gone AWOL simply cannot be canceled from the UI (it directs the user to look at the developer console; see screenshot).

To add to that complication, a default retry behavior for when a task is abandoned is still not implemented in orquesta, so when a transient failure of this type happens we end up having to code retries on every task in each workflow, which is relatively easy for us workflow developers to screw up.

These deficiencies make StackStorm usage in a modern production environment a very difficult pitch. Truly great in theory -- in practice very painful to deploy and maintain.

(screenshot omitted)

@bell-manz

Hi, are there any updates on this ticket? My team is trying to deploy StackStorm HA, but we are running into this issue, which isn't acceptable for our use case. :(

@guzzijones
Contributor

If the kill signal is sent to an actionrunner, it should wait until the action finishes if you have graceful shutdown on. Do you have graceful shutdown enabled in the config? There are also exit timeout and sleep delay settings.

@guzzijones
Contributor

Looks like you also have to increase terminationGracePeriodSeconds in your chart. The default is 30 seconds.

@guzzijones
Contributor

Looks like there are similar settings for the workflowengine. Again, you will also have to set terminationGracePeriodSeconds in your chart to a sane value.

@guzzijones
Contributor

guzzijones commented Oct 31, 2023

Make sure that your shutdown timeouts are all set correctly relative to your action timeouts.

Most of our actions time out after 10 minutes, so we set the following to allow the action timeouts to trigger before the graceful shutdowns do.

  1. Action timeouts: 600 seconds for most of our actions. We have a couple set to 900, but we want to cover most.
  2. actionrunner: graceful shutdown settings under the config section of values.yaml. Set the exit check a bit longer than the action timeout.

       [actionrunner]
       graceful_shutdown = True
       exit_still_active_check = 610
       still_active_check_interval = 10

  3. terminationGracePeriodSeconds in values.yaml for the action runner:

       st2actionrunner:
         terminationGracePeriodSeconds: 630

With these settings we are hoping that, worst case, actions actually get abandoned, because the actionrunner shutdown method will have enough time to abandon them before the k8s pod termination kicks in 20 seconds later.

Also we set the st2workflowengine timeouts:

  st2workflowengine:
    terminationGracePeriodSeconds: 630

Also, when you run helm upgrade, be sure to extend the timeout using --timeout 20m.
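
For example, a hedged invocation along those lines (the release and chart names are placeholders for your own deployment):

  helm upgrade <release-name> <chart> -f values.yaml --timeout 20m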

@guzzijones
Contributor

2 more notes:

  1. do NOT put inline comments in your config file
  2. you must enable service_registry = True under the [coordination] section of the config for graceful shutdown to wait for actions to finish (see the sketch below).
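
For clarity, a minimal st2.conf sketch of that setting (the Redis URL is an assumed example, matching the coordination backend discussed earlier in this thread):

  [coordination]
  service_registry = True
  url = redis://localhost:6379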

@Rudd-O
Author

Rudd-O commented Dec 20, 2023

Why the sensible options discussed here are not the defaults is a mystery to me.
