Enable kubernetes_pod_operator to reattach_on_restart when the worker dies #21900

Open
1 of 2 tasks
yeachan153 opened this issue Mar 1, 2022 · 4 comments

@yeachan153
Contributor

yeachan153 commented Mar 1, 2022

Description

The kubernetes_pod_operator currently has a reattach_on_restart parameter that attempts to reattach to running pods instead of creating a new pod in case a scheduler dies while the task is running.
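The reattach behaviour described above can be pictured with a small, self-contained sketch. It does not use Airflow or the Kubernetes client: `FakeCluster`, `get_or_create_pod`, and the `task_key` lookup are hypothetical names made up for illustration, and the real operator's pod lookup may work differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for a Kubernetes cluster."""

    # Maps a task-identifying key to the name of its running pod.
    running_pods: Dict[str, str] = field(default_factory=dict)
    created: int = 0

    def find_running_pod(self, task_key: str) -> Optional[str]:
        return self.running_pods.get(task_key)

    def create_pod(self, task_key: str) -> str:
        self.created += 1
        name = f"{task_key}-pod-{self.created}"
        self.running_pods[task_key] = name
        return name


def get_or_create_pod(cluster: FakeCluster, task_key: str,
                      reattach_on_restart: bool) -> str:
    """Reuse a running pod for the task if allowed, else create a new one."""
    if reattach_on_restart:
        existing = cluster.find_running_pod(task_key)
        if existing is not None:
            return existing  # reattach instead of launching a duplicate
    return cluster.create_pod(task_key)
```

With `reattach_on_restart=True`, a second call for the same task returns the pod that is already running. With `False`, every call launches a fresh pod.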

We would like this feature to also work when the worker dies. Currently, a dying worker receives a SIGTERM, which triggers the on_kill method:

self.task.on_kill()

This ends up deleting the pod that was created.
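A simplified sketch of that flow, for readers unfamiliar with it. This is not Airflow's actual code: `FakePodTask`, `TaskTerminated`, `make_sigterm_handler`, and `install_sigterm_handler` are hypothetical names, and the pod is just a boolean flag. Only the shape is taken from the description above: the SIGTERM handler calls `on_kill`, and `on_kill` removes the pod.

```python
import signal


class FakePodTask:
    """Hypothetical stand-in for a task that launched a pod."""

    def __init__(self):
        self.pod_running = True  # pretend a pod was created

    def on_kill(self):
        # In the scenario described above, this is where the pod is deleted.
        self.pod_running = False


class TaskTerminated(Exception):
    """Raised by the handler so the running task stops."""


def make_sigterm_handler(task):
    """Build a signal handler that cleans up the task and aborts it."""

    def handler(signum, frame):
        task.on_kill()
        raise TaskTerminated("task received SIGTERM")

    return handler


def install_sigterm_handler(task):
    # signal.signal must be called from the main thread.
    return signal.signal(signal.SIGTERM, make_sigterm_handler(task))
```

Because the handler always calls `on_kill`, the pod is gone by the time any retry runs, so there is nothing left to reattach to.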

We currently work around this problem by removing the on_kill call on SIGTERM and instead pushing an xcom indicating that the worker was killed. We then enabled retries for the kubernetes_pod_operator and modified the is_eligible_to_retry function to check for the presence of this xcom, so that a retry happens only when the worker was killed.

Unfortunately, this is not a perfect solution because clearing a task / stopping a task via the UI triggers the same signal handler as when a worker is killed externally. Therefore, with this workaround, stopping the task (via UI) now does not kill the pod, and clearing the task (via UI) causes a reattach when we would ideally like a restart.
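The workaround and its limitation can be sketched without Airflow as follows. `FakeTaskInstance`, `handle_sigterm`, and the `worker_killed` key are hypothetical names for illustration. The `is_eligible_to_retry` function here is a standalone approximation of the modified check, not the real method.

```python
WORKER_KILLED_KEY = "worker_killed"  # hypothetical XCom key


class FakeTaskInstance:
    """Hypothetical stand-in for a task instance with an XCom store."""

    def __init__(self, retries):
        self.retries = retries
        self.try_number = 1
        self.pod_running = True
        self._xcom = {}

    def xcom_push(self, key, value):
        self._xcom[key] = value

    def xcom_pull(self, key):
        return self._xcom.get(key)


def handle_sigterm(ti):
    # Workaround: skip on_kill so the pod survives, and leave a marker.
    # The handler cannot tell whether the signal came from a dying worker
    # or from a user stopping or clearing the task in the UI.
    ti.xcom_push(WORKER_KILLED_KEY, True)


def is_eligible_to_retry(ti):
    # Retry only if retries remain AND the marker says the worker was killed.
    has_retries_left = ti.try_number <= ti.retries
    return has_retries_left and ti.xcom_pull(WORKER_KILLED_KEY) is True
```

A pod failure that never triggers the handler leaves no marker, so it is not retried. Any SIGTERM, whatever its cause, leaves the pod running and makes the task retryable, which is the source of the UI problem described above.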

Use case/motivation

Since the pod itself may fail for a valid reason, we don't just want to add more retries. In that situation, a retry would not reattach anyway; it would start a completely new pod, since the original pod would have been cleaned up.

We specifically want the reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes. That is currently quite a disruptive process because all the running pods are first killed, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them (and potentially lose all the progress on any expensive operations that were running pre-deployment).

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

yeachan153 added the kind:feature label Mar 1, 2022
@potiuk
Member

potiuk commented Mar 6, 2022

Do you have a proposal to change the behaviour? Opening a PR for that would be useful. Airflow has ~2000 contributors, so you can become one of them. How do you think it can be improved?

@wircho

wircho commented Mar 22, 2024

@yeachan153 Did you ever solve this problem? We would love to be able to keep pods running during environment restarts, and it looks like your idea might work.

@paramjeet01

@wircho Increasing the termination_grace_period should help to mitigate this issue.

@wenceslas-sanchez

Did you find a solution to this issue using KubernetesPodOperator parameters?
