Description
The kubernetes_pod_operator currently has a reattach_on_restart parameter that attempts to reattach to a running pod, instead of creating a new one, if the scheduler dies while the task is running.
We would like this feature to also work when the worker dies. Currently, a dying worker receives a SIGTERM, which triggers the on_kill method (airflow/airflow/models/taskinstance.py, line 1425 at ace8c6e). This ends up deleting the pod that was created (airflow/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py, line 438 at ace8c6e).
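The failure mode can be sketched with the standard library alone. This is purely illustrative, not Airflow code: FakePodOperator and the handler below stand in for the real operator and the worker's SIGTERM handling, assuming the behaviour described above (SIGTERM leads to on_kill, which deletes the pod).

```python
import signal


class FakePodOperator:
    """Illustrative stand-in for the pod operator; pod deletion is simulated."""

    def __init__(self):
        self.pod_deleted = False

    def on_kill(self):
        # Mirrors the behaviour described above: on_kill cleans up the pod.
        self.pod_deleted = True


op = FakePodOperator()


def handler(signum, frame):
    # Simplified stand-in for the worker's SIGTERM handling: it invokes
    # on_kill(), deleting the pod even though the task itself is healthy.
    op.on_kill()


signal.signal(signal.SIGTERM, handler)
signal.raise_signal(signal.SIGTERM)

assert op.pod_deleted  # after a worker SIGTERM, the running pod is gone
```

The point of the sketch is that the pod's fate is tied to the worker process: any SIGTERM to the worker, for whatever reason, takes the pod down with it.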
We currently worked around this problem by removing the on_kill call upon receiving a SIGTERM and pushing an XCom indicating that the worker was killed. We then enabled retries for the kubernetes_pod_operator and modified the is_eligible_to_retry function to check for the presence of this XCom, retrying only when it is found, i.e. only when the worker was killed.
Unfortunately, this is not a perfect solution, because clearing or stopping a task via the UI triggers the same signal handler as an externally killed worker. With this workaround, stopping a task via the UI therefore no longer kills the pod, and clearing a task via the UI causes a reattach where we would ideally like a restart.
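The workaround can be sketched as follows. Everything here is a hypothetical, stdlib-only simulation of the idea described above: WORKER_KILLED_KEY, FakeXCom, and PatchedPodOperator are invented names, not Airflow API.

```python
WORKER_KILLED_KEY = "worker_killed"  # assumed XCom key, illustrative only


class FakeXCom:
    """Minimal stand-in for Airflow's XCom store."""

    def __init__(self):
        self._store = {}

    def push(self, key, value):
        self._store[key] = value

    def pull(self, key):
        return self._store.get(key)


class PatchedPodOperator:
    """Simulates the modified operator: SIGTERM no longer calls on_kill;
    instead it records that the worker (not the task) was killed."""

    def __init__(self, xcom):
        self.xcom = xcom
        self.pod_deleted = False

    def on_sigterm(self):
        # Workaround: do NOT call on_kill() here, so the pod survives,
        # and leave a marker saying the worker died.
        self.xcom.push(WORKER_KILLED_KEY, True)

    def on_kill(self):
        self.pod_deleted = True


def is_eligible_to_retry(xcom):
    # Modified check: only retry when the worker-killed marker is present.
    return bool(xcom.pull(WORKER_KILLED_KEY))


xcom = FakeXCom()
op = PatchedPodOperator(xcom)
op.on_sigterm()                    # worker dies: pod is kept, marker pushed
assert not op.pod_deleted
assert is_eligible_to_retry(xcom)  # retry allowed only in this case
```

The sketch also makes the flaw visible: stopping a task from the UI sends the same SIGTERM, so on_sigterm runs there too and the pod is (wrongly) left alive in that case as well.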
Use case/motivation
Since the pod itself may fail for a valid reason, we don't want to simply add more retries. In that case the operator would also not reattach but start a completely new pod, since the original pod would have been cleaned up.
We specifically want reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes. That is currently quite a disruptive process: all the running pods are first killed, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them again, potentially losing all progress on expensive operations that were running pre-deployment.
Do you have a proposal to change the behaviour? Opening a PR for that would be useful; Airflow has ~2000 contributors, so you could become one of them. How do you think it can be improved?
@yeachan153 Did you ever solve this problem? We would love to be able to keep pods running during environment restarts, and it looks like your idea might work.