Description
The kubernetes_pod_operator currently has a reattach_on_restart parameter that attempts to reattach to a running pod, instead of creating a new one, if the scheduler dies while the task is running.
We would like this feature to also work when the worker dies. Currently, a dying worker receives a SIGTERM, which triggers the on_kill method (airflow/airflow/models/taskinstance.py, line 1425 at ace8c6e). This ends up deleting the pod that was created (airflow/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py, line 438 at ace8c6e).
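The failure mode can be sketched with the standard library alone. This is purely illustrative, not Airflow code: FakePodOperator and the handler below stand in for the real operator and the worker's SIGTERM handling, assuming the behaviour described above (SIGTERM leads to on_kill, which deletes the pod).

```python
import signal


class FakePodOperator:
    """Illustrative stand-in for the pod operator; pod deletion is simulated."""

    def __init__(self):
        self.pod_deleted = False

    def on_kill(self):
        # Mirrors the behaviour described above: on_kill cleans up the pod.
        self.pod_deleted = True


op = FakePodOperator()


def handler(signum, frame):
    # Simplified stand-in for the worker's SIGTERM handling: it invokes
    # on_kill(), deleting the pod even though the task itself is healthy.
    op.on_kill()


signal.signal(signal.SIGTERM, handler)
signal.raise_signal(signal.SIGTERM)

assert op.pod_deleted  # after a worker SIGTERM, the running pod is gone
```

The point of the sketch is that the pod's fate is tied to the worker process: any SIGTERM to the worker, for whatever reason, takes the pod down with it.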
We currently worked around this problem by removing the on_kill call upon receiving a SIGTERM and pushing an XCom indicating that the worker was killed. We then enabled retries for the kubernetes_pod_operator and modified the is_eligible_to_retry function to check for the presence of this XCom, retrying only when it is found, i.e. only when the worker was killed.
Unfortunately, this is not a perfect solution, because clearing or stopping a task via the UI triggers the same signal handler as an externally killed worker. With this workaround, stopping a task via the UI therefore no longer kills the pod, and clearing a task via the UI causes a reattach where we would ideally like a restart.
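The workaround can be sketched as follows. Everything here is a hypothetical, stdlib-only simulation of the idea described above: WORKER_KILLED_KEY, FakeXCom, and PatchedPodOperator are invented names, not Airflow API.

```python
WORKER_KILLED_KEY = "worker_killed"  # assumed XCom key, illustrative only


class FakeXCom:
    """Minimal stand-in for Airflow's XCom store."""

    def __init__(self):
        self._store = {}

    def push(self, key, value):
        self._store[key] = value

    def pull(self, key):
        return self._store.get(key)


class PatchedPodOperator:
    """Simulates the modified operator: SIGTERM no longer calls on_kill;
    instead it records that the worker (not the task) was killed."""

    def __init__(self, xcom):
        self.xcom = xcom
        self.pod_deleted = False

    def on_sigterm(self):
        # Workaround: do NOT call on_kill() here, so the pod survives,
        # and leave a marker saying the worker died.
        self.xcom.push(WORKER_KILLED_KEY, True)

    def on_kill(self):
        self.pod_deleted = True


def is_eligible_to_retry(xcom):
    # Modified check: only retry when the worker-killed marker is present.
    return bool(xcom.pull(WORKER_KILLED_KEY))


xcom = FakeXCom()
op = PatchedPodOperator(xcom)
op.on_sigterm()                    # worker dies: pod is kept, marker pushed
assert not op.pod_deleted
assert is_eligible_to_retry(xcom)  # retry allowed only in this case
```

The sketch also makes the flaw visible: stopping a task from the UI sends the same SIGTERM, so on_sigterm runs there too and the pod is (wrongly) left alive in that case as well.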
Use case/motivation
Since the pod itself may fail for a valid reason, we don't want to simply add more retries. In that case the operator would also not reattach but start a completely new pod, since the original pod would have been cleaned up.
We specifically want reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes. That is currently quite a disruptive process: all the running pods are first killed, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them again, potentially losing all progress on expensive operations that were running pre-deployment.
Do you have a proposal to change the behaviour? Opening a PR for that would be useful; Airflow has ~2000 contributors, so you could become one of them. How do you think it can be improved?
@yeachan153 Did you ever solve this problem? We would love to be able to keep pods running during environment restarts, and it looks like your idea might work.