deferred tasks get killed during heartbeat callback in some rare cases #40435
Comments
I don't think a task in the deferred state should be counted as running. Once a task is deferred, it should exit, as tasks do for other terminal states like success and failed. For the LocalTaskJobRunner to produce the message …
The reporter also mentioned in another comment that their scheduler may not be well resourced. However, this statement cannot be confirmed without an actual resource-utilization graph.
Regardless, it would be beneficial to see what happened to the task after it was marked as deferred. The task log should tell us the exact timestamp. Comparing that with when the LocalTaskJobRunner terminated the task would at least give us a clearer view of the task/process lifecycle. I also think the full task log would help the investigation, as the return code may provide additional context. I suspect the issue has something to do with the StandardTaskRunner failing to exit for some reason. Considering deferred a "running" state would probably let the LocalTaskJobRunner and StandardTaskRunner run indefinitely.
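A minimal sketch (not the actual Airflow source) of the kind of check discussed here: the heartbeat callback reloads the task instance from the metadata database and terminates the local runner whenever the state is no longer RUNNING, which is how an externally set deferred state can get the process killed. The function name and arguments below are illustrative only.

```python
from airflow.utils.state import TaskInstanceState

def heartbeat_callback_sketch(ti, task_runner, log):
    """Illustrative sketch only -- not the real LocalTaskJobRunner code."""
    ti.refresh_from_db()  # pick up state changes made by the scheduler/triggerer
    if ti.state == TaskInstanceState.RUNNING:
        return  # the task is still considered running; keep heartbeating
    # any other state is treated as externally changed, so the runner is terminated
    log.warning(
        "State of this instance has been externally set to %s. Terminating instance.",
        ti.state,
    )
    task_runner.terminate()
```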
Logs of a failed task:
Logs of a good task (same task after retry):
@wolfier thanks for the detailed answer. I've attached the logs, I hope they help. For context, our deferred tasks submit jobs to AWS Batch and wait for completion. I don't suspect the submission would block for more than 300 seconds / scheduler_zombie_task_threshold, but I know for a fact this issue happens when a lot of these deferred tasks start at the same time.
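For context, a minimal sketch of the kind of task described above: a BatchOperator run in deferrable mode, which submits the job and then waits for completion via the triggerer. The DAG id, job name, queue, and job definition are placeholders, not taken from the report.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator

with DAG(dag_id="example_batch_deferrable", start_date=datetime(2024, 1, 1), schedule=None):
    submit_batch_job = BatchOperator(
        task_id="submit_batch_job",
        job_name="example-job",            # placeholder
        job_queue="example-queue",         # placeholder
        job_definition="example-job-def",  # placeholder
        deferrable=True,  # submit the job, then defer until AWS Batch reports completion
    )
```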
I'll start with the successful task / trigger execution. The task exited right away with the return code. The trigger fired as expected after the conditions were met.
The failed task instance did not exit and therefore does not have a return code. Given the logs, the reported behaviour is expected.
Surprisingly, the exit code was 100. We know that two consecutive calls of heartbeat_callback saw the … I suspect one of the following happened:
@wolfier does each deferred task start a new process? It wouldn't surprise me, if there are a lot of deferred tasks starting/running at the same time, that the overhead of each process causes some of them to be killed because of OOM / lack of memory. Does the scheduler throttle / limit the number of deferred tasks that are running at the same time?
That's correct. Each task execution will spawn a new process. A deferrable operator task will be executed twice: once to submit the trigger to the triggerer and another time to process the trigger event. The process should be fairly short-running though, as you can see with the successful attempt. I am not sure how MWAA is set up, but the triggerer and worker should be running in two different execution spaces. This means the number of concurrent triggers should not affect the worker.
The scheduler does not, but the triggerer does have a limit on the number of triggers per triggerer.
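A rough sketch of the two-step execution described above, assuming the standard deferrable-operator pattern: execute() hands off to the triggerer via defer(), and a second worker process later runs the method named in method_name once the trigger fires. The operator and trigger below are illustrative, not the BatchOperator internals.

```python
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger

class ExampleDeferrableOperator(BaseOperator):
    def execute(self, context):
        # first process: start the external work, then hand off to the triggerer and exit
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=5)),  # placeholder trigger
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # second process: runs only after the trigger has fired
        self.log.info("Trigger event received: %s", event)
```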
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.1 (MWAA)
What happened?
My Airflow deferred (AWS) BatchOperator tasks occasionally fail. When this happens I can see this:
I actually think it's an oversight in the code for local_task_job_runner. This line should check for the state being RUNNING or DEFERRED.
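A hedged sketch of the suggested change (not a tested patch): treat DEFERRED like RUNNING when the heartbeat callback decides whether the task's state was changed externally, so a freshly deferred task is not terminated before it can exit on its own. The helper name and state set are illustrative.

```python
from airflow.utils.state import TaskInstanceState

# states in which the local process is still expected to be alive (illustrative)
STATES_OWNED_BY_LOCAL_RUNNER = {
    TaskInstanceState.RUNNING,
    TaskInstanceState.DEFERRED,
}

def state_changed_externally(ti) -> bool:
    # terminate the local process only when the state is neither RUNNING nor DEFERRED
    return ti.state not in STATES_OWNED_BY_LOCAL_RUNNER
```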
What you think should happen instead?
No response
How to reproduce
This is hard to reproduce. The issue is very transient.
Operating System
MWAA
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.16.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
MWAA
Anything else?
I would say it happens one out of 100 runs.
Are you willing to submit PR?
Code of Conduct