Find a solution to handle the execution timeout in Triggers #32638
Comments
@hussein-awala - are you working on this? If not, @vandonr-amz could look into it during this week.
@shubham22 yes, I'm working on this. I already tried different approaches such as using
But triggers already have a timeout mechanism, couldn't we reuse that logic to enforce the task timeout as well?
The current mechanism is not a time-based timeout implementation; it's based on an extra parameter. Additionally, the trigger could be adopted by another triggerer if we lose the one that runs it. Since the consumed attempts are not synchronized with the metadata, it restarts from the beginning with each run. What I'm trying to do is:
What are the benefits of this mechanism?
What do you think about this solution?
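The proposal above hinges on not losing progress when a trigger is re-adopted. A minimal standalone sketch of that idea (a plain dict stands in for the metadata database, and `start_or_resume` is a hypothetical helper, not an Airflow API):

```python
# Hypothetical sketch: persist the trigger's start time in the metadata
# DB (a dict stands in for it here) so that a trigger re-adopted by
# another triggerer resumes the same deadline instead of restarting its
# timeout from zero with each run.
metadata_db = {}

def start_or_resume(trigger_id: str, timeout_s: float, now: float) -> float:
    """Return the remaining timeout budget for this trigger."""
    started = metadata_db.setdefault(trigger_id, now)  # persisted once
    return max(timeout_s - (now - started), 0.0)

# First triggerer starts the trigger with a 60s budget.
r1 = start_or_resume("t1", 60.0, now=1000.0)   # full 60s budget
# That triggerer dies; 45s later another triggerer adopts the trigger.
r2 = start_or_resume("t1", 60.0, now=1045.0)   # only 15s remain
```

With the attempt's start time synchronized to the metadata, the second triggerer sees 15 seconds left rather than a fresh 60.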
Ha yes indeed, I was mistaken. I was thinking about airflow/airflow/models/baseoperator.py, line 1577 (commit 12b0b6b), which kind of is a timeout for triggers, except it's defined when we defer and not on the trigger object. Not too sure how that's handled underneath.
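The "defined when we defer" point can be illustrated with a standalone dataclass that mimics the shape of Airflow's `TaskDeferred` exception (this is a sketch for illustration, not the real class):

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Optional

# Sketch of the point above: when an operator defers, the deferral
# carries an optional timeout, i.e. the timeout is supplied per
# deferral rather than stored on the trigger object itself.
@dataclass
class TaskDeferredSketch:
    trigger: Any
    method_name: str
    timeout: Optional[timedelta] = None  # per-deferral, not per-trigger

deferral = TaskDeferredSketch(
    trigger="SomeTrigger()",          # placeholder, not a real trigger
    method_name="execute_complete",
    timeout=timedelta(seconds=60),
)
```

So the same trigger class can be deferred to with different timeouts, and a trigger instance on its own has no knowledge of the budget it is running under.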
Yes, even with this mechanism we need to add the timeout explicitly, and it's different from the task timeout in normal mode. Currently we have this method: airflow/airflow/jobs/scheduler_job_runner.py, lines 1653 to 1670 (commit a2ae226),
which changes the state of the tasks to scheduled and changes the next method to __fail__.
The value comes from airflow/airflow/models/taskinstance.py, lines 1692 to 1700 (commit a2ae226).
And the trigger is canceled by the triggerer when it detects that the task is no longer in the deferred state. I'm trying to reuse this mechanism and improve it by implementing what I explained before.
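The scheduler-side check described above can be sketched in isolation like this (field names mirror Airflow's, but this is an illustration, not the real scheduler code):

```python
# Standalone sketch of the scheduler check described above: deferred
# task instances whose trigger_timeout has passed are moved back to
# "scheduled" with next_method set to "__fail__", so they fail when a
# worker resumes them; the triggerer then cancels the orphaned trigger
# because the task is no longer deferred.
def fail_timed_out_deferrals(task_instances, now):
    for ti in task_instances:
        if ti["state"] == "deferred" and ti["trigger_timeout"] <= now:
            ti["state"] = "scheduled"
            ti["next_method"] = "__fail__"
    return task_instances

tis = [
    {"id": 1, "state": "deferred", "trigger_timeout": 100.0},
    {"id": 2, "state": "deferred", "trigger_timeout": 300.0},
]
fail_timed_out_deferrals(tis, now=200.0)
```

Only the first task instance has passed its deadline at `now=200.0`, so it is rescheduled to fail while the second stays deferred.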
@hussein-awala - This may be somewhat related to what you're working on, but there are a few Async Sensors that DO respect timeout. Take these two:
These are two Sensors with the exact same configuration params; the only difference is that one is the Async sensor that is supposed to be a drop-in replacement for the Sync sensor. When you let this run, looking for a table that it doesn't find, the Sync sensor times out as it should after 60 seconds. The Async sensor retries 3 times and ends up waiting a total of (60 seconds * 4 attempts) + (retry wait interval) * 3. That is not expected. We'd want that
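The arithmetic from the report above can be made concrete. The 300-second `retry_delay` below is an assumption for illustration (it matches Airflow's default of 5 minutes); the 60-second timeout and 3 retries come from the example:

```python
# Reproducing the observed behavior of the async sensor: the timeout
# restarts on every attempt instead of bounding the whole run.
timeout_s = 60
retries = 3
attempts = retries + 1      # initial try plus 3 retries
retry_delay_s = 300         # assumed: Airflow's default retry_delay

# What the reporter observed for the async sensor:
total_wait = timeout_s * attempts + retry_delay_s * retries

# What a drop-in replacement for the sync sensor should do instead:
expected_wait = timeout_s   # time out once, 60s after the first try
```

With these numbers the async sensor occupies 1140 seconds before finally failing, versus the 60 seconds the sync sensor takes.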
This is the main issue I'm trying to fix in my PR, which is almost ready. However, I'm trying to find a solution for providers which are running with an older version of Airflow. The issue is that currently we raise
Thank you so much for working on this @hussein-awala! I agree with the problem statement - that
Is this actually true? I tried this locally, and it seems that execution_timeout is respected even when the task is in a deferred state. Is the actual problem just that, with sensor tasks, there are retries when there shouldn't be? By the way, if the user doesn't want retries, why doesn't the user just set retries to 0?
I guess the issue is you want retries if it fails for some reason other than the sensor running out of time...
No, after diving into the code, I found that the problem is with the callback of the execution timeout, and with the sensor timeout, which is completely ignored.
Exactly, the sensor timeout is applied to the overall time and not only to the retry duration.
Body
Currently, when we run tasks/sensors in deferrable mode, the execution timeout is ignored: it is a TaskInstance property, and the Trigger doesn't use or handle it.
IMO we should update the Trigger execution logic to take this parameter into account, and stop the execution once the timeout is reached.
To achieve that, we need:
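One possible way to enforce such a timeout inside trigger execution is to bound the trigger's async work with `asyncio.wait_for`. The sketch below is an illustration under that assumption, not Airflow's actual implementation; `poll_forever` is a hypothetical stand-in for a trigger's polling loop:

```python
import asyncio

async def poll_forever():
    # Stand-in for a trigger's polling loop (e.g. a sensor that never
    # finds what it is waiting for).
    while True:
        await asyncio.sleep(0.01)

async def run_with_timeout(timeout_s: float) -> str:
    # Wrap the trigger's work in asyncio.wait_for so execution stops
    # once the time budget is exhausted, regardless of retries.
    try:
        await asyncio.wait_for(poll_forever(), timeout=timeout_s)
        return "success"
    except asyncio.TimeoutError:
        return "timed_out"

result = asyncio.run(run_with_timeout(0.05))
```

Here the polling coroutine never completes on its own, so `wait_for` cancels it after 0.05 seconds and the run ends as `"timed_out"`.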
related: #32580