fix: resubmit task when agent tries to acquire a task in Retried #762
Motivation
Fix the issue in which retried tasks remain in Creating and are not processed further, which prevents the application from running to completion.
To my understanding, the agent can be stopped abruptly (during downscaling, preemption, or a crash, including when the worker crashes too) while it is submitting the retried task, that is, during task creation, task finalization, or insertion into the queue. We also observe that the agent tries to acquire the initial task, which has status Retried, instead of the new copy of the task (the new attempt). The retried task is only in the Creating status, which shows that its finalization was not properly done. This means that the initial task's finalization did not complete properly.
Description
To mitigate this issue, we implement a recovery mechanism: when an agent tries to acquire the initial task, it properly finalizes the retried task before removing the initial task from the queue, instead of only removing the initial task from the queue.
When the agent acquires the initial task with status Retried, we check whether the retried task has been properly finalized. To do so, we fetch the metadata of the retried task. If the metadata can be retrieved and the retried task is in Creating or Submitted, we perform task finalization and insertion into the queue again. If the task was already inserted into the queue, task deduplication should do its job and ignore the duplicate. If the retried task is in any other status, we only remove the message from the queue. If the retried task is not found, we submit it completely (creation, finalization, queueing). If the retried task was created by someone else between our read and our creation in the database, we check its status and perform finalization if it is Creating or Submitted.
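The decision described above can be summarized in a short sketch. This is only an illustration in Python, not the code from this PR: the names `TaskStatus`, `RecoveryAction` and `recovery_action` are hypothetical, and the statuses other than Creating, Submitted and Retried are placeholders for "any other status".

```python
from enum import Enum, auto
from typing import Optional


class TaskStatus(Enum):
    # Only Creating, Submitted and Retried come from the description;
    # the remaining members are placeholders for "another status".
    CREATING = auto()
    SUBMITTED = auto()
    RETRIED = auto()
    PROCESSING = auto()
    COMPLETED = auto()


class RecoveryAction(Enum):
    # Hypothetical names for the three outcomes listed in the description.
    SUBMIT_COMPLETELY = auto()     # creation, finalization, queueing
    FINALIZE_AND_ENQUEUE = auto()  # finalization and insertion in the queue again
    REMOVE_MESSAGE_ONLY = auto()   # nothing to repair, just drop the message


def recovery_action(retried_status: Optional[TaskStatus]) -> RecoveryAction:
    """Pick what to do with the retried task when the initial task
    (status Retried) is acquired from the queue.

    `retried_status` is None when the retried task's metadata is not found.
    """
    if retried_status is None:
        return RecoveryAction.SUBMIT_COMPLETELY
    if retried_status in (TaskStatus.CREATING, TaskStatus.SUBMITTED):
        # Re-enqueueing is safe only because queue deduplication is
        # expected to ignore a message that was already inserted.
        return RecoveryAction.FINALIZE_AND_ENQUEUE
    return RecoveryAction.REMOVE_MESSAGE_ONLY
```

In the race case, where the creation fails because the retried task already exists, the same function would simply be applied again to the freshly read status.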
Testing
I was not able to fully reproduce this issue in a Core docker deployment, even when stopping agents abruptly with the following script. I used a modified bench worker that only produced errors.
I also tried varying the delay and was still not able to reproduce the issue.
I added unit tests that put tasks in the same state that we observed. They also validate that we are able to recover from the invalid state and resubmit the retried task.
Impact
Checklist