fix: resubmit task when agent tries to acquire a task in Retried #762

aneojgurhem · 2024-09-24T10:17:30Z

Motivation

Fix the issue in which retried tasks remain in Creating and are never processed further, which prevents the application from running to completion.

To my understanding, during downscaling, preemption or a crash (including when the worker crashes too), the agent can be stopped abruptly while submitting the retried task (task creation, task finalization or insertion into the queue). We also observe that the agent tries to acquire the initial task, which has status Retried, instead of the new copy of the task (the new attempt). The retried task is still in the Creating status, showing that its finalization was not properly done. This means that the initial task's finalization did not complete properly.

Description

To mitigate this issue, we implement a recovery mechanism: when an agent tries to acquire the initial task, it properly finalizes the retried task before removing the initial task from the queue, instead of only removing the initial task from the queue.

When the agent acquires the initial task with status Retried, we check whether the retried task has been properly finalized. To do so, we fetch the metadata of the retried task. If we can retrieve it and the retried task is in Creating or Submitted, we perform task finalization and insertion into the queue again. If the task was already inserted into the queue, task deduplication should do its job and ignore the duplicate. If the retried task is in another status, we only remove the message from the queue. If the retried task is not found, we submit the retried task completely (creation, finalization, queueing). If the retried task turns out to have been created between our read and our creation in the database, we check the status of the retried task and perform finalization if it is in Creating or Submitted.
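The decision rules above can be summarized in a small sketch. It is written in Python for brevity and is not the C# code of this PR: the names `TaskStatus`, `Action`, `TaskAlreadyExists`, `recovery_action` and `recover` are hypothetical, and the status list is deliberately incomplete. In every branch, the message of the initial task is removed from the queue afterwards.

```python
from enum import Enum, auto
from typing import Callable, Optional


class TaskStatus(Enum):
    # Only the statuses needed for the illustration (hypothetical subset).
    CREATING = auto()
    SUBMITTED = auto()
    DISPATCHED = auto()
    COMPLETED = auto()


class Action(Enum):
    SUBMIT_FROM_SCRATCH = auto()   # creation, finalization, queueing
    FINALIZE_AND_ENQUEUE = auto()  # finalization and insertion into the queue
    DROP_MESSAGE = auto()          # nothing to recover, only remove the message


class TaskAlreadyExists(Exception):
    """Hypothetical stand-in for a 'task already exists' database error."""


def recovery_action(retry_status: Optional[TaskStatus]) -> Action:
    """Pick the recovery step from the status of the retried task.

    retry_status is None when the retried task is not found.
    """
    if retry_status is None:
        return Action.SUBMIT_FROM_SCRATCH
    if retry_status in (TaskStatus.CREATING, TaskStatus.SUBMITTED):
        return Action.FINALIZE_AND_ENQUEUE
    return Action.DROP_MESSAGE


def recover(read_status: Callable[[], Optional[TaskStatus]],
            create_task: Callable[[], None]) -> Action:
    """Apply the rules, including the race between our read and our creation."""
    action = recovery_action(read_status())
    if action is Action.SUBMIT_FROM_SCRATCH:
        try:
            create_task()
        except TaskAlreadyExists:
            # The retried task appeared between our read and our creation:
            # read its status again and apply the same rules.
            action = recovery_action(read_status())
    return action
```

For example, when the first read finds nothing and the creation then fails because the task already exists in Creating, `recover` ends up choosing finalization and queueing rather than a full resubmission.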

Testing

I was not able to fully reproduce this issue in the Core docker deployment, even when stopping agents abruptly with the following script. I used a modified bench worker that only produced errors.

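#!/bin/sh

# Repeatedly restart every polling agent container with no grace period.
for i in $(seq 1 40); do
    for a in $(docker ps -q --filter name=armonik.compute.pollingagent); do
        # SIGTERM with a 0 s timeout: the container is killed almost immediately
        docker restart -s sigterm -t 0 "$a"
        # sleep 1
    done
done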

I also tried varying the delay between restarts and still was not able to reproduce the issue.

I added unit tests that put tasks in the same state that we observed. They also validate that we are able to recover from the invalid state and resubmit the retried task.

Impact

Acquiring a task in Retried status is now more complex and makes more calls to the database, which reduces performance but improves recovery on failure.
In the case described here, the retried task may sometimes be inserted twice into the queue; the deduplication mechanism then removes the duplicate from the queue.
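To illustrate why a double insertion is harmless when duplicates are filtered, here is a minimal, generic sketch in Python. It does not describe how the project's deduplication mechanism actually works, which this PR does not detail; the function `drain` and the task ids are hypothetical.

```python
from collections import deque
from typing import Deque, List, Set


def drain(queue: Deque[str], acquired: Set[str]) -> List[str]:
    """Pop every message and keep a task id only the first time it is seen."""
    processed = []
    while queue:
        task_id = queue.popleft()
        if task_id in acquired:
            # Duplicate message for a task that was already taken: drop it.
            continue
        acquired.add(task_id)
        processed.append(task_id)
    return processed


# The retried task was enqueued twice by the recovery path.
messages = deque(["retry-1", "retry-1", "other-task"])
result = drain(messages, set())
```

With this kind of filtering, the second message for the same task is discarded and the task is handled once, at the cost of one extra queue message.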

Checklist

My code adheres to the coding and style guidelines of the project.
I have performed a self-review of my code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
I have thoroughly tested my modifications and added tests when necessary.
Tests pass locally and in the CI.
I have assessed the performance impact of my modifications.

Adaptors/MongoDB/src/TaskTable.cs

Common/src/Exceptions/TaskAlreadyExistsException.cs

Common/src/Pollster/TaskHandler.cs

fix: resubmit task when agent tries to acquire a task in Retried

22c9f42

aneojgurhem requested review from ngruelaneo, lemaitre-aneo and qdelamea-aneo September 24, 2024 10:17

aneojgurhem self-assigned this Sep 24, 2024

lemaitre-aneo reviewed Sep 24, 2024

Adaptors/MongoDB/src/TaskTable.cs (outdated, resolved)

Common/src/Exceptions/TaskAlreadyExistsException.cs (resolved)

Common/src/Pollster/TaskHandler.cs (outdated, resolved)

aneojgurhem added 2 commits September 25, 2024 10:36

fix: more narrow catch for task already existing in mongo

8daca5c

refactor: factorize some code

31e0a8e

lemaitre-aneo reviewed Sep 25, 2024

Common/src/Pollster/TaskHandler.cs (outdated, resolved)

Common/src/Exceptions/TaskAlreadyExistsException.cs (resolved)

refactor: use switch expression

934ca45

lemaitre-aneo approved these changes Sep 25, 2024

aneojgurhem mentioned this pull request Sep 27, 2024

docs: add PR template #764

Merged

7 tasks

aneojgurhem merged commit b074b61 into main Sep 27, 2024
111 checks passed

aneojgurhem deleted the jg/fixcreating branch September 27, 2024 09:28
