fix: resubmit task when agent tries to acquire a task in Retried #762

Merged (4 commits) on Sep 27, 2024

@aneojgurhem aneojgurhem commented Sep 24, 2024

Motivation

Fix the issue in which retried tasks remain in Creating and are not processed further, preventing the application from running to completion.

To my understanding, during downscaling, preemption, or a crash (including when the worker crashes too), the agent can be stopped abruptly while it is submitting the retried task (task creation, task finalization, or insertion into the queue). We also observe that the agent tries to acquire the initial task, whose status is Retried, instead of the new copy of the task (the new attempt). The retried task is still in the Creating status, which shows that its finalization was not done properly. This means that the finalization of the initial task did not complete properly.
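To make the suspected failure mode concrete, here is a minimal Python sketch of a three-step resubmission that is interrupted right after the creation step. It is not ArmoniK's C# code: the names `resubmit`, `AgentStopped`, and the task IDs are hypothetical, and the assumption that the initial task is marked Retried before the new attempt is created is only inferred from the observed state.

```python
from enum import Enum


class Status(Enum):
    CREATING = "Creating"
    SUBMITTED = "Submitted"
    RETRIED = "Retried"


class AgentStopped(Exception):
    """Stands in for the agent being killed in the middle of a resubmission."""


def resubmit(tasks, queue, initial_id, retry_id, stop_after=None):
    """Hypothetical three-step resubmission: create, finalize, enqueue."""
    # Assumption: the initial task is already marked Retried at this point.
    tasks[initial_id] = Status.RETRIED
    steps = [
        ("create", lambda: tasks.__setitem__(retry_id, Status.CREATING)),
        ("finalize", lambda: tasks.__setitem__(retry_id, Status.SUBMITTED)),
        ("enqueue", lambda: queue.append(retry_id)),
    ]
    for name, step in steps:
        step()
        if name == stop_after:
            # Simulate an abrupt stop right after this step completed.
            raise AgentStopped(name)


tasks = {}
queue = ["task-1"]  # the message of the initial task is still in the queue
try:
    resubmit(tasks, queue, "task-1", "task-1-retry", stop_after="create")
except AgentStopped:
    pass

print(tasks["task-1"].value, tasks["task-1-retry"].value, queue)
```

After the simulated stop, the initial task is Retried, the new attempt is stuck in Creating, and only the initial task's message remains in the queue, which matches the state described above.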

Description

To mitigate this issue, we implement a recovery mechanism: when an agent tries to acquire the initial task, it properly finalizes the retried task before removing the initial task from the queue, instead of only removing the initial task from the queue.

When the agent acquires the initial task with status Retried, we check whether the retried task has been properly finalized. To do so, we get the metadata of the retried task and handle each case as follows:

  • If we can retrieve the metadata and the retried task is in Creating or Submitted, we perform task finalization and insertion into the queue again. If the task was already inserted into the queue, task deduplication should do its job and ignore the duplicate.
  • If the retried task is in another status, we remove the message from the queue.
  • If the retried task is not found, we submit it completely (creation, finalization, queueing). If the retried task was created between our read and our own creation in the database, we check its status and perform finalization if it is Creating or Submitted.
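The decision logic described above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the C# code from this PR: `Store`, `AlreadyExists`, and `recover_retry` are hypothetical names, the task table is replaced by an in-memory dictionary, and the queue is a plain list.

```python
from enum import Enum


class Status(Enum):
    CREATING = "Creating"
    SUBMITTED = "Submitted"
    COMPLETED = "Completed"


class AlreadyExists(Exception):
    """Raised when the retried task was created by someone else first."""


class Store:
    """Hypothetical in-memory stand-in for the task table."""

    def __init__(self):
        self.tasks = {}

    def get(self, task_id):
        return self.tasks.get(task_id)

    def create(self, task_id):
        if task_id in self.tasks:
            raise AlreadyExists(task_id)
        self.tasks[task_id] = Status.CREATING

    def finalize(self, task_id):
        self.tasks[task_id] = Status.SUBMITTED


def recover_retry(store, queue, retry_id):
    """Return True if the retried task was (re)finalized and enqueued."""
    status = store.get(retry_id)
    if status is None:
        try:
            # Not found: start the full submission with the creation step.
            store.create(retry_id)
            status = Status.CREATING
        except AlreadyExists:
            # Created between our read and our creation: re-read its status.
            status = store.get(retry_id)
    if status in (Status.CREATING, Status.SUBMITTED):
        store.finalize(retry_id)
        queue.append(retry_id)  # a possible duplicate is left to deduplication
        return True
    return False  # any other status: nothing to resubmit


store, queue = Store(), []
store.tasks["done"] = Status.COMPLETED
store.tasks["stuck"] = Status.CREATING
results = [recover_retry(store, queue, t) for t in ("missing", "stuck", "done")]
print(results, queue)
```

In every branch of this sketch, the caller is assumed to remove the initial task's message from the queue afterwards; only the first two cases resubmit the retried task.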

Testing

I was not able to fully reproduce this issue in the Core Docker deployment, even when stopping agents abruptly with the following script. I used a modified bench worker that only produced errors.

#!/bin/sh
# Repeatedly restart every polling agent container with SIGTERM and no grace period.

for i in $(seq 1 40); do
    for a in $(docker ps -q --filter name=armonik.compute.pollingagent); do
        docker restart -s sigterm -t 0 "$a"
        # sleep 1
    done
done

I also tried varying the delay and still was not able to reproduce the issue.

I added unit tests that put tasks in the same state that we observed. They also validate that we are able to recover from the invalid state and resubmit the retried task.

Impact

  • Acquisition of a task in Retried status is now more complex and makes more calls to the database, reducing performance while improving recovery on failure.
  • Sometimes, in the case described here, the retry task may be inserted twice into the queue, relying on the deduplication mechanism to remove the duplicate from the queue.
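The reliance on deduplication can be illustrated with a small Python sketch. How ArmoniK actually deduplicates messages is not described in this PR, so `DedupQueue` is a hypothetical model of one possible behavior: a second insertion of a task ID that is still pending is ignored.

```python
class DedupQueue:
    """Hypothetical queue that ignores a task ID that is already pending."""

    def __init__(self):
        self._pending = []
        self._ids = set()

    def push(self, task_id):
        """Insert the task ID; return False if it was a duplicate."""
        if task_id in self._ids:
            return False
        self._ids.add(task_id)
        self._pending.append(task_id)
        return True

    def pop(self):
        """Remove and return the oldest pending task ID."""
        task_id = self._pending.pop(0)
        self._ids.discard(task_id)
        return task_id

    def __len__(self):
        return len(self._pending)


dedup_queue = DedupQueue()
first = dedup_queue.push("task-1-retry")   # original insertion
second = dedup_queue.push("task-1-retry")  # insertion repeated by the recovery path
print(first, second, len(dedup_queue))
```

Under this assumption, inserting the retry task a second time during recovery is harmless, which is why the recovery path can re-enqueue without first checking the queue.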

Checklist

  • My code adheres to the coding and style guidelines of the project.
  • I have performed a self-review of my code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • I have thoroughly tested my modifications and added tests when necessary.
  • Tests pass locally and in the CI.
  • I have assessed the performance impact of my modifications.

Review thread on Adaptors/MongoDB/src/TaskTable.cs (outdated, resolved)
Review thread on Common/src/Pollster/TaskHandler.cs (outdated, resolved)
@aneojgurhem aneojgurhem mentioned this pull request Sep 27, 2024
7 tasks
@aneojgurhem aneojgurhem merged commit b074b61 into main Sep 27, 2024
111 checks passed
@aneojgurhem aneojgurhem deleted the jg/fixcreating branch September 27, 2024 09:28
lemaitre-aneo added a commit that referenced this pull request Oct 8, 2024
# Motivation

The PR template will help to improve our documentation and explain more
clearly the reason and impact of what is happening on the repository.

# Description

Added file that adds a PR template for the repository. Got some ideas
from https:/pieterherman-dev/PR-Template-Guide/tree/main

# Testing

Template was used in another PR: #762.

# Impact

PR should not be approved without the template being filled.

# Checklist

- [ ] My code adheres to the coding and style guidelines of the project.
- [x] I have performed a self-review of my code.
- [ ] I have commented my code, particularly in hard-to-understand
areas.
- [ ] I have made corresponding changes to the documentation.
- [ ] I have thoroughly tested my modifications and added tests when
necessary.
- [x] Tests pass locally and in the CI.
- [ ] I have assessed the performance impact of my modifications.