Task remains in "running" state when last attempt timed out #79165

Closed
mikecote opened this issue Oct 1, 2020 · 4 comments · Fixed by #80681

mikecote commented Oct 1, 2020

This behavior is outlined in the documentation under limitations, but it will look very strange once we have a Task Manager UI.

Steps to reproduce:

  • Create a custom task definition with a runner that waits on a never-resolving promise: await new Promise(resolve => {});
  • Schedule a task of that type
  • Monitor the .kibana_task_manager document; once it reaches attempts: 3 (or 2?), you will see it remain in running status

NOTE: The default timeout and retry delays can be long. To test this, you can shorten them in the task definition (e.g. timeout: '10s') and in the task runner (e.g. 30s for the backoff multiple).
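The reproduction steps above can be sketched roughly as follows. This is only an illustration: the TaskRunner and TaskDefinition interfaces are local stand-ins written for this sketch, not imports from Kibana, and the neverResolvingTask name, its title, and maxAttempts: 3 are assumptions (the issue itself is unsure whether the count is 3 or 2).

```typescript
// Local stand-ins for the Task Manager types. Names and shapes are assumptions
// made for this sketch, not imports from Kibana.
interface TaskRunner {
  run: () => Promise<unknown>;
}

interface TaskDefinition {
  title: string;
  timeout: string; // shortened from the default so the timeout is hit quickly
  maxAttempts: number; // assumed value, see the note in the lead-in
  createTaskRunner: () => TaskRunner;
}

// Hypothetical task type whose runner never settles, so every attempt times out.
const neverResolvingTask: TaskDefinition = {
  title: 'Never resolving task',
  timeout: '10s',
  maxAttempts: 3,
  createTaskRunner: () => ({
    run: async () => {
      // The executor never calls resolve or reject, so this await never returns.
      await new Promise<void>(() => {});
    },
  }),
};

// In a real plugin this object would be registered with Task Manager and a task
// of this type scheduled. Here we only start one attempt; it stays pending forever.
const attempt = neverResolvingTask.createTaskRunner().run();
```

With a definition like this registered, each attempt would be expected to hit the timeout, which is the situation the last step asks you to watch for in the .kibana_task_manager document.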

mikecote added the Feature:Task Manager and Team:ResponseOps labels Oct 1, 2020
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris commented Oct 5, 2020

I've done some investigation and it looks like we can use the updateByQuery to solve this.
The query currently filters out tasks whose attempts have exceeded the max attempts. Instead, we can pick these up as well, and in the updateFields function, rather than blindly setting the status to claiming, we can write some Painless code that sets the status of those tasks to failed.

export const updateFields = (fieldUpdates: {
  [field: string]: string | number | Date;
}): ScriptClause => ({
  source: Object.keys(fieldUpdates)
    .map((field) => `ctx._source.task.${field}=params.${field};`)
    .join(' '),
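One way the proposal above might look is sketched here. It is not the change that was eventually merged. The ScriptClause shape, the updateFieldsOrFail name, and the single maxAttempts parameter are assumptions made for illustration, as is the choice of >= for the comparison. A real implementation would likely need a per-task-type limit.

```typescript
// Assumed shape of a script clause for this sketch (not copied from Kibana).
interface ScriptClause {
  source: string;
  lang: string;
  params: Record<string, string | number | Date>;
}

type FieldUpdates = { [field: string]: string | number | Date };

// Hypothetical variant of updateFields: tasks that have used up their attempts
// are marked as failed, all other matched tasks get the usual field updates.
const updateFieldsOrFail = (
  fieldUpdates: FieldUpdates,
  maxAttempts: number
): ScriptClause => {
  // Same per-field assignments as in the excerpt above.
  const assignments = Object.keys(fieldUpdates)
    .map((field) => `ctx._source.task.${field}=params.${field};`)
    .join(' ');

  return {
    source:
      `if (ctx._source.task.attempts >= params.maxAttempts) { ` +
      `ctx._source.task.status = 'failed'; ` +
      `} else { ${assignments} }`,
    lang: 'painless',
    // Note: a field named maxAttempts in fieldUpdates would be overwritten here.
    params: { ...fieldUpdates, maxAttempts },
  };
};

// Example: the script that would accompany a claim attempt.
const clause = updateFieldsOrFail({ status: 'claiming', ownerId: 'kibana-1' }, 3);
```

For this to have any effect, the query side of the updateByQuery would also have to stop excluding tasks that have exceeded their max attempts, as described in the comment.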

ymao1 self-assigned this Oct 13, 2020

ymao1 commented Oct 14, 2020

Holding off on this issue until #80371 is investigated/resolved. My concern is that if some natural limit in updateByQuery is preventing older docs from being returned/updated/sorted, then removing the clause that limits the search by maxAttempts would return even more newer, potentially failed docs, exacerbating the problem of older documents not being picked up.


pmuellr commented Oct 16, 2020

It seems that we found the issue with the zombie idle tasks in TM, so we can proceed here. There's a proposed fix in PR #80692. I suspect it will not affect this issue, so it seems like we can merge the fixes independently.

mikecote mentioned this issue Oct 26, 2020
kobelb added the needs-team label Jan 31, 2022
botelastic bot removed the needs-team label Jan 31, 2022