Metric-collector cronjob spawns unlimited jobs #659

Closed
epa095 opened this issue Jun 18, 2019 · 0 comments · Fixed by #660
epa095 commented Jun 18, 2019

/kind bug

What steps did you take and what happened:
Run a "high" number of parallel jobs relative to your cluster size.

What did you expect to happen:
Things to work, but slowly.

What happened:
The metric-collector cron jobs created by Katib keep spawning new jobs. Because the cluster is under pressure, these jobs don't complete before the next ones are created, so they pile up.

Proposed solution:
I know that there is an issue (#577) about changing to a push-based metric collector, but I think a short-term fix is to change the concurrency policy of the cron jobs to Forbid instead of the default Allow. Then at least only a single instance of each metric-collector job runs at a time.
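To illustrate the difference between the two policies, here is a toy simulation (not Katib code). It assumes every job takes a fixed number of schedule intervals to finish. The function name `simulate` and its parameters are made up for this sketch. On a real cluster, jobs would likely slow down further as they compete for resources.

```python
def simulate(policy, ticks, job_duration):
    """Toy model of a cron schedule.

    policy       -- "Allow" or "Forbid" (the two concurrency policies discussed here)
    ticks        -- number of scheduled run times to simulate
    job_duration -- how many ticks a job needs to finish (hypothetical, fixed)

    Returns the number of unfinished jobs right after each scheduled tick.
    """
    running = []   # finish times of jobs that are still running
    history = []
    for now in range(ticks):
        # Drop jobs that have finished by this tick.
        running = [end for end in running if end > now]
        # "Allow" always starts a new job; "Forbid" skips the run
        # while a previous job is still active.
        if policy == "Allow" or not running:
            running.append(now + job_duration)
        history.append(len(running))
    return history


if __name__ == "__main__":
    # A job that needs 10 intervals, observed over 6 scheduled runs.
    print("Allow: ", simulate("Allow", ticks=6, job_duration=10))
    print("Forbid:", simulate("Forbid", ticks=6, job_duration=10))
```

With these numbers, Allow ends with six unfinished jobs after six scheduled runs, while Forbid never has more than one.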

Environment:

  • Katib version: v0.1.2-alpha-156-g4ab3dbd
epa095 added a commit to epa095/katib that referenced this issue Jun 18, 2019
This changes the `spec.concurrencyPolicy` of the metric collector 
cron-job from "Allow" (default) to "Forbid". The cronjob used to
create a new job even if the previous job had not succeeded. On
high-load clusters this could lead to a high number of jobs which
never finished. 

This fixed kubeflow#659
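
For readers who have not seen the field, the following is a rough sketch of where `spec.concurrencyPolicy` sits in a CronJob manifest. It is not the manifest Katib generates: the function `metrics_collector_cronjob`, the job name, the schedule, and the image are placeholders invented for illustration. The default `apiVersion` reflects clusters from around the time of this issue; newer clusters serve CronJob under `batch/v1`.

```python
import json


def metrics_collector_cronjob(name, schedule, image, api_version="batch/v1beta1"):
    """Build a minimal CronJob manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": api_version,
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # The change described in the commit: skip a scheduled run
            # while the previous job is still active ("Allow" is the default).
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": name, "image": image}],
                            "restartPolicy": "Never",
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # All argument values below are placeholders, not Katib's real settings.
    manifest = metrics_collector_cronjob(
        name="example-metrics-collector",
        schedule="*/1 * * * *",
        image="example.invalid/metrics-collector:latest",
    )
    print(json.dumps(manifest, indent=2))
```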
epa095 added a commit to epa095/katib that referenced this issue Jun 22, 2019
k8s-ci-robot pushed a commit that referenced this issue Jun 27, 2019
This changes the `spec.concurrencyPolicy` of the metric collector
cron-job from "Allow" (default) to "Forbid". The cronjob used to
create a new job even if the previous job had not succeeded. On
high-load clusters this could lead to a high number of jobs which
never finished.

This fixed #659