Metric-collector cronjob spawns unlimited jobs #659
epa095 added a commit to epa095/katib that referenced this issue on Jun 18, 2019
This changes the `spec.concurrencyPolicy` of the metric collector cron-job from "Allow" (default) to "Forbid". The cronjob used to create a new job even if the previous job had not succeeded. On high-load clusters this could lead to a high number of jobs which never finished. This fixed kubeflow#659
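As a rough illustration of the change this commit message describes, the sketch below sets `concurrencyPolicy` on a CronJob manifest held as a plain Python dict. The manifest contents (name, schedule, API version) and the helper `forbid_concurrency` are hypothetical placeholders, not Katib's actual code; only the `spec.concurrencyPolicy` field and its `Allow`/`Forbid` values are taken from the text above.

```python
import copy


def forbid_concurrency(cronjob: dict) -> dict:
    """Return a copy of a CronJob manifest with concurrencyPolicy set to Forbid."""
    patched = copy.deepcopy(cronjob)  # leave the caller's manifest untouched
    patched.setdefault("spec", {})["concurrencyPolicy"] = "Forbid"
    return patched


# Hypothetical manifest for illustration only (assumed names and schedule).
example_cronjob = {
    "apiVersion": "batch/v1beta1",  # assumption: the CronJob API group in use in 2019
    "kind": "CronJob",
    "metadata": {"name": "example-metrics-collector"},
    "spec": {
        "schedule": "*/1 * * * *",
        # concurrencyPolicy is omitted here, so the default ("Allow") applies.
        "jobTemplate": {
            "spec": {
                "template": {
                    "spec": {"containers": [], "restartPolicy": "Never"}
                }
            }
        },
    },
}

patched_cronjob = forbid_concurrency(example_cronjob)
```

With `Forbid`, a scheduled run is skipped while the previous job of the same cron job is still active, so each cron job has at most one job in flight.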
epa095 added a commit to epa095/katib that referenced this issue on Jun 22, 2019
k8s-ci-robot pushed a commit that referenced this issue on Jun 27, 2019
/kind bug
What steps did you take and what happened:
Run a "high" number of parallel jobs relative to your cluster size.
What did you expect to happen:
Things to work, but slowly.
What happened:
The metric-collector cron jobs created by katib keep spawning new jobs, which don't complete before the next ones are created (since the cluster is under pressure).
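The pile-up can be illustrated with a toy simulation of the two cron-job concurrency policies, `Allow` (the default) and `Forbid`. This is a simplified model, not how Kubernetes or Katib is implemented: time advances in schedule ticks, every job takes a fixed `job_duration` ticks, and a duration longer than the simulated window stands in for jobs that never finish on an overloaded cluster. The function name and parameters are made up for this sketch.

```python
def simulate(policy: str, ticks: int, job_duration: int) -> int:
    """Return how many jobs are still running after `ticks` schedule intervals.

    One run is scheduled per tick; each started job needs `job_duration` ticks.
    """
    if policy not in ("Allow", "Forbid"):
        raise ValueError(f"unsupported policy: {policy}")
    running = []  # remaining ticks for each active job
    for _ in range(ticks):
        # Advance time by one tick and drop jobs that have finished.
        running = [left - 1 for left in running if left - 1 > 0]
        # Forbid skips the scheduled run while a previous job is still active.
        if policy == "Forbid" and running:
            continue
        running.append(job_duration)
    return len(running)


# Jobs that outlive the window accumulate under Allow but not under Forbid.
backlog_allow = simulate("Allow", ticks=10, job_duration=100)    # 10
backlog_forbid = simulate("Forbid", ticks=10, job_duration=100)  # 1
```

When jobs finish within one tick, both policies leave a single active job, which matches the expectation that the problem only appears once the cluster is under pressure.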
Proposed solution:
I know there is an issue (#577) about switching to a push-based metric collector, but I think a short-term fix is to change the concurrency policy of the cron jobs to `Forbid` instead of the default `Allow`. Then at least only a single instance of each metric-collector job is started at a time.
Environment: