Metric-collector cronjob spawns unlimited jobs #659

epa095 · 2019-06-18T11:34:40Z

/kind bug

What steps did you take and what happened:
Run a "high" number of parallel jobs relative to your cluster size.

What did you expect to happen:
Things to work, but slowly.

What happened:
The metric-collector cron jobs created by Katib keep spawning new jobs, which don't complete before the next ones are created (since the cluster is under pressure).

Proposed solution:
I know that there is an issue (#577) to change to a push-based metric collector, but I think a short-term fix is to change the concurrency policy of the cron jobs from the default Allow to Forbid. Then at most one instance of each metric-collector job runs at a time.
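To make the proposal concrete, here is a minimal sketch of a CronJob manifest with `spec.concurrencyPolicy` set to `Forbid`, built as a plain Python dict. This is not Katib's actual code: the function name, the resource name, the schedule, and the image are all hypothetical, and the `apiVersion` assumes a recent cluster (older clusters used `batch/v1beta1`).

```python
import json

# Values accepted by the Kubernetes CronJob spec.concurrencyPolicy field.
VALID_POLICIES = ("Allow", "Forbid", "Replace")


def metrics_collector_cronjob(name, schedule, image, concurrency_policy="Forbid"):
    """Build a CronJob manifest as a dict.

    Hypothetical helper for illustration only; names and values are assumptions.
    """
    if concurrency_policy not in VALID_POLICIES:
        raise ValueError(f"unknown concurrencyPolicy: {concurrency_policy!r}")
    return {
        "apiVersion": "batch/v1",  # assumption: older clusters use batch/v1beta1
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # "Forbid" skips a scheduled run while the previous job is still running.
            "concurrencyPolicy": concurrency_policy,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": "metrics-collector", "image": image}],
                            "restartPolicy": "Never",
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = metrics_collector_cronjob(
        "example-metrics-collector",                # hypothetical name
        "*/1 * * * *",                              # hypothetical schedule
        "example.invalid/metrics-collector:latest",  # hypothetical image
    )
    print(json.dumps(manifest, indent=2))
```

With `Forbid`, a scheduled run is skipped while the previous job is still running, so jobs should no longer pile up on a loaded cluster. The trade-off is that metrics may be collected less often than the schedule suggests.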

Environment:

Katib version: v0.1.2-alpha-156-g4ab3dbd

This changes the `spec.concurrencyPolicy` of the metric-collector cron job from "Allow" (the default) to "Forbid". The cron job used to create a new job even if the previous job had not finished. On high-load clusters this could lead to a large number of jobs that never finished. This fixes kubeflow#659

k8s-ci-robot added the kind/bug label Jun 18, 2019

epa095 mentioned this issue Jun 18, 2019

MetricController: Run only a single job per task #660

Merged

k8s-ci-robot closed this as completed in #660 Jun 27, 2019
