Add Metrics for Running PipelinesRuns at Pipeline and Namespace level #8280
Conversation
Hi @pramodbindal. Thanks for your PR. I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/kind feature
/test check-pr-has-kind-label
@khrm: The specified target(s) for `/test` were not found.
The following commands are available to trigger optional jobs: (list omitted)
In response to this: (comment omitted)
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The following is the coverage report on the affected files.
/test pull-tekton-pipeline-build-tests
@pramodbindal can you squash your commits? Otherwise LGTM.
Force-pushed from b802e66 to 6951973 (compare).
Done
- We might want to add a release note entry
- It seems it needs to be rebased?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@pritidesai @abayer
Force-pushed from 5c2ef54 to 4a05727 (compare).
Force-pushed from b0679b0 to 7891813 (compare).
Force-pushed from d3be2f8 to a73f04d (compare).
Thanks for the updates. I left some more comments.
Force-pushed from 863ebbd to d9c0b48 (compare).
Thanks for the updates @pramodbindal.
I went deeper in the code and tests, and found out that the current code does not work because it never receives the requested configuration.
Once that's fixed, I suspect there may be further issues as we don't specify the key when recording the metric, but that needs to be verified and we need better test coverage.
| Option | Value | Description |
| ------ | ----- | ----------- |
| metrics.taskrun.level | `taskrun` | Level of metrics is taskrun |
| metrics.taskrun.level | `task` | Level of metrics is task and taskrun label isn't present in the metrics |
| metrics.taskrun.level | `namespace` | Level of metrics is namespace, and task and taskrun label isn't present in the metrics |
NIT: "...and `task` and `taskrun` labels aren't present in the metrics".
The plural is missing in a few more lines in the table, but that's something we can fix in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is for PipelineRun only; there is no change in how TaskRun metrics are reported.
Yeah, I put the comment in the first line where this happens, but the same syntax error is in the pipelinerun lines too. As I said, it can be fixed separately
| Option | Value | Description |
| ------ | ----- | ----------- |
| metrics.pipelinerun.duration-type | `histogram` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type histogram |
| metrics.pipelinerun.duration-type | `lastvalue` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type gauge or lastvalue |
| metrics.count.enable-reason | `false` | Sets if the `reason` label should be included on count metrics |
| metrics.taskrun.throttle.enable-namespace | `false` | Sets if the `namespace` label should be included on the `tekton_pipelines_controller_running_taskruns_throttled_by_quota` metric |
NIT: extra spaces at the end of each line before the `|` sign.
`pkg/pipelinerunmetrics/metrics.go` (outdated)
```go
runningPRLevelPipelinerun = "pipelinerun"
runningPRLevelPipeline    = "pipeline"
runningPRLevelNamespace   = "namespace"
runningPRLevelBlank       = ""
```
NIT: maybe `runningPRLevelCluster`?
```go
	tag.Insert(namespaceTag, pr.Namespace),
	tag.Insert(pipelineTag, pipelineName),
}
if r.cfg != nil {
```
The `NewRecorder` function uses config to initialise views in the `viewRegister` function, but it does not store config as part of the recorder itself. This means that `r.cfg` here is always going to be `nil`.
This results in having the views set up correctly but the counting always being at cluster level.
Existing testing is unfortunately insufficient to see the issue, as the test case only sets up a single running PR, which does not allow for any aggregation or filtering to happen.
I tried the following change to the test:
```go
for _, pipelineRun := range []*v1.PipelineRun{
	newPipelineRun(corev1.ConditionTrue, "testns"),
	newPipelineRun(corev1.ConditionUnknown, "testns"),
	newPipelineRun(corev1.ConditionUnknown, "testns2"),
	newPipelineRun(corev1.ConditionFalse, "testns"),
} {
	if err := informer.Informer().GetIndexer().Add(pipelineRun); err != nil {
		t.Fatalf("Adding TaskRun to informer: %v", err)
	}
}
ctx = getConfigContextRunningPRLevel("namespace")
recorder, err := NewRecorder(ctx)
if err != nil {
	t.Fatalf("NewRecorder: %v", err)
}
if err := recorder.RunningPipelineRuns(informer.Lister()); err != nil {
	t.Errorf("RunningPipelineRuns: %v", err)
}
metricstest.CheckLastValueData(t, "running_pipelineruns", map[string]string{"namespace": "testns"}, 1)
metricstest.CheckLastValueData(t, "running_pipelineruns", map[string]string{"namespace": "testns2"}, 1)
```
which results in both pipelineruns being counted in each namespace:
```
/System/Volumes/Data/go/src/github.com/tektoncd/pipeline/pkg/pipelinerunmetrics/metrics_test.go:563: Reporter expected a different tag value for key metric running_pipelineruns key namespace got testns2 want testns
/System/Volumes/Data/go/src/github.com/tektoncd/pipeline/pkg/pipelinerunmetrics/metrics_test.go:563: Reporter.Report() wrong value metric running_pipelineruns got 2 want 1
/System/Volumes/Data/go/src/github.com/tektoncd/pipeline/pkg/pipelinerunmetrics/metrics_test.go:564: Reporter.Report() wrong value metric running_pipelineruns got 2 want 1
```
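A minimal sketch of the fix this review comment asks for, using simplified stand-in types rather than the actual Tekton `Recorder`: `NewRecorder` must keep a reference to the metrics config so later calls read the configured level from `r.cfg` instead of always seeing `nil`:

```go
package main

import "fmt"

// Config is a simplified stand-in for the Tekton metrics configuration.
type Config struct {
	RunningPipelinerunLevel string
}

// Recorder stores the config it was created with, so r.cfg is no longer nil.
type Recorder struct {
	cfg *Config
}

// NewRecorder keeps the config on the recorder itself, not only in view registration.
func NewRecorder(cfg *Config) *Recorder {
	return &Recorder{cfg: cfg}
}

// level reads the configured aggregation level, falling back to cluster level.
func (r *Recorder) level() string {
	if r.cfg == nil {
		return "" // cluster level fallback
	}
	return r.cfg.RunningPipelinerunLevel
}

func main() {
	r := NewRecorder(&Config{RunningPipelinerunLevel: "namespace"})
	fmt.Println(r.level()) // namespace
}
```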
`pkg/pipelinerunmetrics/metrics.go` (outdated)
```go
// If all PipelineRuns for a Pipeline are completed then record it with 0
// set the key to 0 so that we do not record further completed PipelineRuns for same Pipeline
```
I think the comment is a bit misleading here; it's not always "PipelineRuns for a Pipeline" as it depends on the config. Suggested wording:
> In case there are no running PipelineRuns for the specific key, set the metric value to 0 to ensure the metric is set for the key.
```go
// In case there are no running PipelineRuns for the pipelineRunKey, set the metric value to 0 to ensure the metric is set for the key.
```
```go
if !pr.IsDone() {
	countMap[pipelineRunKey]++
	metrics.Record(ctx_, runningPRs.M(float64(countMap[pipelineRunKey])))
```
The `countMap` is only complete at the end of the loop, so why is the metric reported here?
The metric might have different tags. For instance, in case we configure the metric to be at "namespace" level, it will have one counter for each namespace. How does the metric API know what tag value it is that we're counting here? Shouldn't we pass the `pipelineRunKey` somehow as well?
> The metric might have different tags. For instance, in case we configure the metric to be at "namespace" level, it will have one counter for each namespace. How does the metric API know what tag value it is that we're counting here? Shouldn't we pass the `pipelineRunKey` somehow as well?

`pipelineRunKey` is generated on the basis of the configuration: if the config is at namespace level, then the namespace is the key.
> The `countMap` is only complete at the end of the loop, so why is the metric reported here?

There is no way of knowing whether there are any more PipelineRuns for the same key. In case there are more PipelineRuns for the same key, the metric is reported again with the updated value on the next iteration; since we only consider the last value of the metric, it is updated accordingly.
```go
// If all PipelineRuns for a Pipeline are completed then record it with 0
// set the key to 0 so that we do not record further completed PipelineRuns for same Pipeline
if _, exists := countMap[pipelineRunKey]; !exists {
	countMap[pipelineRunKey] = 0
```
The `countMap` is only complete at the end of the loop, so why is the metric reported here?
Same reason as above: this code resets a metric that was already reported.
Let's assume we have 4 PRs running for a key at a given point in time, so the metric is reported as 4. On the next call to the function all PRs are completed, so they won't be reported in the `if` block, which means their metric would still show 4 on the graph, which is wrong data. Resetting the value to 0 here solves the issue.
If there is any running PR for the key, it will be handled by the `if` block; if not, we set it to 0 anyway.
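The counting-and-reset behaviour discussed in these threads can be sketched as follows. This is a simplified, illustrative model (the real code records per-key tagged metrics inside the loop rather than returning a map): every key seen is initialised to 0, and only non-completed runs increment it, so a key whose runs have all completed reports 0 instead of its stale last value:

```go
package main

import "fmt"

// pipelineRun is a simplified stand-in: key is the aggregation key derived
// from the configured level (namespace, pipeline, or pipelinerun).
type pipelineRun struct {
	key  string
	done bool
}

// runningCounts builds the per-key running count, resetting keys whose
// PipelineRuns are all completed to 0.
func runningCounts(prs []pipelineRun) map[string]float64 {
	countMap := map[string]float64{}
	for _, pr := range prs {
		if _, ok := countMap[pr.key]; !ok {
			countMap[pr.key] = 0 // completed-only keys still report 0
		}
		if !pr.done {
			countMap[pr.key]++
		}
	}
	return countMap
}

func main() {
	counts := runningCounts([]pipelineRun{
		{key: "testns", done: false},
		{key: "testns", done: false},
		{key: "testns2", done: false},
		{key: "testns", done: true},
	})
	fmt.Println(counts["testns"], counts["testns2"]) // 2 1
}
```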
```go
if err := recorder.RunningPipelineRuns(informer.Lister()); err != nil {
	t.Errorf("RunningPipelineRuns: %v", err)
}
metricstest.CheckLastValueData(t, "running_pipelineruns", map[string]string{"namespace": "testns", "pipeline": "anonymous"}, 1)
```
The tags `namespace` and `pipeline` are set correctly, but the value `1` is only correct because there is only one running pipeline in the cluster. Adding more running pipelines, whether in the same namespace or others, causes the test to fail. This is because of the issue with the configuration highlighted in `metrics.go`.
I think it would be nice to have this as a matrix test which covers various possible scenarios, with configuration set at cluster, namespace, pipeline and pipelinerun level, and different initial inputs for each.
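The suggested matrix test could be structured roughly like this. It is a self-contained sketch: `pr`, `keyFor`, and `counts` are illustrative stand-ins for the real recorder and test helpers, and the cases share one input set aggregated at each configured level:

```go
package main

import (
	"fmt"
	"reflect"
)

// pr is an illustrative stand-in for a PipelineRun in the informer.
type pr struct {
	ns, pipeline, name string
	running            bool
}

// keyFor derives the aggregation key from the configured level.
func keyFor(level string, p pr) string {
	switch level {
	case "namespace":
		return p.ns
	case "pipeline":
		return p.ns + "/" + p.pipeline
	case "pipelinerun":
		return p.ns + "/" + p.pipeline + "/" + p.name
	default:
		return "" // cluster level: one key for everything
	}
}

// counts aggregates running PipelineRuns per key for a given level.
func counts(level string, prs []pr) map[string]int {
	m := map[string]int{}
	for _, p := range prs {
		k := keyFor(level, p)
		if _, ok := m[k]; !ok {
			m[k] = 0
		}
		if p.running {
			m[k]++
		}
	}
	return m
}

func main() {
	prs := []pr{
		{"testns", "p1", "r1", true},
		{"testns", "p1", "r2", true},
		{"testns2", "p2", "r3", true},
		{"testns", "p1", "r4", false},
	}
	// One case per level, each with its own expected per-key counts.
	for _, tc := range []struct {
		level string
		want  map[string]int
	}{
		{"", map[string]int{"": 3}},
		{"namespace", map[string]int{"testns": 2, "testns2": 1}},
		{"pipeline", map[string]int{"testns/p1": 2, "testns2/p2": 1}},
	} {
		fmt.Println(tc.level, reflect.DeepEqual(counts(tc.level, prs), tc.want))
	}
}
```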
Will it be ok if I provide those tests in a new PR?
/hold
The logic to set up the key from config and to report the metrics needs to be fixed first, and test coverage has to be extended.
Currently, the Running Pipelines count metric is shown at cluster level. We have added the PipelineRun metric at pipeline and namespace level. (Also: fix lint errors.)
/retest
Force-pushed from 785e2e6 to d1757a6 (compare).
Changes
The default is `""`, but it can be set to `namespace` or `Pipelinerun`.
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
- `/kind <type>`. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes