
RunningPipelineRun metric is racy #2729

Closed
mattmoor opened this issue Jun 2, 2020 · 9 comments · Fixed by #2760 or #2762
Labels: kind/bug

Comments

@mattmoor
Member

mattmoor commented Jun 2, 2020

I was looking at the PipelineRun controller, and noticed that there is a metric for the number of running PipelineRuns:

```go
if updated {
	go func(metrics *Recorder) {
		err := metrics.RunningPipelineRuns(c.pipelineRunLister)
		if err != nil {
			c.Logger.Warnf("Failed to log the metrics : %v", err)
		}
	}(c.metrics)
}
```

This metric is recomputed when one of the reconciler's trailing updates modifies the resource (status or labels/annotations), but it is computed from the informer's lister cache. Informers are fast, but in all likelihood the lister driving this metric has not yet received the updated version of the resource from the API server, so unless the informer cache's copy is being modified directly (a no-no), the cache won't yet reflect the change that the `if updated` check is clearly trying to capture.
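
For context, the gauge is conceptually just a count over the lister. A rough sketch (illustrative only, not the actual `Recorder` implementation; it assumes the generated `PipelineRunLister` and `k8s.io/apimachinery/pkg/labels`) makes it clear that the reported value is whatever the informer cache happens to hold at that instant:

```go
// countRunning is a sketch of what a lister-driven gauge computes: the number
// of PipelineRuns in the informer cache whose Succeeded condition is still Unknown.
func countRunning(lister listers.PipelineRunLister) (int, error) {
	prs, err := lister.List(labels.Everything())
	if err != nil {
		return 0, err
	}
	running := 0
	for _, pr := range prs {
		if !pr.IsDone() { // still running
			running++
		}
	}
	// Whatever is recorded from here reflects the cache at this instant, which
	// may still predate the update the reconciler just wrote through the client.
	return running, nil
}
```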

Perhaps this metric could even be completely decoupled from this reconciler? It seems like fairly separable logic that's needlessly coupled to reconciliation. 🤷

@vdemeester
Member

/kind feature

tekton-robot added the kind/feature label Jun 2, 2020
@mattmoor
Member Author

mattmoor commented Jun 2, 2020

I'd probably say /kind bug? 😇

@vdemeester
Member

vdemeester commented Jun 2, 2020

@mattmoor in my defense, I was on PTO; my brain is not completely wired back correctly, it seems 😹
/kind bug
/remove-kind feature

tekton-robot added the kind/bug label Jun 2, 2020
vdemeester removed the kind/feature label Jun 2, 2020
@tekton-robot
Collaborator

@vdemeester: Those labels are not set on the issue: kind/feature

In response to this:

@mattmoor in my defense, I was on PTO; my brain is not completely wired back correctly, it seems 😹
/kind bug
/remove-kind feature

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mattmoor
Member Author

mattmoor commented Jun 2, 2020

unless the informer cache's copy is being modified (a no-no)

I am seeing the informer's copy being modified in both `updateStatus` and `updateLabelsAndAnnotations`.

@mattmoor
Member Author

mattmoor commented Jun 2, 2020

This modifies a resource fetched from a lister (informer cache):

`newPr.Status = pr.Status`

@mattmoor
Member Author

mattmoor commented Jun 2, 2020

You can watch things explode by adding a line like `pr := pr.DeepCopy()` to the top of each of these loops:

`for _, pr := range d.PipelineRuns {`

mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 3, 2020
In each of the `{Pipeline,Task}Run` reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method.

In pseudo-code:

```go
func update(fr *FooRun) error {
  newFr := lister.Get(fr.Name)

  if !reflect.DeepEqual(newFr.Field, fr.Field) {
    newFr.Field = fr.Field   // This mutates the informer's copy!
    return client.Update(newFr)
  }
  return nil
}
```

I have worked around this in two different ways:

1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy (see the sketch just after this list).

2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks.  This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update.  This also works around some fun interactions with the test code (see fixed issue).
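
For illustration, here is workaround `1.` applied to the pseudo-code above, using the same made-up `FooRun`/`lister`/`client` names (a sketch, not real Tekton API calls); workaround `2.` would swap the `Update` call for a `Patch` of just the labels/annotations:

```go
func update(fr *FooRun) error {
  newFr := lister.Get(fr.Name)

  if !reflect.DeepEqual(newFr.Field, fr.Field) {
    newFr = newFr.DeepCopy()  // mutate a private copy, never the informer's
    newFr.Field = fr.Field
    return client.Update(newFr)
  }
  return nil
}
```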

There are two other notable aspects to this change:

1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered.  I changed most of these to refetch through the client.
2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified [here](tektoncd#2729) as racy, since that race was likely being masked by the mutation of the informer copies.

Fixes: tektoncd#2734
@mattmoor
Member Author

mattmoor commented Jun 3, 2020

Alright, so #2736 should address the modifications of the resources in the informer cache, but I didn't touch this code because no tests blew up (yet 🤞 ) and I feel like fixing this race needs a deeper design conversation.

Not sure how folks want to proceed, but that PR will almost certainly make this metric flakier until this issue is fixed.

tekton-robot pushed a commit that referenced this issue Jun 3, 2020
@mattmoor
Member Author

mattmoor commented Jun 4, 2020

One potential solution would be to replace the current logic with a simple periodic scan. You'd lose sub-period resolution, but it avoids doing a full O(n) lister scan keyed off of updates to any of those O(n) resources (effectively O(n²) work today).

If we launch a goroutine with a `for { <-time.After(period) }` loop in controller.go, it can key off of the same lister.
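
A minimal sketch of that wiring (illustrative only; `ctx`, `period`, `metrics`, `pipelineRunLister`, and `logger` stand in for whatever controller.go actually has in scope):

```go
// Re-report the gauge from the shared lister on a fixed period, independent of
// reconciliation, and stop when the controller's context is cancelled.
go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(period):
			if err := metrics.RunningPipelineRuns(pipelineRunLister); err != nil {
				logger.Warnf("Failed to log the metrics: %v", err)
			}
		}
	}
}()
```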

mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 5, 2020
This change migrates the PipelineRun controller to use `// +genreconciler`,
which codifies many of the best practices we have adopted across Knative.

There are two key pieces to this change:

1. Change the RunningPipelineRun (and the TaskRun equiv) metric reporting to
   happen on a simple 30s period (see fixed issue below).

2. Change the way we update labels *back* to an Update (vs. Patch).

The bulk of the diff (that's not generated) is due to `2.`.  I had introduced the
Patch in a previous commit in order to avoid optimistic concurrency failures
and to avoid needing to update the lister cache for it to properly reflect the
preceding status update.  With the genreconciler transition the status update now
happens second, so I had to bite the bullet.

Now this was not as simple as just updating the informer cache due to two gaps in
the fake client:
1. Status sub-resource updates modify the entire resource (mismatches K8s).
2. Fake clients don't update or check ResourceVersion for optimistic concurrency.

The other major difference here is that genreconciler uses its original copy of
the resource when it updates status, so the intervening label update isn't even
refetched from the informer cache before the status update is attempted.  However,
because of the gaps above, the update was allowed to happen despite the fact that
it should have failed optimistic concurrency, and should only have updated status.

To address this, I actually taught the fakes to simulate what will happen in reality:
1. The label update will go through, and bump the resource version,
2. The status update will attempt to happen with the stale resource version and FAIL, but
3. The genreconciler's built-in retry logic will notice the conflict, refetch the resource
   through the client, and retry the status update (this time succeeding).

This is done through new reactors in `test/controller.go`, and should now happen
for all of the resources in the Pipeline client (when set up through the seed function).

Fixes: tektoncd#2729
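
For readers curious what "teaching the fakes" might look like, here is a rough sketch of just the optimistic-concurrency part (the resourceVersion check), built on client-go's testing helpers. It is illustrative only: the function name and structure are assumptions, not the actual reactors added to `test/controller.go`.

```go
import (
	"errors"
	"strconv"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
	ktesting "k8s.io/client-go/testing"
)

// enforceResourceVersion makes a fake clientset behave more like the real API
// server for updates: writes carrying a stale resourceVersion are rejected
// with a Conflict, and accepted writes get their version bumped.
func enforceResourceVersion(fake *ktesting.Fake, tracker ktesting.ObjectTracker) {
	fake.PrependReactor("update", "*", func(action ktesting.Action) (bool, runtime.Object, error) {
		ua := action.(ktesting.UpdateAction)
		obj, err := meta.Accessor(ua.GetObject())
		if err != nil {
			return false, nil, nil // not inspectable; let the default reactor handle it
		}
		existing, err := tracker.Get(action.GetResource(), action.GetNamespace(), obj.GetName())
		if err != nil {
			return false, nil, nil // e.g. not found; let the default reactor decide
		}
		cur, _ := meta.Accessor(existing)

		if obj.GetResourceVersion() != cur.GetResourceVersion() {
			return true, nil, apierrors.NewConflict(
				action.GetResource().GroupResource(), obj.GetName(),
				errors.New("resourceVersion mismatch"))
		}

		// Accept the write and bump the version, as the real API server would,
		// then fall through so the default reactor stores the object.
		rv, _ := strconv.Atoi(cur.GetResourceVersion())
		obj.SetResourceVersion(strconv.Itoa(rv + 1))
		return false, nil, nil
	})
}
```

With something along these lines in place, the label update succeeds and bumps the version, the stale status update fails with a Conflict, and genreconciler's retry refetches through the client and succeeds, matching the three steps above.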
mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 5, 2020
This changes the metric reporting to happen periodically vs. being
triggered by reconciliation.  The previous method was prone to stale
data because it was driven by the informer cache immediately following
writes through the client, and might not fix itself since the status
may not change (even on resync every 10h).  With this change it should
exhibit the correct value within 30s + {informer delay}, where the 30s
is configurable on the Recorders.

Fixes: tektoncd#2729
mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 5, 2020
mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 5, 2020
mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 8, 2020
mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 8, 2020
tekton-robot pushed a commit that referenced this issue Jun 8, 2020
mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 8, 2020
tekton-robot pushed a commit that referenced this issue Jun 8, 2020