Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tekton modifies objects in the informer cache #2734

Closed
mattmoor opened this issue Jun 3, 2020 · 3 comments · Fixed by #2736
Closed

Tekton modifies objects in the informer cache #2734

mattmoor opened this issue Jun 3, 2020 · 3 comments · Fixed by #2736
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mattmoor
Copy link
Member

mattmoor commented Jun 3, 2020

Expected Behavior

Tekton should .DeepCopy() any resource that is fetched from an informer's Lister, otherwise the resource in the underlying cache is altered and bad things result (TBH, I'm not even sure of the extent, but this is a "no no").

Actual Behavior

In a few places Tekton updates the resource returned from the informer cache.

Steps to Reproduce the Problem

I outlined this a bit here: #2729 (comment), and also highlighted a few of the problem spots.

@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

/kind bug

I'm currently looking at this in spare cycles, but would be open to handing off to someone with more bandwidth to run down the issues this has dislodged.

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 3, 2020
@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

One complication here is that the single control loop is performing multiple Lister reads -> Client writes, but given that the fakes aren't updating the informers index during the client writes the second Lister read -> Client write reads a stale version of the object from the Lister cache.

@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

One partial solution to the above (since the same will happen during normal reconciliation) is to use a Patch like this, which bypasses the stale read and optimistic concurrency checks to update the labels and annotations:

-               newPr.ObjectMeta.Labels = pr.ObjectMeta.Labels
-               newPr.ObjectMeta.Annotations = pr.ObjectMeta.Annotations
-               return c.PipelineClientSet.TektonV1beta1().PipelineRuns(pr.Namespace).Update(newPr)
+               mergePatch := map[string]interface{}{
+                       "metadata": map[string]interface{}{
+                               "labels":      pr.ObjectMeta.Labels,
+                               "annotations": pr.ObjectMeta.Annotations,
+                       },
+               }
+               patch, err := json.Marshal(mergePatch)
+               if err != nil {
+                       return nil, err
+               }
+               return c.PipelineClientSet.TektonV1beta1().PipelineRuns(pr.Namespace).Patch(pr.Name, types.MergePatchType, patch)

I'm working through some other test failures as well, at least one of which is a test bug that was masked by this bug.

mattmoor added a commit to mattmoor/pipeline that referenced this issue Jun 3, 2020
In each of the `{Pipeline,Task}Run` reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method.

In pseudo-code:

```go
func update(fr *FooRun) {
  newFr := lister.Get(fr.Name)

  if reflect.DeepEqual(newFr.Field, fr.Field) {
    newFr.Field = fr.Field   // This modified the informer's copy!
    return client.Update(newFr)
  }
}
```

I have worked around this in two different ways:

1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy.

2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks.  This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update.  This also works around some fun interactions with the test code (see fixed issue).

There are two other notable aspects to this change:

1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered.  I changed most of these to refetch through the client.
2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified [here](tektoncd#2729) as racy, which is likely masked by the mutation of the informer copies.

Fixes: tektoncd#2734
tekton-robot pushed a commit that referenced this issue Jun 3, 2020
In each of the `{Pipeline,Task}Run` reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method.

In pseudo-code:

```go
func update(fr *FooRun) {
  newFr := lister.Get(fr.Name)

  if reflect.DeepEqual(newFr.Field, fr.Field) {
    newFr.Field = fr.Field   // This modified the informer's copy!
    return client.Update(newFr)
  }
}
```

I have worked around this in two different ways:

1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy.

2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks.  This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update.  This also works around some fun interactions with the test code (see fixed issue).

There are two other notable aspects to this change:

1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered.  I changed most of these to refetch through the client.
2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified [here](#2729) as racy, which is likely masked by the mutation of the informer copies.

Fixes: #2734
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants