
Flakey sidecar container test #2656

Closed
bobcatfish opened this issue May 20, 2020 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
kind/flake Categorizes issue or PR as related to a flakey test
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@bobcatfish (Collaborator)

Expected Behavior

TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete should reliably pass unless there is a bug

Actual Behavior

TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete failed on #2652 with this error in the logs:

=== RUN   TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:82: Creating Task "sidecar-test-task-0"
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:87: Creating TaskRun "sidecar-test-task-run-0"
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:111: Error waiting for Pod "sidecar-test-task-run-0-pod-f47gv" to terminate both the primary and sidecar containers: timed out waiting for the condition
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:132: Sidecar container has a nil Terminated status but non-nil is expected.
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:140: Either the primary or sidecar containers did not terminate
=== RUN   TestSidecarTaskSupport/A_sidecar_that_terminates_early_does_not_cause_problems_running_Steps
    TestSidecarTaskSupport/A_sidecar_that_terminates_early_does_not_cause_problems_running_Steps: sidecar_test.go:82: Creating Task "sidecar-test-task-1"
    TestSidecarTaskSupport/A_sidecar_that_terminates_early_does_not_cause_problems_running_Steps: sidecar_test.go:87: Creating TaskRun "sidecar-test-task-run-1"
--- FAIL: TestSidecarTaskSupport (610.28s)

In the k8s events you can see:

arendelle-mtdlj    27m         Normal    Scheduled                 pod/sidecar-test-task-run-0-pod-f47gv                                     Successfully assigned arendelle-mtdlj/sidecar-test-task-run-0-pod-f47gv to gke-tpipeline-e2e-cls126-default-pool-c2eb233a-b7p5
arendelle-mtdlj    27m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Container image "gcr.io/tekton-prow-9/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:4442c3261e3cdadd00c9d3581cccd00a8d68af741767611d405f2e4c0f36a660" already present on machine
arendelle-mtdlj    27m         Normal    Created                   pod/sidecar-test-task-run-0-pod-f47gv                                     Created container place-tools
arendelle-mtdlj    27m         Normal    Started                   pod/sidecar-test-task-run-0-pod-f47gv                                     Started container place-tools
arendelle-mtdlj    27m         Normal    Pulling                   pod/sidecar-test-task-run-0-pod-f47gv                                     Pulling image "busybox:1.31.0-musl"
arendelle-mtdlj    27m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Successfully pulled image "busybox:1.31.0-musl"
arendelle-mtdlj    27m         Normal    Created                   pod/sidecar-test-task-run-0-pod-f47gv                                     Created container step-primary
arendelle-mtdlj    27m         Normal    Started                   pod/sidecar-test-task-run-0-pod-f47gv                                     Started container step-primary
arendelle-mtdlj    27m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Container image "busybox:1.31.0-musl" already present on machine
arendelle-mtdlj    26m         Normal    Created                   pod/sidecar-test-task-run-0-pod-f47gv                                     Created container sidecar-sidecar-container
arendelle-mtdlj    27m         Normal    Started                   pod/sidecar-test-task-run-0-pod-f47gv                                     Started container sidecar-sidecar-container
arendelle-mtdlj    27m         Normal    Killing                   pod/sidecar-test-task-run-0-pod-f47gv                                     Container sidecar-sidecar-container definition changed, will be restarted
arendelle-mtdlj    26m         Normal    Pulling                   pod/sidecar-test-task-run-0-pod-f47gv                                     Pulling image "tianon/true@sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b"
arendelle-mtdlj    26m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Successfully pulled image "tianon/true@sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b"
arendelle-mtdlj    26m         Warning   Failed                    pod/sidecar-test-task-run-0-pod-f47gv                                     Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

Particularly interesting:

Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

Additional Info

It looks like #1253 had this same problem. Perhaps it's some kind of race condition related to switching to the nop image? (Or that error is a red herring, and it is just what happens when we kill the sidecar.)
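
For anyone reading along: sidecars get stopped by editing the running Pod so the sidecar container's image points at the nop image; the kubelet then restarts the container with an image that exits immediately (the TaskRun status later in this thread even says "Sidecar container successfully stopped by nop image"). The snippet below is a rough, illustrative sketch of that approach, not Tekton's actual implementation; the function name, the isSidecar callback, and the nopImage parameter are made up for the example.

    package sketch

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // stopSidecars is an illustrative sketch, not Tekton's real helper: it swaps
    // each sidecar container's image for the nop image and updates the Pod. The
    // kubelet sees the definition change ("Container sidecar-sidecar-container
    // definition changed, will be restarted" in the events above) and restarts
    // the container. Because the container's original command (e.g. "sh") is kept
    // while the nop image has no shell, that restart can fail with the
    // exec: "sh": executable file not found in $PATH error quoted above.
    func stopSidecars(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, nopImage string, isSidecar func(containerName string) bool) error {
        changed := false
        for i := range pod.Spec.Containers {
            if isSidecar(pod.Spec.Containers[i].Name) && pod.Spec.Containers[i].Image != nopImage {
                pod.Spec.Containers[i].Image = nopImage
                changed = true
            }
        }
        if !changed {
            return nil
        }
        _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
        return err
    }

Whether that restart path is actually the source of the flake, or just noise, is what the rest of this thread tries to pin down.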

@bobcatfish bobcatfish assigned bobcatfish and unassigned bobcatfish May 20, 2020
@bobcatfish bobcatfish added the kind/bug Categorizes issue or PR as related to a bug. label May 20, 2020
@jlpettersson (Member)

Also #1819

@GregDritschler (Contributor) commented May 28, 2020

I looked at some build logs where this test failed and noticed that the sidecar test pod has a status of RunContainerError:

arendelle-lvq6p    pod/sidecar-test-task-run-0-pod-knt5v                                0/2     RunContainerError    1          30m

In logs where the test passed, the pod has a status of Error. The difference is that RunContainerError appears to be reported when the container status is waiting rather than terminated.

There isn't a lot of information out there about these different reasons and why you might get one vs another. I tried looking at the Kubernetes code but I'm not familiar enough with it to understand what might cause one vs the other.

I've not been able to reproduce this problem myself after running the test on my own clusters dozens of times. When I run it on my IBM cluster, the pod has status StartError instead of Error, but the container status is still terminated. When I run it on my GKE cluster, the pod has status ContainerCannotRun, but the container status is still terminated.

If someone is able to reproduce it, it would be interesting to display the pod and confirm that the sidecar container is in the waiting state, and then to wait and see whether Kubernetes does anything with the pod or it just sits there like that.

P.S. The message

Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

appears when the test passes as well. Somehow Kubernetes is reacting differently to it sometimes.
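
To make that inspection concrete, here is a small client-go sketch (illustrative only; the clientset c, namespace ns, and pod name are assumed to come from wherever you can reach the failing cluster) that prints whether each container in the pod is Waiting, Terminated, or still Running, along with the reported reason:

    package sketch

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // dumpContainerStates prints the state of every container in the pod. In the
    // flaky runs described above, the sidecar container would be expected to show
    // up as Waiting (reason RunContainerError) instead of Terminated.
    func dumpContainerStates(ctx context.Context, c kubernetes.Interface, ns, podName string) error {
        pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, cs := range pod.Status.ContainerStatuses {
            switch {
            case cs.State.Terminated != nil:
                fmt.Printf("%s: terminated (reason=%s, exitCode=%d)\n", cs.Name, cs.State.Terminated.Reason, cs.State.Terminated.ExitCode)
            case cs.State.Waiting != nil:
                fmt.Printf("%s: waiting (reason=%s, message=%s)\n", cs.Name, cs.State.Waiting.Reason, cs.State.Waiting.Message)
            default:
                fmt.Printf("%s: running\n", cs.Name)
            }
        }
        return nil
    }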

@vincent-pli (Member)

Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

This error will always be there; running any command with tianon/true causes a similar exception, since there is nothing in that image: no $PATH, no filesystem.

@vdemeester (Member)

@bobcatfish is this still an issue?
/kind flake

@tekton-robot tekton-robot added the kind/flake Categorizes issue or PR as related to a flakey test label Jul 24, 2020
@bobcatfish (Collaborator, Author)

I'm not sure; let's close it and we can re-open if/when it happens again.

@jerop (Member) commented Sep 2, 2020

The test flaked again in #3135 -- https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/3135/pull-tekton-pipeline-integration-tests/1301144752749547520/

/reopen

@tekton-robot tekton-robot reopened this Sep 2, 2020
@tekton-robot (Collaborator)

@jerop: Reopened this issue.


@popcor255 (Member) commented Sep 24, 2020

@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2020
@tekton-robot (Collaborator)

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2021
@tekton-robot (Collaborator)

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot (Collaborator)

@tekton-robot: Closing this issue.


@ghost commented Nov 18, 2021

/reopen

There was an instance of this failing in #4380

@tekton-robot tekton-robot reopened this Nov 18, 2021
@tekton-robot (Collaborator)

@sbwsg: Reopened this issue.


@ghost commented Nov 18, 2021

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 18, 2021
@ghost commented Nov 18, 2021

Copy/pasting my notes from #4380:

Here's a permanent link to the failed test run output.

Failed Test: test/v1alpha1 TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete

Ran for 10m12s according to the test runner. The TaskRun under test has its timeout field set to 1m, so that's quite a disparity.

The test runner reported the following:

    sidecar_test.go:100: Creating Task "sidecar-test-task-0"
    sidecar_test.go:105: Creating TaskRun "sidecar-test-task-run-0"
    sidecar_test.go:129: Error waiting for Pod "sidecar-test-task-run-0-pod" to terminate both the primary and sidecar containers: timed out waiting for the condition
    sidecar_test.go:150: Sidecar container has a nil Terminated status but non-nil is expected.
    sidecar_test.go:158: Either the primary or sidecar containers did not terminate

Both the sidecar and steps are reported as successful in the captured TaskRun, and both have a non-nil terminated field:

        status:
          completionTime: "2021-11-17T21:34:31Z"
          conditions:
          - lastTransitionTime: "2021-11-17T21:34:31Z"
            message: All Steps have completed executing
            reason: Succeeded
            status: "True"
            type: Succeeded
          podName: sidecar-test-task-run-0-pod
          sidecars:
          - container: sidecar-sidecar-container
            imageID: docker-pullable://gcr.io/tekton-prow-6/tpipeline-e2e-img/nop-8eac7c133edad5df719dc37b36b62482@sha256:acb6b4cd4957beee021d31a3ac1d533ecd4f3f35085a8881068a57e63953a14d
            name: sidecar-container
            terminated:
              containerID: docker://5e9c5e88e389caac2efad56790b50fb1bcfe3c575ec96fe68b10b33005e4fde7
              exitCode: 0
              finishedAt: null
              message: Sidecar container successfully stopped by nop image
              reason: Completed
              startedAt: null
          startTime: "2021-11-17T21:34:22Z"
          steps:
          - container: step-primary
            imageID: docker-pullable://busybox@sha256:e7157b6d7ebbe2cce5eaa8cfe8aa4fa82d173999b9f90a9ec42e57323546c353
            name: primary
            terminated:
              containerID: docker://55fb66dd40f5d695ea7378f1ed51d5eb895e37fe9cc117e3868f18ed6951ef61
              exitCode: 0
              finishedAt: "2021-11-17T21:34:30Z"
              reason: Completed
              startedAt: "2021-11-17T21:34:30Z"

The creation timestamp on the TaskRun is 2021-11-17T21:34:20Z, so it looks like it took about 10 seconds from creation to termination, at least as far as the cluster under test was concerned.
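
For reference, the failing check here (sidecar_test.go waiting for both the primary and sidecar containers to terminate) is essentially a poll against the pod's container statuses. A rough sketch of that kind of wait, assuming a client-go clientset; this is not the actual test helper:

    package sketch

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForAllContainersTerminated polls the pod until every container status
    // reports a non-nil Terminated state, or the timeout expires with the
    // "timed out waiting for the condition" error seen in the test output above.
    func waitForAllContainersTerminated(ctx context.Context, c kubernetes.Interface, ns, podName string, timeout time.Duration) error {
        return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
            pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            if len(pod.Status.ContainerStatuses) == 0 {
                return false, nil
            }
            for _, cs := range pod.Status.ContainerStatuses {
                if cs.State.Terminated == nil {
                    return false, nil // keep polling until every container has terminated
                }
            }
            return true, nil
        })
    }

Given that the TaskRun above shows both containers terminated on the cluster within about 10 seconds, the puzzle is why a poll like this never observed that state before the ten-minute timeout, which is what the next comment digs into.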

@ghost commented Nov 18, 2021

I looked up the logs from the test runner and the logs from the temporary Boskos cluster. The Boskos cluster logs are filtered down to the exact namespace where Prow reported the failed TaskRuns executing. The timestamps of the log messages from Prow are fully ten minutes ahead of the log messages from the Pod in the temporary Boskos cluster.

Prow reports test RUN at: 2021-11-17 16:44:32.717 EST
Prow reports test FAIL at: 2021-11-17 16:44:32.725 EST

TaskRun Pod first log at: 2021-11-17 16:34:25.314 EST
TaskRun Pod last log at: 2021-11-17 16:34:31.031 EST

I'm not familiar enough with log collection in GCP or our Prow cluster to definitively say that's abnormal, but it strikes me as really weird!

Also weird: Prow reports the RUN and FAIL within 8 milliseconds of each other despite reporting a 10m12s timeout in the build-log.txt!

Theory: 10m12s is the exact delta between the Prow log line reporting FAIL and the creationTimestamp on the TaskRun. I wonder if our test compares the TaskRun's metadata.creationTimestamp against time.Now() and fails the test immediately.

Edit: I looked into this ^ briefly, but our test code uses polling, which doesn't involve comparing against any timestamps in the created resources. At least on first look, this doesn't appear to be a likely culprit.

@jerop jerop mentioned this issue Feb 4, 2022
@jerop jerop added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2022
@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 25, 2022
@abayer (Contributor) commented Jul 14, 2022

This probably has the same root cause as #4169.

@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022
@tekton-robot (Collaborator)

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 11, 2022
@tekton-robot (Collaborator)

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot (Collaborator)

@tekton-robot: Closing this issue.

