
Flakey sidecar container test #2656

Closed
bobcatfish opened this issue May 20, 2020 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
kind/flake Categorizes issue or PR as related to a flakey test
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@bobcatfish (Collaborator)

Expected Behavior

TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete should reliably pass unless there is a bug

Actual Behavior

TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete failed on #2652 with this error in the logs:

=== RUN   TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:82: Creating Task "sidecar-test-task-0"
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:87: Creating TaskRun "sidecar-test-task-run-0"
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:111: Error waiting for Pod "sidecar-test-task-run-0-pod-f47gv" to terminate both the primary and sidecar containers: timed out waiting for the condition
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:132: Sidecar container has a nil Terminated status but non-nil is expected.
    TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete: sidecar_test.go:140: Either the primary or sidecar containers did not terminate
=== RUN   TestSidecarTaskSupport/A_sidecar_that_terminates_early_does_not_cause_problems_running_Steps
    TestSidecarTaskSupport/A_sidecar_that_terminates_early_does_not_cause_problems_running_Steps: sidecar_test.go:82: Creating Task "sidecar-test-task-1"
    TestSidecarTaskSupport/A_sidecar_that_terminates_early_does_not_cause_problems_running_Steps: sidecar_test.go:87: Creating TaskRun "sidecar-test-task-run-1"
--- FAIL: TestSidecarTaskSupport (610.28s)

In the k8s events you can see:

arendelle-mtdlj    27m         Normal    Scheduled                 pod/sidecar-test-task-run-0-pod-f47gv                                     Successfully assigned arendelle-mtdlj/sidecar-test-task-run-0-pod-f47gv to gke-tpipeline-e2e-cls126-default-pool-c2eb233a-b7p5
arendelle-mtdlj    27m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Container image "gcr.io/tekton-prow-9/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:4442c3261e3cdadd00c9d3581cccd00a8d68af741767611d405f2e4c0f36a660" already present on machine
arendelle-mtdlj    27m         Normal    Created                   pod/sidecar-test-task-run-0-pod-f47gv                                     Created container place-tools
arendelle-mtdlj    27m         Normal    Started                   pod/sidecar-test-task-run-0-pod-f47gv                                     Started container place-tools
arendelle-mtdlj    27m         Normal    Pulling                   pod/sidecar-test-task-run-0-pod-f47gv                                     Pulling image "busybox:1.31.0-musl"
arendelle-mtdlj    27m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Successfully pulled image "busybox:1.31.0-musl"
arendelle-mtdlj    27m         Normal    Created                   pod/sidecar-test-task-run-0-pod-f47gv                                     Created container step-primary
arendelle-mtdlj    27m         Normal    Started                   pod/sidecar-test-task-run-0-pod-f47gv                                     Started container step-primary
arendelle-mtdlj    27m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Container image "busybox:1.31.0-musl" already present on machine
arendelle-mtdlj    26m         Normal    Created                   pod/sidecar-test-task-run-0-pod-f47gv                                     Created container sidecar-sidecar-container
arendelle-mtdlj    27m         Normal    Started                   pod/sidecar-test-task-run-0-pod-f47gv                                     Started container sidecar-sidecar-container
arendelle-mtdlj    27m         Normal    Killing                   pod/sidecar-test-task-run-0-pod-f47gv                                     Container sidecar-sidecar-container definition changed, will be restarted
arendelle-mtdlj    26m         Normal    Pulling                   pod/sidecar-test-task-run-0-pod-f47gv                                     Pulling image "tianon/true@sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b"
arendelle-mtdlj    26m         Normal    Pulled                    pod/sidecar-test-task-run-0-pod-f47gv                                     Successfully pulled image "tianon/true@sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b"
arendelle-mtdlj    26m         Warning   Failed                    pod/sidecar-test-task-run-0-pod-f47gv                                     Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

Particularly interesting:

Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

Additional Info

It looks like #1253 had this same problem. Perhaps it's some kind of race condition related to switching to the nop image? (Or that error is a red herring, and it is just what happens when we kill the sidecar.)
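
For anyone reading along: sidecars get stopped by editing the running Pod so the sidecar container's image points at the nop image; the kubelet then restarts the container with an image that exits immediately (the TaskRun status later in this thread even says "Sidecar container successfully stopped by nop image"). The snippet below is a rough, illustrative sketch of that approach, not Tekton's actual implementation; the function name, the isSidecar callback, and the nopImage parameter are made up for the example.

    package sketch

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // stopSidecars is an illustrative sketch, not Tekton's real helper: it swaps
    // each sidecar container's image for the nop image and updates the Pod. The
    // kubelet sees the definition change ("Container sidecar-sidecar-container
    // definition changed, will be restarted" in the events above) and restarts
    // the container. Because the container's original command (e.g. "sh") is kept
    // while the nop image has no shell, that restart can fail with the
    // exec: "sh": executable file not found in $PATH error quoted above.
    func stopSidecars(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, nopImage string, isSidecar func(containerName string) bool) error {
        changed := false
        for i := range pod.Spec.Containers {
            if isSidecar(pod.Spec.Containers[i].Name) && pod.Spec.Containers[i].Image != nopImage {
                pod.Spec.Containers[i].Image = nopImage
                changed = true
            }
        }
        if !changed {
            return nil
        }
        _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
        return err
    }

Whether that restart path is actually the source of the flake, or just noise, is what the rest of this thread tries to pin down.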

@bobcatfish bobcatfish assigned bobcatfish and unassigned bobcatfish May 20, 2020
@bobcatfish bobcatfish added the kind/bug Categorizes issue or PR as related to a bug. label May 20, 2020
@jlpettersson (Member)

Also #1819

@GregDritschler (Contributor) commented May 28, 2020

I looked at some build logs where this test failed and noticed that the sidecar test pod has a status of RunContainerError:

arendelle-lvq6p    pod/sidecar-test-task-run-0-pod-knt5v                                0/2     RunContainerError    1          30m

In logs where the test passed, the pod has a status of Error. The difference is that RunContainerError appears to be reported when the container status is waiting rather than terminated.

There isn't a lot of information out there about these different reasons and why you might get one vs another. I tried looking at the Kubernetes code but I'm not familiar enough with it to understand what might cause one vs the other.

I've not been able to reproduce this problem myself after running the test on my own clusters dozens of times. When I run it on my IBM cluster, the pod has status StartError instead of Error, but the container status is still terminated. When I run it on my GKE cluster, the pod has status ContainerCannotRun, but the container status is still terminated.

If someone is able to reproduce it, it would be interesting to display the pod and confirm that the sidecar container is in the waiting state, and then to wait and see whether Kubernetes does anything with the pod or it just sits there like that.

P.S. The message

Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

appears when the test passes as well. Somehow Kubernetes is reacting differently to it sometimes.
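
To make that inspection concrete, here is a small client-go sketch (illustrative only; the clientset c, namespace ns, and pod name are assumed to come from wherever you can reach the failing cluster) that prints whether each container in the pod is Waiting, Terminated, or still Running, along with the reported reason:

    package sketch

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // dumpContainerStates prints the state of every container in the pod. In the
    // flaky runs described above, the sidecar container would be expected to show
    // up as Waiting (reason RunContainerError) instead of Terminated.
    func dumpContainerStates(ctx context.Context, c kubernetes.Interface, ns, podName string) error {
        pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, cs := range pod.Status.ContainerStatuses {
            switch {
            case cs.State.Terminated != nil:
                fmt.Printf("%s: terminated (reason=%s, exitCode=%d)\n", cs.Name, cs.State.Terminated.Reason, cs.State.Terminated.ExitCode)
            case cs.State.Waiting != nil:
                fmt.Printf("%s: waiting (reason=%s, message=%s)\n", cs.Name, cs.State.Waiting.Reason, cs.State.Waiting.Message)
            default:
                fmt.Printf("%s: running\n", cs.Name)
            }
        }
        return nil
    }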

@vincent-pli (Member)

Error: failed to start container "sidecar-sidecar-container": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

This error will always be there; running any command with tianon/true causes a similar exception, since there is nothing in that image: no $PATH, no filesystem.

@vdemeester (Member)

@bobcatfish is this still an issue?
/kind flake

@tekton-robot tekton-robot added the kind/flake Categorizes issue or PR as related to a flakey test label Jul 24, 2020
@bobcatfish (Collaborator, Author)

I'm not sure; let's close it and we can re-open if/when it happens again.

@jerop (Member) commented Sep 2, 2020

The test flaked again in #3135 -- https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/3135/pull-tekton-pipeline-integration-tests/1301144752749547520/

/reopen

@tekton-robot tekton-robot reopened this Sep 2, 2020
@tekton-robot (Collaborator)

@jerop: Reopened this issue.


@popcor255 (Member) commented Sep 24, 2020

@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2020
@tekton-robot (Collaborator)

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2021
@tekton-robot (Collaborator)

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot (Collaborator)

@tekton-robot: Closing this issue.


@ghost commented Nov 18, 2021

/reopen

There was an instance of this failing in #4380

@tekton-robot tekton-robot reopened this Nov 18, 2021
@tekton-robot (Collaborator)

@sbwsg: Reopened this issue.


@ghost commented Nov 18, 2021

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 18, 2021
@ghost commented Nov 18, 2021

Copy/pasting my notes from #4380:

Here's a permanent link to the failed test run output.

Failed Test: test/v1alpha1 TestSidecarTaskSupport/A_sidecar_that_runs_forever_is_terminated_when_Steps_complete

Ran for 10m12s according to the test runner. The TaskRun under test has its timeout field set to 1m, so that's quite a disparity.

The test runner reported the following:

    sidecar_test.go:100: Creating Task "sidecar-test-task-0"
    sidecar_test.go:105: Creating TaskRun "sidecar-test-task-run-0"
    sidecar_test.go:129: Error waiting for Pod "sidecar-test-task-run-0-pod" to terminate both the primary and sidecar containers: timed out waiting for the condition
    sidecar_test.go:150: Sidecar container has a nil Terminated status but non-nil is expected.
    sidecar_test.go:158: Either the primary or sidecar containers did not terminate

Both the sidecar and steps are reported as successful in the captured TaskRun, and both have a non-nil terminated field:

        status:
          completionTime: "2021-11-17T21:34:31Z"
          conditions:
          - lastTransitionTime: "2021-11-17T21:34:31Z"
            message: All Steps have completed executing
            reason: Succeeded
            status: "True"
            type: Succeeded
          podName: sidecar-test-task-run-0-pod
          sidecars:
          - container: sidecar-sidecar-container
            imageID: docker-pullable://gcr.io/tekton-prow-6/tpipeline-e2e-img/nop-8eac7c133edad5df719dc37b36b62482@sha256:acb6b4cd4957beee021d31a3ac1d533ecd4f3f35085a8881068a57e63953a14d
            name: sidecar-container
            terminated:
              containerID: docker://5e9c5e88e389caac2efad56790b50fb1bcfe3c575ec96fe68b10b33005e4fde7
              exitCode: 0
              finishedAt: null
              message: Sidecar container successfully stopped by nop image
              reason: Completed
              startedAt: null
          startTime: "2021-11-17T21:34:22Z"
          steps:
          - container: step-primary
            imageID: docker-pullable://busybox@sha256:e7157b6d7ebbe2cce5eaa8cfe8aa4fa82d173999b9f90a9ec42e57323546c353
            name: primary
            terminated:
              containerID: docker://55fb66dd40f5d695ea7378f1ed51d5eb895e37fe9cc117e3868f18ed6951ef61
              exitCode: 0
              finishedAt: "2021-11-17T21:34:30Z"
              reason: Completed
              startedAt: "2021-11-17T21:34:30Z"

The creation timestamp on the TaskRun is 2021-11-17T21:34:20Z, so it looks like it took about 10 seconds from creation to termination, at least as far as the cluster under test was concerned.
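
For reference, the failing check here (sidecar_test.go waiting for both the primary and sidecar containers to terminate) is essentially a poll against the pod's container statuses. A rough sketch of that kind of wait, assuming a client-go clientset; this is not the actual test helper:

    package sketch

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForAllContainersTerminated polls the pod until every container status
    // reports a non-nil Terminated state, or the timeout expires with the
    // "timed out waiting for the condition" error seen in the test output above.
    func waitForAllContainersTerminated(ctx context.Context, c kubernetes.Interface, ns, podName string, timeout time.Duration) error {
        return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
            pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            if len(pod.Status.ContainerStatuses) == 0 {
                return false, nil
            }
            for _, cs := range pod.Status.ContainerStatuses {
                if cs.State.Terminated == nil {
                    return false, nil // keep polling until every container has terminated
                }
            }
            return true, nil
        })
    }

Given that the TaskRun above shows both containers terminated on the cluster within about 10 seconds, the puzzle is why a poll like this never observed that state before the ten-minute timeout, which is what the next comment digs into.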

@ghost commented Nov 18, 2021

I looked up the logs from the test runner and the logs from the temporary Boskos cluster. The Boskos cluster logs are filtered down to the exact namespace where Prow reported the failed TaskRuns executing. The timestamps of the log messages from Prow are fully ten minutes ahead of the log messages from the Pod in the temporary Boskos cluster.

Prow reports test RUN at: 2021-11-17 16:44:32.717 EST
Prow reports test FAIL at: 2021-11-17 16:44:32.725 EST

TaskRun Pod first log at: 2021-11-17 16:34:25.314 EST
TaskRun Pod last log at: 2021-11-17 16:34:31.031 EST

I'm not familiar enough with log collection in GCP or our Prow cluster to definitively say that's abnormal, but it strikes me as really weird!

Also weird: Prow reports the RUN and FAIL within 8 milliseconds of each other despite reporting a 10m12s timeout in the build-log.txt!

Theory: 10m12s is the exact delta between the Prow log line reporting FAIL and the creationTimestamp on the TaskRun. I wonder if our test compares the TaskRun's metadata.creationTimestamp against time.Now() and fails the test immediately.

Edit: I looked into this ^ briefly, but our test code uses polling, which doesn't involve comparing against any timestamps in the created resources. At least on first look, this doesn't appear to be a likely culprit.

@jerop jerop mentioned this issue Feb 4, 2022
@jerop jerop added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2022
@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 25, 2022
@abayer (Contributor) commented Jul 14, 2022

This probably has the same root cause as #4169.

@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022
@tekton-robot (Collaborator)

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 11, 2022
@tekton-robot (Collaborator)

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot (Collaborator)

@tekton-robot: Closing this issue.

