[CI] Add automatic test group re-running for flaky tests #44927

Closed
brianseeders opened this issue Sep 5, 2019 · 5 comments · Fixed by #53961
Labels
Feature:CI (Continuous integration) · Team:Operations (Team label for Operations Team)

Comments

@brianseeders
Contributor

Once #44925 is finished, it will be pretty easy to add automatic re-running of individual ciGroups to add some protection against flaky tests.

In subsequent work, we will likely be removing ciGroups altogether, in favor of running small test suites in a sort of queue-based fashion. In this case, re-running tests will mean re-running a small suite instead of a full ciGroup.
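
For illustration, a minimal sketch of what a retry-once wrapper around a test group could look like (the `runCiGroup` runner and the result shape here are hypothetical, not the actual CI code):

```ts
// Hypothetical sketch: run a ciGroup and retry it once before treating it as a real failure.
// `runCiGroup` is a made-up runner that resolves to true when the group passes.
async function runWithRetry(
  runCiGroup: (groupId: string) => Promise<boolean>,
  groupId: string
): Promise<{ success: boolean; flaky: boolean }> {
  if (await runCiGroup(groupId)) {
    return { success: true, flaky: false };
  }

  // First attempt failed: retry once. If the retry passes, the group was flaky.
  const retryPassed = await runCiGroup(groupId);
  return { success: retryPassed, flaky: retryPassed };
}
```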

Other considerations:

  • How should we make flaky tests visible / notify users about them? Should that be a separate issue?
  • Will we be re-running ciGroups so often (on PRs that have constantly breaking tests, for example) that it's not worth it?
@brianseeders added the Team:Operations and Feature:CI labels Sep 5, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-operations

@spalger
Contributor

spalger commented Sep 11, 2019

I think we should probably focus on #44934 first, so that retries are lower impact. Additionally, I think it's important that this effort also include reporting to ES what suites needed to be retried, and what the result of the retry was. We should be able to figure out from there what suites are flaky and assign work that way, hopefully eventually automating that too.
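
To make the reporting idea concrete, here is a minimal sketch of what indexing a retry record into ES might look like with the Elasticsearch JS client (the index name and document fields are placeholders, not an agreed-on schema):

```ts
import { Client } from '@elastic/elasticsearch';

// Placeholder connection details.
const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical shape for a single retried suite.
interface SuiteRetryRecord {
  suite: string;
  branch: string;
  buildUrl: string;
  retryPassed: boolean;
  timestamp: string;
}

async function reportSuiteRetry(record: SuiteRetryRecord) {
  // 'ci-suite-retries' is an illustrative index name.
  await es.index({ index: 'ci-suite-retries', body: record });
}
```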

@brianseeders
Contributor Author

brianseeders commented Jan 6, 2020

There's a draft PR where I've implemented something that works here: #53961

There are minimal changes to reporting / alerting / GitHub checks / etc, and it probably works close to well enough for a trial run.

As currently implemented:

  1. The first time a ciGroup fails, it is retried. If the second attempt succeeds, the build is marked as unstable. If the second attempt also fails, it's a normal build failure.
  2. If a second failure happens anywhere else in the build (e.g. another test fails one time), the whole build is marked as a failure. This is mostly to protect against a scenario where every single ciGroup retries once because something broke every suite. The number of flaky failures the build will tolerate is configurable; it's just currently set to 1 (see the sketch after this list).
  3. The GitHub check for a flaky ciGroup turns red after the first run, but goes back to green after the second run (which would mean the PR is mergeable with a flaky failure).
  4. The GitHub PR comment is yellow and indicates that the build was successful with flakiness.
  5. There are unique test results in Jenkins for the failure of a test and the success.
  6. This means that, in the worst case, overall build time could be increased by roughly the duration of the longest x-pack ciGroup, but most ciGroup retries shouldn't really affect much.
  7. kibana-build would get e-mails about unstable builds on tracked branches
  8. GitHub issues will get created for test failures on tracked branches, even if the test succeeds the second time
  9. It's essentially up to a PR author to check and make sure they didn't introduce new flakiness into the build. It wouldn't be any different, however, than merge upstream + merge on green, which I'm guessing happens now sometimes.
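
As a rough sketch of the tolerance logic described in points 1 and 2 (the names and result shape here are mine, not the actual code from #53961):

```ts
// Hypothetical sketch of the flaky-failure tolerance described above.
const MAX_FLAKY_FAILURES = 1; // configurable; the draft currently tolerates one flaky group

type BuildStatus = 'success' | 'unstable' | 'failure';

interface GroupResult {
  success: boolean; // did the group eventually pass?
  flaky: boolean; // did it need a retry to pass?
}

function resolveBuildStatus(results: GroupResult[]): BuildStatus {
  const hardFailures = results.filter((r) => !r.success).length;
  const flakyFailures = results.filter((r) => r.success && r.flaky).length;

  // A group that failed twice, or more flaky groups than we tolerate, fails the whole build.
  if (hardFailures > 0 || flakyFailures > MAX_FLAKY_FAILURES) {
    return 'failure';
  }

  // Everything passed, but at least one group needed a retry.
  return flakyFailures > 0 ? 'unstable' : 'success';
}
```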

To combat the potentially longer build times, we could (not sure how I feel about this part yet):

  1. Turn off GitHub webhook triggers for 7.x/master (webhooks never really trigger builds, because the timer fires before the current build is finished)
  2. Allow 7.x/master builds to run concurrently

Which would mean that:

  1. We would get ~24 builds per day instead of ~18 for those branches
  2. There will be some build overlap, but they should always finish in order unless a build exits very early (like during bootstrap)

Any thoughts on any of this?

@tylersmalley
Contributor

This is great. Really impressed by how feature complete this already feels.

I agree that the GitHub triggers are more or less providing continuous builds during the U.S. workday. How busy is CI on the tracked branches during the rest of the time? If it's pretty much the same, I like the idea of moving to an interval while allowing concurrency to provide more consistency.

@brianseeders
Contributor Author

Let me clarify: I'm saying that, at least at the moment, GitHub triggers aren't really doing anything for master and 7.x.

master and 7.x are set to build every hour without concurrent builds. The builds take about 1h20m+, which means the timer fires for the next build before the current build is complete. This means we're continuously building all day long. So, turning off GitHub triggers wouldn't affect when builds happen, because we always start a new build as soon as the previous one finishes.

So, if we turned them off and allowed concurrent timer builds, we'd actually get more builds per day, even if some of the builds get longer because of flaky ciGroups retrying.
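
To spell out the arithmetic behind those numbers: with sequential builds taking ~1h20m each, we fit roughly 24h ÷ 1.33h ≈ 18 builds per day; with an hourly timer and concurrent builds allowed, the timer fires 24 times a day, so we'd get ~24 builds per day even though each individual build still takes over an hour (and occasionally longer when a ciGroup retries).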
