[CI] Add automatic test group re-running for flaky tests #44927

Closed
brianseeders opened this issue Sep 5, 2019 · 5 comments · Fixed by #53961
Labels
Feature:CI (Continuous integration) · Team:Operations (Team label for Operations Team)

Comments

@brianseeders
Contributor

Once #44925 is finished, it will be pretty easy to add automatic re-running of individual ciGroups to add some protection against flaky tests.

In subsequent work, we will likely be removing ciGroups altogether, in favor of running small test suites in a sort of queue-based fashion. In this case, re-running tests will mean re-running a small suite instead of a full ciGroup.
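
For illustration, a minimal sketch of what a retry-once wrapper around a test group could look like (the `runCiGroup` runner and the result shape here are hypothetical, not the actual CI code):

```ts
// Hypothetical sketch: run a ciGroup and retry it once before treating it as a real failure.
// `runCiGroup` is a made-up runner that resolves to true when the group passes.
async function runWithRetry(
  runCiGroup: (groupId: string) => Promise<boolean>,
  groupId: string
): Promise<{ success: boolean; flaky: boolean }> {
  if (await runCiGroup(groupId)) {
    return { success: true, flaky: false };
  }

  // First attempt failed: retry once. If the retry passes, the group was flaky.
  const retryPassed = await runCiGroup(groupId);
  return { success: retryPassed, flaky: retryPassed };
}
```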

Other considerations:

  • How should we make flaky tests visible / notify users about them? Should that be a separate issue?
  • Will we be re-running ciGroups so often (on PRs that have constantly breaking tests, for example) that it's not worth it?
@brianseeders added the Team:Operations and Feature:CI labels Sep 5, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-operations

@spalger
Contributor

spalger commented Sep 11, 2019

I think we should probably focus on #44934 first, so that retries are lower impact. Additionally, I think it's important that this effort also include reporting to ES what suites needed to be retried, and what the result of the retry was. We should be able to figure out from there what suites are flaky and assign work that way, hopefully eventually automating that too.
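
To make the reporting idea concrete, here is a minimal sketch of what indexing a retry record into ES might look like with the Elasticsearch JS client (the index name and document fields are placeholders, not an agreed-on schema):

```ts
import { Client } from '@elastic/elasticsearch';

// Placeholder connection details.
const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical shape for a single retried suite.
interface SuiteRetryRecord {
  suite: string;
  branch: string;
  buildUrl: string;
  retryPassed: boolean;
  timestamp: string;
}

async function reportSuiteRetry(record: SuiteRetryRecord) {
  // 'ci-suite-retries' is an illustrative index name.
  await es.index({ index: 'ci-suite-retries', body: record });
}
```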

@brianseeders
Contributor Author

brianseeders commented Jan 6, 2020

There's a draft PR where I've implemented something that works here: #53961

There are minimal changes to reporting / alerting / GitHub checks / etc, and it probably works close to well enough for a trial run.

As currently implemented:

  1. The first time a ciGroup fails, it is retried. If the second attempt succeeds, the build is marked as unstable. If the second attempt also fails, it's a normal build failure.
  2. If a second failure happens anywhere else in the build (e.g. another test fails one time), the whole build is marked as a failure. This is mostly to protect against a scenario where every single ciGroup retries once because something broke every suite. The number of flaky failures the build will tolerate is configurable; it's just currently set to 1 (see the sketch after this list).
  3. The GitHub check for a flaky ciGroup turns red after the first run, but goes back to green after the second run (which would mean the PR is mergeable with a flaky failure).
  4. The GitHub PR comment is yellow and indicates that the build was successful with flakiness.
  5. There are unique test results in Jenkins for the failure of a test and the success.
  6. This means that, in the worst case, overall build time could be increased by roughly the duration of the longest x-pack ciGroup, but most ciGroup retries shouldn't really affect much.
  7. kibana-build would get e-mails about unstable builds on tracked branches
  8. GitHub issues will get created for test failures on tracked branches, even if the test succeeds the second time
  9. It's essentially up to a PR author to check and make sure they didn't introduce new flakiness into the build. It wouldn't be any different, however, than merge upstream + merge on green, which I'm guessing happens now sometimes.
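
As a rough sketch of the tolerance logic described in points 1 and 2 (the names and result shape here are mine, not the actual code from #53961):

```ts
// Hypothetical sketch of the flaky-failure tolerance described above.
const MAX_FLAKY_FAILURES = 1; // configurable; the draft currently tolerates one flaky group

type BuildStatus = 'success' | 'unstable' | 'failure';

interface GroupResult {
  success: boolean; // did the group eventually pass?
  flaky: boolean; // did it need a retry to pass?
}

function resolveBuildStatus(results: GroupResult[]): BuildStatus {
  const hardFailures = results.filter((r) => !r.success).length;
  const flakyFailures = results.filter((r) => r.success && r.flaky).length;

  // A group that failed twice, or more flaky groups than we tolerate, fails the whole build.
  if (hardFailures > 0 || flakyFailures > MAX_FLAKY_FAILURES) {
    return 'failure';
  }

  // Everything passed, but at least one group needed a retry.
  return flakyFailures > 0 ? 'unstable' : 'success';
}
```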

To combat the potentially longer build times, we could (not sure how I feel about this part yet):

  1. Turn off GitHub webhook triggers for 7.x/master (webhooks never really trigger builds, because the timer fires before the current build is finished)
  2. Allow 7.x/master builds to run concurrently

Which would mean that:

  1. We would get ~24 builds per day instead of ~18 for those branches
  2. There will be some build overlap, but they should always finish in order unless a build exits very early (like during bootstrap)

Any thoughts on any of this?

@tylersmalley
Contributor

This is great. Really impressed by how feature complete this already feels.

I agree that the GitHub triggers are more or less providing continuous builds during the U.S. workday. How busy is CI on the tracked branches during the rest of the time? If it's pretty much the same, I like the idea of moving to an interval while allowing concurrency to provide more consistency.

@brianseeders
Contributor Author

Let me clarify: I'm saying that, at least at the moment, GitHub triggers aren't really doing anything for master and 7.x.

master and 7.x are set to build every hour without concurrent builds. The builds take about 1h20m+, which means the timer fires for the next build before the current build is complete. This means we're continuously building all day long. So, turning off GitHub triggers wouldn't affect when builds happen, because we always start a new build as soon as the previous one finishes.

So, if we turned them off and allowed concurrent timer builds, we'd actually get more builds per day, even if some of the builds get longer because of flaky ciGroups retrying.
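
To spell out the arithmetic behind those numbers: with sequential builds taking ~1h20m each, we fit roughly 24h ÷ 1.33h ≈ 18 builds per day; with an hourly timer and concurrent builds allowed, the timer fires 24 times a day, so we'd get ~24 builds per day even though each individual build still takes over an hour (and occasionally longer when a ciGroup retries).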
