
Change from individual CDash error emails to daily summary emails for the ATDM Trilinos builds (and perhaps other efforts) #2933

Closed
bartlettroscoe opened this issue Jun 13, 2018 · 19 comments
Labels: ATDM DevOps, client: ATDM, stage: in progress, type: enhancement

Comments

bartlettroscoe (Member) commented on Jun 13, 2018

CC: @fryeguy52, @trilinos/framework, @dridzal

Description

After having to triage the promoted ATDM Trilinos builds for a couple of months now, and from extensive experience on other projects like CASL VERA, I have come to the realization that relying on CDash error emails is not a very effective notification and monitoring scheme in many of these situations. The reasons that CDash error emails are not effective for keeping on top of a lot of builds are that:

  1. It is hard to tell if a failing test is new that day, has been failing for multiple days, or is also failing across several builds. (All you get is a single email telling you that there is a failure for that one build.)

  2. When a failure does occur that results in a CDash error email, there is an urgency to address the problem ASAP (by either fixing, disabling, or reverting commits) in order to make the CDash error email go away. Otherwise, repeated CDash error emails day after day make people accustomed to seeing them, and new failures therefore get ignored (many people will create email filters and just ignore the emails from that point on).

  3. Catastrophic failures due to system issues can occur that result in a huge number of CDash error emails that can spam people (sometimes a Trilinos developer can get a dozen or more emails since they are on several different package regression lists). This can occur for many reasons, like the disk filling up, the Intel license server going down, or a module not loading correctly. The huge glut of CDash error emails that can occur in these cases can obscure new real failures and can cause some people to add email filters (which then makes the CDash error emails worthless).

Instead of relying on individual CDash error emails, we could move to a notification scheme that created a single email each day that summarized the builds and tests and gave some information about the history of failing tests. Such a system could solve all of the problems listed above and make top-level triaging and monitoring of a bunch of related builds much easier.

(NOTE: CDash error notification emails really are the best solution for a small number of post-push CI builds that you expect to fail only rarely and for which you need a notification ASAP. For nightly builds, they are not effective for the reasons described above.)

Possible Solution

It seems that a straightforward solution would be to write a Python script that extracts data off of the CDash site using multiple queries against the API interface that returns data as JSON data structures. The Python script would analyze the data and create an HTML-formatted email with useful summary information and CDash URL links.
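As a rough illustration of the kind of query this script would perform, a minimal sketch in Python might look like the following (this assumes the CDash instance serves its pages under an 'api/v1/' prefix that returns JSON; the function name and the example URL fields are placeholders, not part of any specification):

  # Minimal sketch of pulling JSON data off of CDash (assumes the CDash site
  # serves 'api/v1/<page>' as JSON; URL fields below are only illustrative).
  import json
  import urllib.request

  def get_cdash_json(cdash_site_url, page, url_fields):
      """Fetch one CDash page (e.g. 'queryTests.php' or 'index.php') as JSON."""
      url = cdash_site_url.rstrip('/') + '/api/v1/' + page + '?' + url_fields
      with urllib.request.urlopen(url) as response:
          return json.loads(response.read().decode('utf-8'))

  # Hypothetical usage:
  # testsJson = get_cdash_json("https://testing-vm.sandia.gov/cdash",
  #     "queryTests.php",
  #     "project=Trilinos&date=2018-06-13&filtercount=1&field1=groupname&value1=ATDM")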

The full specification is given at:

The input that one would provide to the Python script would be (a possible in-code grouping of these inputs is sketched just after this list):

  • Name of the set of builds being analyzed (e.g. "ATDM Trilinos Builds")

  • Base CDash site (e.g. "https://testing-vm.sandia.gov/cdash/")

  • CDash project name (e.g. "project=Trilinos")

  • Current testing day (e.g. "YYYY-MM-DD")

  • CDash query URL fields (minus date, project, etc.) for queryTests.php to determine tests to be examined

  • CDash query URL fields (minus date, project, etc.) for index.php for the list of builds to be examined

  • List of expected builds in the triplet ('site', 'build-name', 'group')
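One possible way to group those inputs in code, purely as an illustration (all names, URL field strings, and build triplets below are placeholders):

  # Illustrative grouping of the script inputs listed above (names and values
  # are placeholders and not the tool's actual option or field names).
  atdm_builds_config = {
      "buildSetName": "ATDM Trilinos Builds",
      "cdashSiteUrl": "https://testing-vm.sandia.gov/cdash/",
      "cdashProjectName": "Trilinos",
      "date": "2018-06-13",  # current testing day YYYY-MM-DD
      "cdashQueryTestsUrlFields": "filtercount=1&field1=groupname&compare1=61&value1=ATDM",
      "cdashIndexUrlFields": "filtercount=1&field1=groupname&compare1=61&value1=ATDM",
      "expectedBuilds": [
          # ('site', 'build-name', 'group'); entries below are made up
          ("some-site", "Trilinos-atdm-some-build-gnu-opt-openmp", "ATDM"),
          ("another-site", "Trilinos-atdm-another-build-cuda-debug", "ATDM"),
      ],
  }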

Given this data, the Python script would run queries and extract data off of the queryTests.php page for the current day and the previous two testing days (using the date=YYYY-MM-DD URL field) and then display that data as described below (sorted into various lists).
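A sketch of that three-day queryTests.php pull, reusing the hypothetical get_cdash_json() helper from the earlier sketch (the 'date' field name is an assumption):

  # Pull queryTests.php data for today, yesterday, and two days ago by
  # swapping the date= URL field (field name assumed; uses the hypothetical
  # get_cdash_json() helper sketched earlier).
  import datetime

  def get_three_days_of_tests(cdash_site_url, base_url_fields, testing_day_str):
      day = datetime.datetime.strptime(testing_day_str, "%Y-%m-%d").date()
      tests_by_day = []
      for days_back in range(3):  # 0 = today, 1 = yesterday, 2 = two days ago
          date_str = (day - datetime.timedelta(days=days_back)).isoformat()
          url_fields = base_url_fields + "&date=" + date_str
          tests_by_day.append(
              get_cdash_json(cdash_site_url, "queryTests.php", url_fields))
      return tests_by_day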

The Python script would then run the query on the index.php page and would note the builds that had any configure, build or test failures (including "not run" tests) and it would compare the list of builds extracted to the input list of expected builds and then note the expected builds that did not show up.
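Checking the extracted builds against the expected-builds list could be as simple as the following sketch (the dict keys 'site', 'buildname', and 'group' are assumptions about the shape of the index.php JSON data):

  # Flag expected builds that never showed up on CDash for the testing day.
  # The keys used on the CDash build dicts are assumptions.
  def find_missing_expected_builds(expected_builds, cdash_builds):
      found = {(b["site"], b["buildname"], b["group"]) for b in cdash_builds}
      return [eb for eb in expected_builds if eb not in found]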

Then the Python script would construct an HTML-formatted email with the body having the following data (a sketch of the consecutive-day test bucketing is shown just after this list):

  • (limited) List of tests that failed today but not the previous day (t1=??? in summary line)

  • (limited) List of tests that failed today and the previous day but not the day before that (t2=??? in summary line)

  • (limited) List of tests that failed today and the previous two consecutive days (t3+=??? in summary line)

  • Total number of "not-run" (non-disabled) tests for current testing day and CDash URL to that list (tnr=??? in summary line)

  • List of current-day builds that had any configure, build, or test failures (including "not run" tests) (b=??? is the sum of the build failures in those builds shown in summary line)

  • List of missing expected builds or builds that exist and pass the configure but don't have test results (meb=??? in summary) (NOTE: The current CDash implementation will only alert about missing expected builds but it will not alert about builds with missing tests.)

  • Total number of builds run and URL to the list of builds.

  • Total number of failing tests for the current testing day and the CDash URL

  • URL(s) to the list of all failing tests for the current day (but excluding "not run" tests)
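The t1/t2/t3+ bucketing described in the first three items above could be done with simple set logic over the three days of test data, for example (keying each failing test by a (site, build-name, test-name) tuple is an assumption):

  # Bucket failing tests by how many consecutive testing days they have failed.
  # Each argument is a set of (site, build-name, test-name) keys for the tests
  # that failed on that day (keying scheme assumed).
  def bucket_failing_tests(failed_today, failed_yesterday, failed_two_days_ago):
      t1 = failed_today - failed_yesterday                             # new today
      t2 = (failed_today & failed_yesterday) - failed_two_days_ago     # 2 days in a row
      t3_plus = failed_today & failed_yesterday & failed_two_days_ago  # 3+ days in a row
      return t1, t2, t3_plus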

The summary line for the email could be something like:

FAILED (t1=2, t2=1, t3+=5, tnr=18, b=3, meb=1): ATDM Trilinos Builds 

That email summary message would look similar to the ones that CDash sends out, and one could see just in the summary line how many tests newly failed in the current testing day (i.e. t1=2), how many tests failed in the last two consecutive days (i.e. t2=1), and how many tests failed in the last three or more consecutive days (i.e. t3+=5). It would also show if there were any build failures (i.e. b=3) and how many tests were not run (tnr=18). Lastly, it would show if there were any missing expected builds (meb=1).
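Constructing that summary line from the counts would be straightforward; a sketch (the rule that any nonzero count means FAILED is an assumption about how global pass/fail would be defined):

  # Build the email summary line from the counts (pass/fail rule assumed).
  def summary_line(build_set_name, t1, t2, t3p, tnr, b, meb):
      status = "FAILED" if any([t1, t2, t3p, tnr, b, meb]) else "PASSED"
      return "{0} (t1={1}, t2={2}, t3+={3}, tnr={4}, b={5}, meb={6}): {7}".format(
          status, t1, t2, t3p, tnr, b, meb, build_set_name)

  # summary_line("ATDM Trilinos Builds", 2, 1, 5, 18, 3, 1)
  #   -> 'FAILED (t1=2, t2=1, t3+=5, tnr=18, b=3, meb=1): ATDM Trilinos Builds'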

For the ATDM Trilinos builds, we could run this script on a cron job or a Jenkins job after 12 midnight MT or so (or wait until 5 am to allow all of the jobs to finish).

Other data we might consider reporting on and showing are:

  • Number of, URL to, and (limited) list of newly passing tests for the current testing day that failed the previous day (or the last day that the matching builds had any test results) (tnp=??? in summary line)

  • Number of, URL to, and (limited) list of newly missing tests compared to yesterday (but only if the build ran the current day and the tests ran for that build and likewise for the previous day) (tnm=??? in summary line)

The above two bits of data would really help in determining that failing tests got resolved (either by fixing them or temporarily disabling them).

And since you would only get one email each day, I think it would be good to send out the email even when everything passes, with the summary line:

PASSED (tnp=2, tnm=1): ATDM Trilinos Builds 

and that email would contain links to the set of 100% passing builds!

That is an email that even a manager might want to get :-)

This script could also allow you to specify a set of "expected may fail" tests, which would be provided in an array with the four fields [<test-name>, <build-name>, <site-name>, <github-issue-link>], and any failing tests that matched these criteria would be listed in their own sublist in the email and could be given tef=??? in the summary line. These failing tests would not be counted against global pass/fail when they fail, but if they go from failing to passing, that would be listed along with the other "newly passing tests" (e.g. tnp). A better way to handle this would be to have CTest/CDash mark such tests as EXPECTED_MAY_FAIL as described in this CTest/CDash backlog item; then this script would automatically handle these tests differently without having to provide a separate list to the script. However, allowing someone to label a certain test as "expected may fail" specifically in this script would allow different customers to handle the same test differently. For example, one customer might consider a failing MueLu test a show stopper that affects global pass/fail while another may not, and would therefore want to handle it as "expected may fail" with no effect on global pass/fail. You can't do that with a single CTest/CDash property for each test. But without direct CTest/CDash support, the email body would list the failing test along with the <github-issue-link> so one could immediately go to that issue to see how that failing test is being addressed.
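A sketch of how the script might split out failing tests that match such an "expected may fail" list (the field order comes from the entry format above; the dict keys on the failing-test records are assumptions):

  # Split failing tests into "expected may fail" (listed separately, tef=...)
  # and unexpected failures (count toward global pass/fail).
  def split_expected_may_fail(failing_tests, expected_may_fail_list):
      # expected_may_fail_list entries: [test-name, build-name, site-name, issue-link]
      emf_keys = {(e[0], e[1], e[2]): e[3] for e in expected_may_fail_list}
      expected, unexpected = [], []
      for test in failing_tests:  # each test: dict with 'testname', 'buildname', 'site'
          key = (test["testname"], test["buildname"], test["site"])
          if key in emf_keys:
              expected.append({**test, "issue": emf_keys[key]})
          else:
              unexpected.append(test)
      return expected, unexpected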

Even for known failing tests that we do not want to mark as "expected may fail" (and therefore still want to count against global pass/fail), it would be nice to mark them with the GitHub issue ID if the failure is known and is being tracked. This could be done by passing in an array of "known failing" tests with entries [<test-name>, <build-name>, <site-name>, <github-issue-link>]. This would be useful when looking at the summary email to know whether or not we need to triage those tests. (That is, if one sees failing tests that have failed for more than one consecutive day and that don't have a GitHub issue associated with them, then that would be the trigger to triage the failure, create a new Trilinos GitHub issue, and then add the tests to the "expected may fail" or "known failing" lists.)

The script could also allow you to specify some "flaky" or "unstable" builds as an array of [<build-name>, <site-name>] entries where we expect random test failures. If a test failed in one of these "flaky" or "unstable" builds, then it would be reported in a separate section of the email and would not count toward the global pass/fail. Currently (as of 7/14/2018) we would put all of the ATDM Trilinos builds on 'ride' (see #2511) and the builds on 'mutrino' (see TRIL-214) in this category. That way, we could keep track of these builds in case something big went wrong, but they would not count toward global pass/fail (and therefore would not disrupt automated processes that update Trilinos between branches and application customers). But if more than a small number of test failures occurred (e.g. 4 tests per build), then that could impact global pass/fail. This would keep a new catastrophic failure on one of these platforms from allowing an update of Trilinos to go out to an ATDM APP, for example.
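And a corresponding sketch for the "flaky"/"unstable" builds idea, using the example threshold of 4 failing tests per build mentioned above (the counting rule and dict keys are assumptions):

  # Decide whether failures on "flaky"/"unstable" builds should still affect
  # global pass/fail (only when a flaky build exceeds a small threshold).
  def flaky_build_failures_affect_pass_fail(failing_tests, flaky_builds,
                                            max_fails_per_build=4):
      flaky = {tuple(fb) for fb in flaky_builds}  # {('build-name', 'site-name'), ...}
      counts = {}
      for test in failing_tests:  # each test: dict with 'buildname', 'site'
          key = (test["buildname"], test["site"])
          if key in flaky:
              counts[key] = counts.get(key, 0) + 1
      return any(n > max_fails_per_build for n in counts.values())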

Tasks:

  1. Get an initial script working that keeps track of failing tests with existing GitHub issue trackers, can detect new failing tests that need to be triaged, and has basic unit tests in place (see the "TODO.txt" file in the 'atdm-email' branch of the 'TrilinosATDMStatus' repo and the 'atdm-email' branch in TriBITS) ... PROGRESS ...
  2. Set up a mailman list and a Jenkins job to run the script and post the emails to the mailman list (and we can sign up for the mail list). (The mail list will also provide an archive of past results.) (There should be a different mail list for different types of results; e.g. one for the main "Promoted ATDM Trilinos Builds", a different one for "Specialized ATDM Trilinos Builds", etc.)
  3. Create documentation about the script somewhere and put in links to this documentation in the generated HTML emails somehow.
  4. Flesh out the script to cover all of the types of failures we need to keep track of.
  5. ???
bartlettroscoe added the 'type: enhancement', 'stage: ready', and 'client: ATDM' labels on Jun 13, 2018
bartlettroscoe (Member, Author) commented on Jun 13, 2018

Between a Python script that I wrote for CASL several years ago that extracts data from Trac and sends out nice HTML-formatted emails and another Python script I wrote for CASL that extracts and analyzes data off of CDash using the JSON API, I think I know exactly how to write this Python script. My guess is it would take 2-4 full 8-hour days to write, unit test, and deploy. This would be a huge improvement in how we manage the ATDM Trilinos builds and I suspect this would help other projects as well.

mhoemmen (Contributor) commented:

I like the summary e-mail idea :)

bartlettroscoe (Member, Author) commented on Jun 13, 2018

For ATDM Trilinos builds, it would be nice to set up a Mailman email list (e.g. 'atdm-trilinos-builds') and send these emails to that email list. Then people could sign up to get emails by registering for that email list. And we would have a nice searchable archive of emails for previous days. That would make it so easy to keep on top of these builds with less overhead. The more I think about this, the more excited I am to make this switch.

We just need to prioritize this and then get it done. I think we need to have this tool and process in place before we turn over the triaging of the ATDM Trilinos builds to others, since it will make it easier to remind them to address issues and make it clear what issues need to be addressed and when (e.g. when a test fails for two consecutive days on the same build on the same machine, you must create a Trilinos GitHub issue, and any tests that fall in the t2 list meet this criterion).

mhoemmen (Contributor) commented:

@bartlettroscoe Why not have the Python script automatically create the GitHub issue?

bartlettroscoe (Member, Author) commented on Jun 13, 2018

Why not have the Python script automatically create the GitHub issue?

@mhoemmen, are you being serious? The thing is that we need to weed out system issues like Intel compiler license server problems, the disk filling up, etc. We need a human to do top-level triage before we hand this off to Trilinos developers with a GitHub issue.

mhoemmen (Contributor) commented:

@bartlettroscoe Good point; never mind :-) .

bartlettroscoe (Member, Author) commented:

@mhoemmen,

The other reason that a human needs to create a Trilinos GitHub issue is that oftentimes you want a single issue that covers a bunch of failing tests. For example, after the recent Kokkos and KokkosKernels update, several Kokkos, KokkosKernels, and Panzer tests started failing, which I triaged and covered with the single issue #2827. What type of mess would have been created if an automated tool had created 11 different GitHub issues (one for each failing test)? That would not have been good.

Now, a tool that helps us create a GitHub issue by generating some tables and putting in some CDash URLs may not be a bad idea, and Kitware says that they have a tool they wrote for another customer that perhaps we could use. I have not looked at it, so I am not sure whether it would be too verbose or would not provide enough info to be useful. But it is worth looking into.

bartlettroscoe (Member, Author) commented on Jul 14, 2018

FYI: It occurred to me that this script could provide special handling for "expected to fail" tests and "flaky"/"unstable" builds as a superior way to handle known failing tests that we would otherwise disable, and builds that are known to have random system-related failures that result in test failures like we currently see on 'white'/'ride' due to bsub crashing (see TRIL-198) and on 'mutrino' (see TRIL-214). Adding these would make for a more robust global pass/fail to drive automated sync processes for the ATDM applications, but would still allow us to keep an eye on these tests and builds and avoid needing to constantly disable failing tests and then re-enable them once they are fixed.

bartlettroscoe (Member, Author) commented:

@fryeguy52, as we discussed, please take a shot at filling in a more complete specification at:

and play with the script some in the repo:

bartlettroscoe added the 'stage: in progress' label and removed the 'stage: ready' label on Aug 30, 2018
bartlettroscoe (Member, Author) commented:

@fryeguy52,

FYI: I spent a lot of time today updating the specification for this email at:

We still need to provide some more examples of the types of data to display in the emails, but I think this is getting closer to what we want. We can talk about this more once we have the current promoted ATDM Trilinos builds cleaned up and the builds on 'waterman' cleaned up and promoted to the "ATDM" group.

bartlettroscoe (Member, Author) commented:

We desperately need the beginnings of this script. For this first script, we will focus on just failing tests with and without issue trackers. So we will implement the use case "Failed with only failing tests with and without issue trackers", but we will skip the table columns "most recent failure" and "# fails last 4 weeks". And we will only provide inputs to the script that can produce that.

bartlettroscoe removed the 'stage: in progress' label on Sep 25, 2018
bartlettroscoe added the 'stage: in progress' label on Sep 25, 2018
bartlettroscoe (Member, Author) commented on Oct 9, 2018

@fryeguy52, it occurred to me that a few things that might speed up your development and testing of this Python script are (a rough argparse sketch follows the list):

  • Add argument --limit-test-history-days=<num_days> to allow speeding up queries.
  • Add an argument --cache-cdash-queries-to-dir=<dir> where the script could cache the data from the CDash queries that it performs. (That will make it easy to gather data to create unit tests and for other purposes.)
  • Add an argument --construct-from-cache to skip the CDash queries and just uses previously cached files. (That will aid in driving development and unit testing the entire script.)
  • Add argument --skip-send-email that will skip actually sending the email but instead will just print out the email it would have sent to STDOUT. (That will make driving development and unit testing the entire script easier.)
  • Add argument --print-email. (That will allow unit testing when --skip-send-email is set.)
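Purely as an illustration, wiring up those options with argparse might look like the following (the option names come from the list above; the help text, types, and defaults are guesses, not the script's actual interface):

  # Rough argparse wiring for the development/testing helper options above
  # (option names from the list; defaults, types, and help text are guesses).
  import argparse

  parser = argparse.ArgumentParser(description="CDash analyze-and-report sketch")
  parser.add_argument("--limit-test-history-days", type=int, default=30,
      help="Limit test history queries to this many days (speeds up queries)")
  parser.add_argument("--cache-cdash-queries-to-dir", default="",
      help="Cache raw CDash query data (JSON) to this directory")
  parser.add_argument("--construct-from-cache", action="store_true",
      help="Skip the CDash queries and build the report from cached files")
  parser.add_argument("--skip-send-email", action="store_true",
      help="Do not actually send the email")
  parser.add_argument("--print-email", action="store_true",
      help="Print the email that would be (or was) sent to STDOUT")
  options = parser.parse_args()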

See the new "TODO.txt" file now committed to the 'atdm-email' branch of the TrilinosATDMStatus repo that has the full up-to-date list.

bartlettroscoe (Member, Author) commented on Oct 12, 2018

@fryeguy52

It occurred to me that, with all of the lists handled as CSV files, we could make this script a command-line tool that is directly committed to the TriBITS/tribits/ci_support/ directory. Then the TrilinosATDMStatus repo would just contain the various CSV data files and some driver bash scripts for the emails that we want to send out. That way, it would be easy for Trilinos developers (or other users) to use this script to monitor their own CDash projects. And it would make it easy for us to create other summary emails like subsets of the "Specialized" builds.
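Reading one of those CSV data files, say an expected-builds list, could be as simple as the following sketch (the column names 'site', 'buildname', and 'group' are assumptions; the real tool's CSV format may differ):

  # Read an expected-builds CSV with assumed columns site,buildname,group.
  import csv

  def read_expected_builds_csv(csv_file_path):
      with open(csv_file_path, newline="") as csv_file:
          return [(row["site"], row["buildname"], row["group"])
                  for row in csv.DictReader(csv_file)]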

It would be good to get a pretty solid first cut of this script done by the end of next week so that we can demonstrate it at the TUG on Thursday 10/25/2018 at my "Trilinos Stability Impact on ATDM" talk shown at:

So I would like to help and see if we can work in parallel a little to get this done next week.

Let's talk about this on Monday 10/15/2018.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 14, 2018
…nos#2933)

See the justification in updated comments and documentation in this commit.
This change is being made so that the CDash query script being developed in
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Dec 5, 2018
This is needed in order to ensure that the Experimental builds don't have the
same site and build name as builds going to the other Nightly CDash groups
that ATDM Trilinos builds go to.  Simply using the real 'hostname' for the
CDash build site name is not enough because for 'mutrino' the 'hostname'
actually is 'mutrino'!  Therefore, we need to add '-exp' to account for this
on machines where the CDash site name is the same as the 'hostname'.
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
…nos#2933)

See the justification in updated comments and documentation in this commit.
This change is being made so that the CDash query script being developed in
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
…or cee-rhel6 intel builds (trilinos#3891)

There is currently a strange link error.  Only this one test executable is
impacted.

This test links and runs just fine in the other 'cee-rhel6' builds so
disabling it for now in these Intel builds is not so terrible and this can
still be fixed offline.  By disabling it for now, we remove a lot of red spam
from the CDash site and the ATDM Trilinos summary emails (see trilinos#2933).
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
This is needed in order to ensure that the Experimental builds don't have the
same site and build name as builds going to the other Nightly CDash groups
that ATDM Trilinos builds go to.  Simply using the real 'hostname' for the
CDash build site name is not enough because for 'mutrino' the 'hostname'
actually is 'mutrino'!  Therefore, we need to add '-exp' to account for this
on machines where the CDash site name is the same as the 'hostname'.
@bartlettroscoe bartlettroscoe self-assigned this Jan 25, 2019
bartlettroscoe pushed a commit to TriBITSPub/TriBITS that referenced this issue Mar 10, 2019
These are the squashed commits of the initial implementation of the CDash
query and reporting tool.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Mar 10, 2019
This is the squashed commit of the implementation on the 'cdash-email' branch
as of 3/10/2019.  We will now do development on the 'master' branch.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Mar 10, 2019
This and prior commits preserve the initial implementation for future
reference.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Mar 18, 2019
…nalyze_and_report (trilinos/Trilinos#2933)

This commit is just the name changes of the files, not fixing all the
references.  This is so that git will follow the renamed files correctly.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Mar 21, 2019
…e' (trilinos/Trilinos#2933)

For the testing day 2019-03-20 the SPARC Trilinos Integration script crashed
with the error:

  listOfDicts[1004]['time'] = '0.22' != listOfDicts[1002]['time'] = '0.23'

I reported this to Kitware but I don't know if they will fix it.  Therefore,
to make the tool robust against crazy behavior like this, I added code and unit
tests to allow a 0.1 relative error in the test 'time' to be considered the
same test.

This allows the SPARC Trilinos Integration script to run correctly on testing
day 2019-03-20.

Build/Test Cases Summary
Enabled Packages:
Enabled all Packages
0) MPI_DEBUG => passed: passed=342,notpassed=0 (1.32 min)
1) SERIAL_RELEASE => passed: passed=342,notpassed=0 (1.60 min)
bartlettroscoe (Member, Author) commented:

@dridzal, while we still have some more work to do on this CDash analysis and reporting Python tool (now called cdash_analyze_and_report.py), the basic functioning is pretty solid and is very useful (at least for staying on top of the ATDM Trilinos builds). And it has been merged to TriBITS 'master' and Trilinos 'develop'. Therefore, if you are interested in setting up to use this tool to monitor ROL builds on CDash, let me know and I can help you set it up. It should only take a few minutes. But we should talk about the best way to use this tool, IMHO, to help you stay on top of your builds and make sure you never miss new failing tests (otherwise the tool and CDash are not much use).

bartlettroscoe (Member, Author) commented:

This has really been done for a long time. We developed cdash_analyze_and_report.py as a simple command-line tool and it has been in TriBITS and snapshotted into Trilinos for many months. We have been using this to keep on top of the ATDM Trilinos builds with fairly good success. To see how we are running this, look at:

and the scripts it called in that protected repo.

Now there are several features that we want to add as documented in the version-controlled file:

We will add new GitHub issues for new features as we work on them (likely in the TriBITS GitHub repo or even in our ATDV JIRA project).

Closing as complete.
