Test KokkosKernels_sparse_openmp_MPI_1 still (randomly) timing out in build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white' #3168

Closed
bartlettroscoe opened this issue Jul 23, 2018 · 7 comments
Labels: client: ATDM (Any issue primarily impacting the ATDM project); PA: Data Services (Issues that fall under the Trilinos Data Services Product Area); pkg: KokkosKernels; type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Jul 23, 2018

@trilinos/kokkos-kernels, @srajama1 (Trilinos Linear Solver Product Lead)

Next Action Status

PR #3173 disabled this test in this build. There have been no timeouts of this test on any system since 9/21/2018 except in the build Trilinos-atdm-mutrino-intel-opt-openmp-KNL (which is being handled in #3864).

Description

As shown in this query, the test KokkosKernels_sparse_openmp_MPI_1 timed out at the 10-minute limit in the build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white'. The query also shows that the test had already been taking nearly 10 minutes to complete in this build going back to 7/1/2018.

Steps to Reproduce

Use the build name gnu-debug-openmp (not cuda-debug) on the machine 'white' and enable the package KokkosKernels (not MueLu) using the commands shown at:

NOTE: One cannot currently reproduce this on 'ride' because of the upgrade of 'white' but not 'ride'. See TRIL-215.

@bartlettroscoe added the labels type: bug (The primary issue is a bug in Trilinos code or tests); pkg: KokkosKernels; client: ATDM (Any issue primarily impacting the ATDM project) on Jul 23, 2018
@bartlettroscoe
Member Author

Looks like the efforts in #2827 to selectively disable some individual unit tests in KokkosKernels_sparse_openmp_MPI_1 have not reduced the runtime of this test in this build on this platform enough to avoid timeouts.

Looking at all of the platforms where this test ran yesterday in a debug-openmp build, this is the only build where the test is taking close to 10 minutes. It is running and passing pretty quickly in the intel-debug-openmp and gnu-debug-openmp builds on several other platforms, as shown in this query. Therefore, I would suggest it is not much of a loss to disable this test in this build on 'white'. Besides, 'white' and 'ride' are supposed to be ATS-2 semi-clones of Sierra, which uses GPUs and CUDA as the workhorse, so why do we care about OpenMP builds on that system anyway?
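The cross-build comparison above could be automated along these lines. This is only a minimal sketch: the function name `builds_near_timeout`, the 90% margin, and the sample build names and runtimes are all hypothetical and are not taken from CDash or from the Trilinos tooling. Only the 10-minute limit comes from this issue.

```python
TIMEOUT_SECONDS = 600.0  # the 10-minute test timeout described in this issue


def builds_near_timeout(runtimes, timeout=TIMEOUT_SECONDS, margin=0.9):
    """Return the sorted names of builds whose slowest recorded run
    took at least margin * timeout seconds.

    runtimes maps a build name to a list of runtimes in seconds.
    The 0.9 margin is an arbitrary illustrative choice.
    """
    return sorted(
        build
        for build, times in runtimes.items()
        if times and max(times) >= margin * timeout
    )


# Hypothetical runtimes in seconds, made up for illustration only.
sample = {
    "build-a-debug-openmp": [570.0, 598.0, 600.0],
    "build-b-debug-openmp": [45.0, 52.0],
    "build-c-opt-openmp": [12.0],
}

print(builds_near_timeout(sample))
```

With real per-build runtime history as input, a check like this would single out a build that consistently runs near the limit before it starts timing out at random.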

@mhoemmen
Contributor

so why do we care about OpenMP builds on that system anyway?

It's pretty likely that many apps won't find it worthwhile to run on the GPUs, so it would be nice to test OpenMP a little bit :-)

@bartlettroscoe changed the title from "Test Trilinos-atdm-white-ride-gnu-debug-openmp still (randomly) timing out in build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white'" to "Test KokkosKernels_sparse_openmp_MPI_1 still (randomly) timing out in build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white'" on Jul 23, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jul 23, 2018
…e/ride (trilinos#3168)

This test is run with a debug-openmp on other platforms and the test in the
gnu-opt-openmp build runs fine on this machine so this is not much of a
testing loss.

We really need to get away from these full debug-debug builds across all of
these platforms.
bartlettroscoe added a commit that referenced this issue Jul 24, 2018
…e/ride (#3168) (#3173)

This test is run with a debug-openmp on other platforms and the test in the
gnu-opt-openmp build runs fine on this machine so this is not much of a
testing loss.

We really need to get away from these full debug-debug builds across all of
these platforms.
@bartlettroscoe
Member Author

PR #3173, which disables this test, was merged. I will now put this issue into review and wait until tomorrow, 7/24/2018, to make sure the test is disabled.

@bartlettroscoe added the label stage: in review (Primary work is completed and now is just waiting for human review and/or test feedback) on Jul 24, 2018
@fryeguy52
Contributor

This query shows that the last time the test ran for this build was on 7/22/2018, which means it was disabled. Closing this issue.

@bartlettroscoe
Member Author

@fryeguy52, this same test is timing out in the same build gnu-debug-openmp on 'waterman' as shown here. Therefore, I think we need to disable this test for that build as well.

I will reopen this issue just as a reminder to do that.

@bartlettroscoe
Member Author

@fryeguy52, actually, our policy is not to close GitHub issues when we disable tests, but to put the "Disabled Tests" label on them and leave them open.

@bartlettroscoe added the label PA: Data Services (Issues that fall under the Trilinos Data Services Product Area) on Nov 29, 2018
@bartlettroscoe
Member Author

As shown in this query, since 9/21/2018 this test has failed or timed out only in the 'mutrino' builds Trilinos-atdm-mutrino-intel-opt-openmp-HSW and Trilinos-atdm-mutrino-intel-opt-openmp-KNL.

The timeouts in the build Trilinos-atdm-mutrino-intel-opt-openmp-KNL are being addressed in #3864. The one failure in the build Trilinos-atdm-mutrino-intel-opt-openmp-HSW shown here was caused by a problem with SLURM on 'mutrino', which reported the error:

srun: fatal: Invalid user id: 84966

The failure (not timeout) in the build shown here showed that same 'mutrino' SLURM error:

srun: fatal: Invalid user id: 84966

Therefore, since there have been no timeouts of this test on any system since 9/21/2018 except in the build Trilinos-atdm-mutrino-intel-opt-openmp-KNL (which is being handled in #3864), I think it is safe to close this issue.

@bartlettroscoe removed the label stage: in review (Primary work is completed and now is just waiting for human review and/or test feedback) on Dec 1, 2018