Set up a CUDA build for an auto PR build #2464

bartlettroscoe · 2018-03-27T20:52:55Z

CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott, @nmhamster

Description

This Issue is to scope out and track efforts to set up a CUDA build of Trilinos to be used as an auto PR build as described in #2317 (comment).

For this build it was agreed to use the ATDM build on white that is currently running and submitting to CDash. Questions about how to extend this build to be used as an auto PR build include:

Should all of the PT package tests be built and run or just the ones that the current ATDM build of Trilinos builds and runs? (Things can be set up either way.)
Does the machine 'white' have enough computing capacity in order to handle the load of builds needed for Trilinos PR testing?
Are the Jenkins jobs running the builds using the bsub command robust enough to be a reliable PR build?

Tasks:

Clean up the existing CUDA build on white until it is 100% clean [Done]
Set up an all-at-once nightly build that enables all PT packages and submits to CDash "Specialized" [Done]
Clean up the all-at-once nightly build for all PT packages (disable whatever should be disabled) ...
???

Related Issues:

Part of: Select set of builds for initial mandatory auto PR testing process #2317


bartlettroscoe · 2018-03-27T21:12:12Z

FYI: We asked @nmhamster about using the rhel7F nodes on white for auto PR builds and he said we could try this.

Note that the target for this build should be the Trilinos-atdm-white-ride-cuda-debug build and not the Trilinos-atdm-white-ride-cuda-opt build due to the large number of segfaulting tests on the latter build described in #2454. I will focus on cleaning up the cuda-debug build as there are just a few failing tests at this point.

Other than setting up the Jenkins job and cleaning up any Trilinos failures with this setup, my biggest concern is the stability of the Jenkins jobs on 'white'. For example, the history of the nightly build Trilinos-atdm-white-ride-cuda-debug on white, shown at:

https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-debug&field2=site&compare2=63&value2=white&field3=buildstarttime&compare3=84&value3=now

shows that the build only gets through all 25 of the packages (using the package-by-package method) about half the time. That is not great reliability for an auto PR build.

Perhaps the all-at-once configure, build, test and submit will be more robust? Our nightly build will tell that story.

mhoemmen · 2018-03-27T22:54:42Z

I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA, than no PR testing :-) .

bartlettroscoe · 2018-03-27T23:06:01Z

> I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA, than no PR testing :-) .

As we discussed at the meeting last Thursday, getting the CUDA build set up will not block moving to the new auto PR system as the way to push to Trilinos. The only build that has to be right is the GCC 4.8.4 build to replace and protect the current CI build (see #2462). I may work to set up this CUDA build as a post-push CI build until running jobs on 'white' is stable enough for an auto PR build. But the problem is, when the post-push CUDA CI build breaks, who is going to make sure it gets cleaned up ASAP? Right now, that looks to be me, so I am super motivated to get a CUDA build running as part of auto PR testing.

We need to write up the transition plan for moving to the auto PR system so there is no confusion about things like this.

…ds on white/ride (#2466) This will pave the way for adding this build as an auto PR build (see #2464).

mhoemmen · 2018-03-28T03:32:32Z

I will also invest time in Tpetra-related CUDA issues and other issues that my ATDM and Sierra customers care about.

bartlettroscoe · 2018-03-28T22:54:33Z

NOTE: The cuda-debug failures that occurred in the ATDM builds of Trilinos described in #2471 when KOKKOS_ENABLE_DEBUG=ON was set are more motivation that this auto PR CUDA build should be a cuda-debug build and not a cuda-opt build. That is, we should have debug-mode checking enabled. This is not a performance build of Trilinos but a correctness build.

bartlettroscoe · 2018-04-03T18:50:19Z

Note that with #2471 now resolved, the only impediment to using the ATDM Trilinos CUDA-debug build on 'white' as an auto PR build is to get the bsub command to stop terminating early on 'white'. I am meeting with Nathan G. on the Test Bed team today to discuss this problem.

…2464) All three of these builds were 100% clean on the CDash "Specialized" Group today. Note that the build Trilinos-atdm-white-ride-cuda-debug being clean now means that it can be used for automated PR testing for Trilinos.

This will show how long it takes to do an all-at-once build and we can see if 'bsub' crashes in the middle of a long build or not (or if it just crashes in the middle of running tests). This will let us see if this CUDA build on 'white' is a viable option for an auto PR test build of Trilinos (see #2464).

bartlettroscoe · 2018-04-06T00:57:43Z

Status update ...

The Trilinos-atdm-white-ride-cuda-debug-all-at-once build is 100% clean and was promoted to the "ATDM" CDash Group/Track on 4/3/2018 and completed all 25 packages today.

I set up an all-at-once version of this build in:

https://jenkins-son.sandia.gov/view/Trilinos%20ATDM/job/Trilinos-atdm-white-ride-cuda-debug-all-at-once/

and I fired it off to submit to CDash.

We will see how long an all-at-once build for this cuda-debug build takes.

…evelop * 'develop' of https:/trilinos/Trilinos: (28 commits)
  MueLu: possible fix for issue trilinos#2340
  MueLu: fix warnings
  MueLu ParameterListInterpreter test: Check return value of sed call
  MueLu ParameterListInterpreter test: Update gold files for SC=complex
  ShyLU: doc/intro rm unneeded file
  ShyLU: Add template for intro guide
  MiniTensor: Minor cosmetic changes.
  Add all-at-once cuda-debug build for white (trilinos#2464)
  Allow Trilinos_TRACK to already be set in env (trilinos#2511, TRIL-171)
  Fixes issue with Teuchos::SerialDenseMatrix randomization in parallel
  Sacado: Add explicit include for Cuda/Vectorization.hpp
  MueLu: And even working...
  MiniTensor: Added warning features, acceptable residual criteria, and stagnation checks
  MueLu: Additions compile now
  Promote newly passing builds to "ATDM" CDash Track/Group (TRIL-171, trilinos#2464)
  Set mpiexec --mca orte_abort_on_non_zero_status 0 ... (TRIL-198)
  Add ATDM_CONFIG_MPI_PRE_FLAGS (TRIL-198)
  Stokhos: Add specialization of Belos::MultiVecTraits for Xpetra.
  Allow MPI_EXEC to be specialized by platform/build setup (TRIL-198)
  Use default value in ATDM_SET_ATDM_VAR_FROM_ENV_AND_DEFAULT() (TRIL-171)
  ...

To make this happen, I had to copy ATDMDevEnv.cmake to ATDMDevEnvSettings.cmake and take out the include of ATDMDisables.cmake. I then made the file ATDMDevEnv.cmake include ATDMDevEnvSettings.cmake and ATDMDisables.cmake. I then made the ctest -S driver respond to the cache var ATDM_CONFIG_ENABLE_ALL_PACKAGE, which allows all of the Primary Tested packages to be enabled.

The new build I added, Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once, enables all Primary Tested packages in Trilinos, including the ones that EMPIRE did not enable. However, the configure of Trilinos for this build currently fails with the error:

Processing enabled package: ShyLU_Node (Tacho, Tests, Examples)
CMake Error at packages/shylu/shylu_node/tacho/CMakeLists.txt:8 (MESSAGE):
  ShyLu/Tacho requires CUDA relocatable device code to be enabled if CUDA is
  enabled.  Set: Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON

This will need to be fixed after this starts submitting to CDash.

bartlettroscoe · 2018-04-11T13:46:04Z

FYI: I set up an all-at-once cuda-debug build for Trilinos for all 53 Primary Tested packages on 'white' and 'ride' submitting to CDash as the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once as shown at:

https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-04-11&filtercount=1&showfilters=1&field1=buildname&compare1=63&value1=-atdm-

This currently fails the configure of ShyLU_Node as shown at:

https://testing-vm.sandia.gov/cdash/viewConfigure.php?buildid=3415664

showing:

Processing enabled package: ShyLU_Node (Tacho, Tests, Examples)
CMake Error at packages/shylu/shylu_node/tacho/CMakeLists.txt:8 (MESSAGE):
  ShyLu/Tacho requires CUDA relocatable device code to be enabled if CUDA is
  enabled.  Set: Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON

The current ATDM build of Trilinos does not enable ShyLU so those builds are not showing that configure failure.

But note that SPARC does use ShyLU_Node (or at least some of its subpackages get enabled).

Therefore, for now, I would recommend that we disable ShyLU_Node in this initial PT CUDA build of Trilinos targeted for PR testing (but not disable ShyLU_Node or anything else in the other auto PR builds). Getting something up is better than nothing for auto PR testing.

@william76 and @jwillenbring, do you agree?

srajama1 · 2018-04-11T16:31:48Z

@bartlettroscoe : Nope. Tacho is the primary test case for Kokkos tasking. It has exposed a lot of subtle issues before. I wouldn't disable it, but I would set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON, which is a requirement set by Kokkos. I believe we need tasking for 1 out of the 2 ATDM applications.

ibaned · 2018-04-11T16:39:46Z

@srajama1 there is a bug in the NVIDIA linker which prevents many Trilinos packages from compiling with relocatable device code on (with tests enabled). This will probably limit how much of Trilinos we are able to test in that configuration.

srajama1 · 2018-04-11T16:44:53Z

@ibaned : I didn't know about this linker bug. Do you know which feature of Kokkos tasking requires relocatable device code?

ibaned · 2018-04-11T17:56:49Z

@srajama1 I have been told the entire system known as "Kokkos tasking" (e.g. the spawn-based system) requires relocatable device code. I don't know in detail what parts would break if we don't have it enabled. This is a difficult trade-off, but at the moment that is the situation. I think @rppawlo has built reasonable subsets of Trilinos with relocatable device code, but maybe not with tests enabled.

srajama1 · 2018-04-11T18:00:11Z

Ah, I believe most of the use cases that are needed by this PR (PR testing on GPUs for ATDM apps) would have been covered by @rppawlo . I believe we should be able to use relocatable device code in that case, assuming tests work.

bartlettroscoe · 2018-04-11T18:19:30Z

As of now, the EMPIRE configuration of Trilinos (which this current ATDM Trilinos configuration is matching) does NOT set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON (hence the error shown above).

But it looks like some of the SPARC configurations of Trilinos do set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON as shown by:

$ cd <sparc-tpl-base-dir>/
$ grep -nH Kokkos_ENABLE_Cuda_Relocatable_Device_Code *.sh
do-cmake_trilinos_cee-gpu_cuda_gcc_openmpi.sh:158:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=OFF \
do-cmake_trilinos_ride-gpu_gcc_cuda_openmpi.sh:147:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON \
do-cmake_trilinos_shiller-gpu_gcc_cuda_openmpi.sh:157:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON \

@micahahoward,

Is SPARC really using relocatable device code with Kokkos on 'ride'? Does this work with all of the Trilinos packages currently used by SPARC?

@rppawlo and @nmhamster,

Is it worth trying to set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON in this full CUDA configuration of Trilinos which includes Phalanx, Panzer and other packages (that are not used by SPARC) on 'white'/'ride'?

rppawlo · 2018-04-11T18:22:06Z

I have only built up to phalanx testing with relocatable device code enabled. It was for experimenting with the device DAG support for assembly in phalanx. I did not test panzer, Tpetra or the linear solver stack as this only involved assembly. We only came across one real build issue in sacado due to a static variable for the kokkos memory pool. We needed an ifdef to change the static declaration depending on whether RDC was enabled. Everything else seemed to work fine. I would give it a shot.

micahahoward · 2018-04-11T18:47:55Z

Short answer: no on using RDC with SPARC.

We have issues with RDC in SPARC. I've backed this off in our Trilinos config but haven't pushed those changes to our sparc/Trilinos repo.

bartlettroscoe · 2019-01-24T14:59:55Z

CC: @srajama1

@trilinos/framework, please make sure the new CUDA PR build on 'ride' enables the ShyLU_DD package. As of today, it should be 100% clean in that CUDA 9.2 build on 'ride' (see #3541). We need the CUDA PR build to protect the ShyLU_DD package's CUDA build.

bartlettroscoe · 2019-02-06T14:52:59Z

FYI: While reviewing PR #4332, I just noticed that there is now a Trilinos_pullrequest_cuda_9.2-83 build as shown in #4332 (comment). This is a PR that changes Teuchos so it should test everything downstream in Trilinos. Looking at the configure output on CDash here, we can see the following explicit disables:

Explicitly disabled packages on input (by user or by default):  Claps SEACAS Trios Komplex TriKota Moertel PyTrilinos NewPackage 8

This shows that SEACAS is being disabled. Since SEACAS plays a critical role in ATDM, we need to get SEACAS enabled ASAP to protect ATDM and other important customers. This build should 100% match the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt which builds and tests SEACAS just fine as shown, for example, today here.

As I told @jwillenbring, I will look into what the problem with this CUDA PR build is and get it to match the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt and then create a PR to enable SEACAS again.

bartlettroscoe · 2019-03-08T14:42:53Z

FYI: I am working on fixing the problems in the Trilinos CUDA PR build. A few things I am seeing right away:

The KOKKOS_ARCH was wrong. It was setting 'Power9' when the correct arch is 'Power8,Kepler37'. (The KOKKOS_ARCH can be seen on CDash in the configure output.) This likely explains the set of Tpetra tests that got disabled in the PR build and perhaps why the STK tests got disabled (also since BoostLib was disabled). (There are no failing Tpetra tests in the matching ATDM Trilinos PR build.)
The wrong set of TPLs was enabled. (The correct set can be seen in the configure output on CDash.)
The libraries for Netcdf were using "${<TPL_NAME>_ROOT}" instead of "$ENV{<TPL_NAME>_ROOT}". The former does not read env vars in CMake, so they were not pointing to any real libraries on the system.
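The `${<TPL_NAME>_ROOT}` versus `$ENV{<TPL_NAME>_ROOT}` mix-up in the last item is easy to scan for. Here is a minimal sketch, assuming a POSIX shell with grep. The function name `find_non_env_root_refs` and the two-line settings file are hypothetical and only illustrate the pattern; they are not taken from the actual PR settings file.

```shell
#!/bin/sh
# Sketch: flag "${<TPL_NAME>_ROOT}" references in a CMake settings file.
# CMake expands ${VAR} from CMake variables and $ENV{VAR} from the process
# environment, so only the latter sees values exported by an env script.

# Print (with line numbers) every line that references ${..._ROOT}.
# Lines using $ENV{..._ROOT} do not match because they contain no "${".
find_non_env_root_refs() {
  grep -nE '\$\{[A-Za-z0-9_]+_ROOT\}' "$1"
}

# Hypothetical settings file used only for this demonstration.
demo_dir=$(mktemp -d)
cat > "$demo_dir/settings.cmake" <<'EOF'
set(Netcdf_LIBRARY_DIRS "${Netcdf_ROOT}/lib" CACHE PATH "")
set(HDF5_LIBRARY_DIRS "$ENV{HDF5_ROOT}/lib" CACHE PATH "")
EOF

bad_refs=$(find_non_env_root_refs "$demo_dir/settings.cmake")
printf '%s\n' "$bad_refs"
rm -r "$demo_dir"
```

In this demonstration only the Netcdf line is reported, since the HDF5 line already reads the environment.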

I will run the configures and builds and get these to match up. Once I have everything matched up and the correct set of test disables added, I will post a PR to merge in this configuration. I will also provide detailed instructions on how I did this so others can copy this process in the future for future PR builds.

…ilinos config (trilinos#2464)

Several issues were fixed:

* The correct KOKKOS_ARCH is now set
* The correct set of TPLs is now enabled
* The TPL includes and libs are now set correctly
* Several other critical options were set
* Disables for already known failing tests were set

This was done by simply diffing the cmake STDOUT and CMakeCache.txt files. Details will be provided in a comment in trilinos#2464.

…ilinos#4551, trilinos#2464) Has to be disabled for the CUDA PR build. Note, before this, no STK tests were being enabled at all.

bartlettroscoe · 2019-03-09T15:27:50Z

I just posted PR #4592 which fixes the CUDA PR build. It enables all of SEACAS and STK (and all their tests) and it enables all 160 Panzer BASIC tests. And they all pass (except for one recent known STK test failure with CUDA described in #4551). The process I used to fix the build was pretty simple and is described in detail below (so that others can follow a similar process in the future). With all the configure iterations, it took about 4 hours to complete this matching (mostly because the configure is slow on the NFS-mounted drive on 'ride'). Once I got the configure diffs to match up, the build and tests ran right out of the box (with the one expected failing STK test).

Details on the process to fix the CUDA PR build to match the ATDM Trilinos build

First, I set up the build dir:

/home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/cuda-9.2-gnu-7.2.0-release-debug-pt/

with the files load-env.sh:

source /home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh \
  cuda-9.2-gnu-7.2.0-release-debug-pt

and do-configure:

cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DKokkos_ENABLE_Profiling=ON \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_TEST_CATEGORIES=BASIC \
-DTrilinos_TRACE_ADD_TEST=ON \
-DDART_TESTING_TIMEOUT:STRING=600.0 \
-DTrilinos_ENABLE_CONFIGURE_TIMING=ON \
"$@" \
/home/rabartl/Trilinos.base/Trilinos

(NOTE: I set some of these options to better match the Trilinos CUDA PR build settings.)

I ran the base configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/cuda-9.2-gnu-7.2.0-release-debug-pt/

$ . load-env.sh
Hostname 'ride6' matches known ATDM host 'ride' and system 'ride'
Setting compiler and build options for buld name 'cuda-9.2-gnu-7.2.0-release-debug-pt'
Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37

$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON \
  &> configure.out

real    6m34.854s
user    2m56.180s
sys     1m19.912s

(NOTE: The Trilinos CUDA PR build always sets Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON, which is actually not desirable if you just want to reproduce your one-package build.)

I then set up a configure and build directory for the Trilinos CUDA PR configuration:

/home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

with files load-env.sh:

module purge
export WORKSPACE=/home/rabartl/Trilinos.base
source /home/rabartl/Trilinos.base/Trilinos/cmake/std/sems/PullRequestCuda9.2TestingEnv.sh

and do-configure:

cmake \
-GNinja \
-C /home/rabartl/Trilinos.base/Trilinos/cmake/std/PullRequestLinuxCuda9.2TestingSettings.cmake \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DTrilinos_TRACE_ADD_TEST=ON \
-DDART_TESTING_TIMEOUT:STRING=600.0 \
"$@" \
/home/rabartl/Trilinos.base/Trilinos

I ran the configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

real    4m58.255s
user    2m35.155s
sys     0m26.741s

I created the script create_normalized_cmake_output_files.sh:

#!/bin/bash

# Get build-dir name from argument
BUILD_DIR_NAME=$1
echo "BUILD_DIR_NAME='${BUILD_DIR_NAME}'"

set -x

cat CMakeCache.txt | grep -v "^$" | grep -v "^//" | grep -v "^#" | sort \
  > CMakeCache.normalized.txt

~/Trilinos.base/Trilinos/commonTools/refactoring/token-replace.pl \
  ${BUILD_DIR_NAME} GENERIC_BUILD_DIR \
   CMakeCache.normalized.txt CMakeCache.normalized.txt

~/Trilinos.base/Trilinos/commonTools/refactoring/token-replace.pl \
  ${BUILD_DIR_NAME} GENERIC_BUILD_DIR \
   configure.out configure.normalized.out
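For readers without access to token-replace.pl, the same normalization idea can be sketched with standard tools. This is only an approximation under stated assumptions: `sed` stands in for token-replace.pl (whose exact behavior is not reproduced here), the function name `normalize_cmake_cache` is hypothetical, the sample cache entries are made up, and `LC_ALL=C` is added so the sort order is the same on every machine.

```shell
#!/bin/sh
# Sketch: strip blank and comment lines from a CMakeCache.txt, sort it, and
# replace the build-dir name with a generic token so two caches can be diffed.
# NOTE: the build-dir name is used as a sed regex, so a '.' in it matches any
# character. That is acceptable for this sketch but not an exact token match.
normalize_cmake_cache() {
  build_dir_name=$1
  cache_in=$2
  cache_out=$3
  grep -v '^$' "$cache_in" | grep -v '^//' | grep -v '^#' | LC_ALL=C sort \
    | sed "s|$build_dir_name|GENERIC_BUILD_DIR|g" > "$cache_out"
}

# Hypothetical cache contents used only for this demonstration.
demo_dir=$(mktemp -d)
cat > "$demo_dir/CMakeCache.txt" <<'EOF'
# This is the CMakeCache file.

//Path to the build tree (hypothetical entry)
Trilinos_BINARY_DIR:STATIC=/scratch/BUILDS/pull-request-cuda-9.2
CMAKE_BUILD_TYPE:STRING=RELEASE
EOF

normalize_cmake_cache pull-request-cuda-9.2 \
  "$demo_dir/CMakeCache.txt" "$demo_dir/CMakeCache.normalized.txt"
normalized=$(cat "$demo_dir/CMakeCache.normalized.txt")
printf '%s\n' "$normalized"
rm -r "$demo_dir"
```

Only the two cache entries survive, sorted, with the build-dir name replaced by GENERIC_BUILD_DIR.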

I then ran it as:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ cd cuda-9.2-gnu-7.2.0-release-debug-pt/

$ ../../../create_normalized_cmake_output_files.sh cuda-9.2-gnu-7.2.0-release-debug-pt

$ cd ..

$ cd pull-request-cuda-9.2/

$ ../../../create_normalized_cmake_output_files.sh pull-request-cuda-9.2

$ cd ..

I then compared the two sets of files with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ diff \
    cuda-9.2-gnu-7.2.0-release-debug-pt/configure.normalized.out \
    pull-request-cuda-9.2/configure.normalized.out \
  | less

$ diff \
    cuda-9.2-gnu-7.2.0-release-debug-pt/CMakeCache.normalized.txt \
    pull-request-cuda-9.2/CMakeCache.normalized.txt \
  | less

One difference I noted was:

74c50
< Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota NewPackage 4
---
> Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota PyTrilinos NewPackage 5

It is not necessary to explicitly disable PyTrilinos in this context because it is not a Primary Tested package (so it would not get enabled). But that should be harmless as it will not impact the final set of enabled and non-enabled SE Packages and TPLs.
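A difference like this one can also be pulled out directly rather than read off the raw diff. The following is a rough sketch, assuming the "Explicitly disabled packages" line has the format shown above (package names followed by a trailing count). The function name `disabled_packages` and the temporary file names are hypothetical.

```shell
#!/bin/sh
# Sketch: list the explicitly disabled packages from a configure output file,
# one per line and sorted, dropping the trailing package count.
disabled_packages() {
  grep 'Explicitly disabled packages' "$1" \
    | sed 's/^.*: *//' \
    | tr ' ' '\n' \
    | grep -v '^[0-9]*$' \
    | LC_ALL=C sort
}

# The two lines from the diff above, written to demonstration files.
demo_dir=$(mktemp -d)
cat > "$demo_dir/atdm.out" <<'EOF'
Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota NewPackage 4
EOF
cat > "$demo_dir/pr.out" <<'EOF'
Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota PyTrilinos NewPackage 5
EOF

disabled_packages "$demo_dir/atdm.out" > "$demo_dir/atdm.pkgs"
disabled_packages "$demo_dir/pr.out" > "$demo_dir/pr.pkgs"

# comm -13 keeps only the lines unique to the second (PR build) list.
only_in_pr=$(LC_ALL=C comm -13 "$demo_dir/atdm.pkgs" "$demo_dir/pr.pkgs")
printf '%s\n' "$only_in_pr"
rm -r "$demo_dir"
```

For the two lines above this prints only PyTrilinos, which matches the difference discussed here.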

Using the script configure-pr-build-and-diff.sh:

#!/bin/bash

cd pull-request-cuda-9.2/

. load-env.sh

rm -r CMake*
time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

../../../create_normalized_cmake_output_files.sh pull-request-cuda-9.2

cd ..

diff \
  cuda-9.2-gnu-7.2.0-release-debug-pt/configure.normalized.out \
  pull-request-cuda-9.2/configure.normalized.out \
  | less

diff \
  cuda-9.2-gnu-7.2.0-release-debug-pt/CMakeCache.normalized.txt \
  pull-request-cuda-9.2/CMakeCache.normalized.txt \
  | less

I did several iterations of modifying the file PullRequestLinuxCuda9.2TestingSettings.cmake and running:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ configure-pr-build-and-diff.sh

and carefully inspecting the diffs until I got them pretty close. The final diffs are shown in :

The diffs that remained should not impact what builds and what passes and fails.

I then did a full build and ran the test suite on 'ride' using the script run_all.sh:

#!/bin/bash -e
. load-env.sh
rm -r CMake* || echo "no CMake files to remove!"
time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out
time ninja -j64 &> make.out
time ctest -j8 &> ctest.out

and I ran this with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

$ bsub -x -Is -q rhel7F -n 16 ./run_all.sh

***Forced exclusive execution
Job <854522> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride12>>
rm: cannot remove ‘CMake*’: No such file or directory
no CMake files to remove!

real    4m19.775s
user    2m36.281s
sys     0m36.791s

real    172m17.763s
user    8615m25.206s
sys     865m48.777s

That gave the test results:

99% tests passed, 1 tests failed out of 2936

Subproject Time Summary:
Amesos                    =  27.54 sec*proc (13 tests)
Amesos2                   =  35.31 sec*proc (8 tests)
Anasazi                   = 332.37 sec*proc (74 tests)
AztecOO                   =  26.23 sec*proc (17 tests)
Belos                     = 415.65 sec*proc (100 tests)
Domi                      = 232.67 sec*proc (125 tests)
Epetra                    =  85.45 sec*proc (63 tests)
EpetraExt                 =  26.28 sec*proc (10 tests)
FEI                       =  43.59 sec*proc (43 tests)
Galeri                    =  12.38 sec*proc (9 tests)
GlobiPack                 =   2.83 sec*proc (6 tests)
Ifpack                    =  99.66 sec*proc (48 tests)
Ifpack2                   = 363.30 sec*proc (45 tests)
Intrepid                  = 383.97 sec*proc (143 tests)
Intrepid2                 = 544.01 sec*proc (267 tests)
Isorropia                 =  13.20 sec*proc (6 tests)
Kokkos                    = 170.39 sec*proc (27 tests)
KokkosKernels             = 167.29 sec*proc (8 tests)
ML                        =  75.86 sec*proc (34 tests)
MiniTensor                =   3.52 sec*proc (2 tests)
MueLu                     = 2782.78 sec*proc (105 tests)
NOX                       = 290.93 sec*proc (106 tests)
OptiPack                  =   7.93 sec*proc (5 tests)
Panzer                    = 8737.40 sec*proc (163 tests)
Phalanx                   =  19.35 sec*proc (27 tests)
Pike                      =   3.77 sec*proc (7 tests)
Piro                      =  47.12 sec*proc (13 tests)
ROL                       = 1306.07 sec*proc (164 tests)
RTOp                      =  19.57 sec*proc (24 tests)
Rythmos                   =  68.87 sec*proc (83 tests)
SEACAS                    =  22.87 sec*proc (23 tests)
STK                       =  95.71 sec*proc (15 tests)
Sacado                    = 169.57 sec*proc (300 tests)
Shards                    =   1.42 sec*proc (4 tests)
ShyLU_DD                  = 300.03 sec*proc (37 tests)
Stokhos                   = 156.84 sec*proc (84 tests)
Stratimikos               =  37.45 sec*proc (39 tests)
Teko                      = 592.89 sec*proc (18 tests)
Tempus                    = 402.99 sec*proc (80 tests)
Teuchos                   = 161.02 sec*proc (137 tests)
Thyra                     = 102.25 sec*proc (82 tests)
Tpetra                    = 1158.23 sec*proc (201 tests)
TrilinosCouplings         =  32.43 sec*proc (22 tests)
TrilinosFrameworkTests    =   5.50 sec*proc (4 tests)
Triutils                  =   3.83 sec*proc (2 tests)
Xpetra                    = 262.52 sec*proc (18 tests)
Zoltan                    = 345.53 sec*proc (14 tests)
Zoltan2                   = 553.05 sec*proc (111 tests)

Total Test time (real) = 2654.55 sec

The following tests FAILED:
	2036 - STKUnit_tests_stk_ngp_test_utest_MPI_4 (Failed)
Errors while running CTest

See, we now have 23 SEACAS tests, 15 STK tests and 163 Panzer tests! Before there were only 60 Panzer tests as shown, for example, in this recent CUDA PR build and no SEACAS or STK tests.
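Per-package test counts like these can be extracted from ctest.out instead of being read by eye. A small sketch, assuming the "Subproject Time Summary" line format shown above; the function name `subproject_test_counts` is hypothetical and the sample file holds just three of the lines from the summary.

```shell
#!/bin/sh
# Sketch: print "<subproject> <number of tests>" for each line of the
# "Subproject Time Summary" section of a ctest output file.
subproject_test_counts() {
  awk '$2 == "=" && $4 == "sec*proc" {
         n = $5
         sub(/^\(/, "", n)   # "(163" -> "163"
         print $1, n
       }' "$1"
}

# A few lines copied from the summary above, for demonstration.
demo_dir=$(mktemp -d)
cat > "$demo_dir/ctest.out" <<'EOF'
Subproject Time Summary:
Panzer                    = 8737.40 sec*proc (163 tests)
SEACAS                    =  22.87 sec*proc (23 tests)
STK                       =  95.71 sec*proc (15 tests)

Total Test time (real) = 2654.55 sec
EOF

counts=$(subproject_test_counts "$demo_dir/ctest.out")
printf '%s\n' "$counts"
rm -r "$demo_dir"
```

Running the same function on the ctest.out of an older PR build would make a drop in SEACAS, STK, or Panzer test counts visible right away.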

The only failing test was STKUnit_tests_stk_ngp_test_utest_MPI_4 which is already known to be failing as described in #4551. Therefore, I added a disable for that test as well. To check that disable I did a new configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/
$ . load-env.sh
$ cmake . &> configure.reconfig.out

which showed:

$ grep STKUnit_tests_stk_ngp_test_utest_MPI_4 configure.reconfig.out 
-- STKUnit_tests_stk_ngp_test_utest_MPI_4: Added test (BASIC, NUM_MPI_PROCS=4, PROCESSORS=4)!

I then cleaned up the commits and created the PR #4592.

alanw0 · 2019-03-09T16:48:13Z

As I said on the other issue, we will try to get a stk update in, to fix the failing test, asap.

bartlettroscoe · 2019-03-09T20:03:52Z

@trilinos/framework,

With more code being enabled in the CUDA PR build due to the changes in PR #4592, the build times have gone up a lot, as shown in the PR build #4592 on CDash. That shows a build time of 4h 49s! But if you look at the history over the last 3 weeks of the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt, which this PR build is supposed to be duplicating, you can see the build times are just under 3 hours.

This must mean that the number of build processes is not correct. If you look at the Jenkins build for these at:

https://jenkins-srn.sandia.gov/view/Trilinos%20ATDM/job/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/

and look at the output for example at:

https://jenkins-srn.sandia.gov/view/Trilinos%20ATDM/job/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/80/consoleFull

you can see:

03:03:07 -- CTEST_BUILD_FLAGS='-j64 -k 999999'
...
03:03:07 -- CTEST_PARALLEL_LEVEL='8'

So you want to set 64 build processes and 8 parallel ctest MPI processes (i.e. ctest -j8).
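To compare these settings between two Jenkins jobs, the values can be pulled out of a saved console log. A minimal sketch, assuming the log lines have the `-- NAME='value'` form shown above; the function name `ctest_driver_setting` and the file name console.log are hypothetical.

```shell
#!/bin/sh
# Sketch: print the value of a "-- NAME='value'" line from a console log.
# $1 = variable name, $2 = console log file
ctest_driver_setting() {
  sed -n "s/^.*-- $1='\(.*\)'\$/\1/p" "$2"
}

# The console lines quoted above, saved to a demonstration file.
demo_dir=$(mktemp -d)
cat > "$demo_dir/console.log" <<'EOF'
03:03:07 -- CTEST_BUILD_FLAGS='-j64 -k 999999'
...
03:03:07 -- CTEST_PARALLEL_LEVEL='8'
EOF

build_flags=$(ctest_driver_setting CTEST_BUILD_FLAGS "$demo_dir/console.log")
parallel_level=$(ctest_driver_setting CTEST_PARALLEL_LEVEL "$demo_dir/console.log")
printf 'build flags: %s\n' "$build_flags"
printf 'ctest parallel level: %s\n' "$parallel_level"
rm -r "$demo_dir"
```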

bartlettroscoe · 2019-03-09T21:05:05Z

CC: @dridzal, @rppawlo

@trilinos/framework,

The updated CUDA PR build in PR #4592 failed with 6 failing tests: 1 timing-out ROL test, 3 timing-out Panzer tests, and 2 failing Panzer tests that show CUDA allocation "out of memory" failures. This is likely due to using too high of a parallel level with ctest -j<N>. From looking at the Jenkins output for this build here it showed:

Parallel level           = 29

If that means that it is using ctest -j29, that is way too high. This needs to be lowered to ctest -j8 as described above. That will increase the test wallclock time a little but it will result in all passing tests.
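A check like this could be automated so that an overly high parallel level is caught before tests start timing out. This is only a sketch under assumptions: it assumes the log contains a line of the form "Parallel level = N" as shown above, the limit of 8 is the value suggested in this comment for 'ride', and the function name `check_parallel_level` and the file name console.log are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the "Parallel level = N" value in a log exceeds a limit.
# $1 = log file, $2 = maximum allowed parallel level
# Returns 0 if within the limit, 1 if too high, 2 if no level was found.
check_parallel_level() {
  level=$(sed -n 's/^ *Parallel level *= *\([0-9][0-9]*\) *$/\1/p' "$1" | head -n 1)
  if [ -z "$level" ]; then
    echo "no parallel level found"
    return 2
  fi
  if [ "$level" -gt "$2" ]; then
    echo "parallel level $level exceeds limit $2"
    return 1
  fi
  echo "parallel level $level is within limit $2"
}

# The line quoted above, saved to a demonstration file.
demo_dir=$(mktemp -d)
cat > "$demo_dir/console.log" <<'EOF'
Parallel level           = 29
EOF

check_msg=$(check_parallel_level "$demo_dir/console.log" 8) \
  && check_status=0 || check_status=$?
printf '%s (status %s)\n' "$check_msg" "$check_status"
rm -r "$demo_dir"
```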

Can someone on the @trilinos/framework team please update this CUDA PR build to use 64 parallel build processes and only 8 parallel ctest processes on 'ride'? That will result in a total wall-clock time of a little over 4 hours in the worst case, whereas now the CUDA PR build looks like it takes almost 5 hours and results in failing and timing-out tests.

…igh (trilinos#2464) Currently the Trilinos CUDA PR build is running on 'ride' with `ctest -j29`. This causes tests to time out and crash due to running out of CUDA memory. This job needs to be reduced to only use `ctest -j8`.

…r-build-config Automatically Merged using Trilinos Pull Request AutoTester PR Title: Fix CUDA PR build to enable SEACAS, STK, and 103 extra Panzer tests (#2464) PR Author: bartlettroscoe

…s:develop' (625e220). * trilinos-develop:
  Temp disable some tests failing because ctest parallel level is too high (trilinos#2464)
  Ifpack2 - use KOKKOS_RESTRICT
  Ifpack2 - remove shadow warning
  Ifpack2 - add static inline to remove multiple definition of functions
  Disable known failing test STKUnit_tests_stk_ngp_test_utest_MPI_4 (trilinos#4551, trilinos#2464)
  WIP: Update CUDA PR build settings to correctly match working ATDM Trilinos config (trilinos#2464)
  No need to set new AAO features after cmake 3.10.0 upgrade (trilinos#1761)
  Ifpack2 - fix a typo trilinos#4388
  Ifpack2 - change vector loop
  Ifpack2 - check point for debugging
  Ifpack2 - put profiler stop at the beginning of test
  Ifpack2 - little bit of improvement on extract part
  Ifpack2 - remove unused impl
  KokkosBatched - ifpack2 need some new functions from updated kokkoskernels
  Ifpack2 - improvement on block spmv
  Ifpack2 - for jacobi solver, invert diagonals and solve with gemv
  Ifpack2 - improvement by using large team size

bartlettroscoe · 2019-03-29T11:59:44Z

@trilinos/framework, is this done done? Has the parallel test level been reduced to 8 (or so) and have all the temp disables been removed?

bartlettroscoe · 2019-05-03T22:42:05Z

This has been done for a while. The CUDA PR build looks to be one of the most robust PR builds being used. Closing as complete.

nmhamster · 2019-05-03T22:46:33Z

Hooray!

bartlettroscoe mentioned this issue Mar 27, 2018

Select set of builds for initial mandatory auto PR testing process #2317

Closed

bartlettroscoe mentioned this issue Mar 28, 2018

Address failing testing test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 in the debug builds on Power8 white and ride and Power9 waterman #2466

Closed

bartlettroscoe added a commit that referenced this issue Mar 28, 2018

Disable test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 on 'debug' buil…

a68547f

…ds on white/ride (#2466) This will pave the way for adding this build as an auto PR build (see #2464).

This was referenced Mar 28, 2018

New failing tests in ATDM debug builds of Trilinos due to KOKKOS_ENABLE_DEBUG=ON being set #2471

Closed

Kokkos_ENABLE_Debug_Bounds_Check=ON and KOKKOS_ENABLE_DEBUG=ON by default if Trilinos_ENABLE_DEBUG=ON #2439

Closed

This was referenced Mar 28, 2018

Tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 failing in 'debug' builds on white/ride #2473

Closed

Test TeuchosComm_TimeMonitor_UnitTests_MPI_3 randomly failing in ATDM builds of Trilinos #2487

Closed

bartlettroscoe added type: enhancement Issue is an enhancement, not a bug client: ATDM Any issue primarily impacting the ATDM project labels Apr 3, 2018

bartlettroscoe added this to the Improve productivity, stability, and quality of Trilinos milestone Apr 3, 2018

jwillenbring mentioned this issue Jan 29, 2019

Disable Tpetra test for CUDA PR build #4293

Merged

ZUUL42 mentioned this issue Feb 14, 2019

Reenable CUDA options #4397

Merged

9 tasks

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 8, 2019

WIP: Fix CUDA PR build settings (trilinos#2464)

057a183

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 8, 2019

WIP: More matching of the ATDM CUDA build (trilinos#2464)

55cff7f

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 8, 2019

WIP: Match more setting (trilinos#2464)

49562a0

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 8, 2019

WIP: More options matching (trilinos#2464)

6d856b6

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 8, 2019

WIP: More options matching (trilinos#2464)

7a4f89f

bartlettroscoe mentioned this issue Mar 9, 2019

Fix CUDA PR build to enable SEACAS, STK, and 103 extra Panzer tests (#2464) #4592

Merged

bartlettroscoe added the label "ATDM DevOps" (Issues that will be worked by the Coordinated ATDM DevOps teams) Mar 9, 2019

This was referenced Mar 13, 2019

PR test timeouts preventing PRs from passing #4614

Closed

Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 failing in ATDM cuda builds #4622

Closed

Build and test failures in ATDM RDC builds on white and waterman #4502

Closed

bartlettroscoe closed this as completed May 3, 2019
