Set up a CUDA build for an auto PR build #2464

Closed
bartlettroscoe opened this issue Mar 27, 2018 · 74 comments
Labels
ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams client: ATDM Any issue primarily impacting the ATDM project Framework tasks Framework tasks (used internally by Framework team) PA: Framework Issues that fall under the Trilinos Framework Product Area system: gpu type: enhancement Issue is an enhancement, not a bug

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 27, 2018

CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott, @nmhamster

Description

This Issue is to scope out and track efforts to set up a CUDA build of Trilinos to be used as an auto PR build as described in #2317 (comment).

For this build it was agreed to use the ATDM build on 'white' that is currently running and submitting to CDash. Questions about how to extend this build to be used as an auto PR build include:

  • Should all of the PT package tests be built and run or just the ones that the current ATDM build of Trilinos builds and runs? (Things can be set up either way.)
  • Does the machine 'white' have enough computing capacity in order to handle the load of builds needed for Trilinos PR testing?
  • Are the Jenkins jobs running the builds using the bsub command robust enough to be a reliable PR build?

Tasks:

  1. Clean up the existing CUDA build on white until it is 100% clean [Done]
  2. Set up an all-at-once nightly build that enables all PT packages and submits to CDash "Specialized" [Done]
  3. Clean up the all-at-once nightly build for all PT packages (disable whatever should be disabled) ...
  4. ???

Related Issues:

@bartlettroscoe
Member Author

FYI: We asked @nmhamster about using the rhel7F nodes on white for auto PR builds and he said we could try this.

Note that the target for this build should be the Trilinos-atdm-white-ride-cuda-debug build and not the Trilinos-atdm-white-ride-cuda-opt build due to the large number of segfaulting tests on the latter build described in #2454. I will focus on cleaning up the cuda-debug build as there are just a few failing tests at this point.

Other than setting up the Jenkins job and cleaning up any Trilinos failures with this setup, my biggest concern is the stability of the Jenkins jobs on 'white'. For example, look at the history of the nightly build Trilinos-atdm-white-ride-cuda-debug on 'white' shown at:

It only gets through all 25 of the packages (using the package-by-package method) about half the time. That is not great reliability for an auto PR build.

Perhaps the all-at-once configure, build, test and submit will be more robust? Our nightly build will tell that story.

@mhoemmen
Contributor

I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA, than no PR testing :-) .

@bartlettroscoe
Member Author

> I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA, than no PR testing :-) .

As we discussed at the meeting last Thursday, getting the CUDA build set up will not block moving to the new auto PR system as the way to push to Trilinos. The only build that has to be right is the GCC 4.8.4 build to replace and protect the current CI build (see #2462). I may work to set up this CUDA build as a post-push CI build until running jobs on 'white' is stable enough for an auto PR build. But the problem is, when the post-push CUDA CI build breaks, who is going to make sure it gets cleaned up ASAP? Right now, that looks to be me so I am super motivated to get a CUDA build running as part of auto PR testing.

We need to write up the transition plan for moving to the auto PR system so there is no confusion about things like this.

@mhoemmen
Contributor

I will also invest time in Tpetra-related CUDA issues and other issues that my ATDM and Sierra customers care about.

@bartlettroscoe
Member Author

NOTE: The cuda-debug failures that occurred in the ATDM builds of Trilinos described in #2471 when KOKKOS_ENABLE_DEBUG=ON was set are more motivation for making this auto PR CUDA build a cuda-debug build and not a cuda-opt build. That is, we should have debug-mode checking enabled. This is not a performance build of Trilinos but a correctness build.

@bartlettroscoe
Member Author

Note that with #2471 now resolved, the only impediment to using the ATDM Trilinos CUDA-debug build on 'white' as an auto PR build is to get the bsub command to stop terminating early on 'white'. I am meeting with Nathan G. on the Test Bed team today to discuss this problem.

bartlettroscoe added a commit that referenced this issue Apr 4, 2018
…2464)

All three of these builds were 100% clean on the CDash "Specialized" Group
today.

Note that the build Trilinos-atdm-white-ride-cuda-debug being clean now means
that it can be used for automated PR testing for Trilinos.
bartlettroscoe added a commit that referenced this issue Apr 6, 2018
This will show how long it takes to do an all-at-once build and we can see if
'bsub' crashes in the middle of a long build or not (or if it just crashes in the
middle of running tests).

This is to see if this CUDA build on 'white' is a viable option for an auto PR
test build of Trilinos (see #2464).
@bartlettroscoe
Member Author

Status update ...

The Trilinos-atdm-white-ride-cuda-debug-all-at-once build is 100% clean and was promoted to the "ATDM" CDash Group/Track on 4/3/2018 and completed all 25 packages today.

I set up an all-at-once version of this build in:

and I fired it off to submit to CDash.

We will see how long an all-at-once build for this cuda-debug build takes.

searhein pushed a commit to searhein/Trilinos that referenced this issue Apr 10, 2018
…evelop

* 'develop' of https:/trilinos/Trilinos: (28 commits)
  MueLu: possible fix for issue trilinos#2340
  MueLu: fix warnings
  MueLu ParameterListInterpreter test: Check return value of sed call
  MueLu ParameterListInterpreter test: Update gold files for SC=complex
  ShyLU: doc/intro rm unneeded file
  ShyLU: Add template for intro guide
  MiniTensor: Minor cosmetic changes.
  Add all-at-once cuda-debug build for white (trilinos#2464)
  Allow Trilinos_TRACK to already be set in env (trilinos#2511, TRIL-171)
  Fixes issue with Teuchos::SerialDenseMatrix randomization in parallel
  Sacado:  Add explicit include for Cuda/Vectorization.hpp
  MueLu: And even working...
  MiniTensor: Added warning features, acceptable residual criteria, and stagnation checks
  MueLu: Additions compile now
  Promote newly passing builds to "ATDM" CDash Track/Group (TRIL-171, trilinos#2464)
  Set mpiexec --mca orte_abort_on_non_zero_status 0 ... (TRIL-198)
  Add ATDM_CONFIG_MPI_PRE_FLAGS (TRIL-198)
  Stokhos:  Add specialization of Belos::MultiVecTraits for Xpetra.
  Allow MPI_EXEC to be specialized by platform/build setup (TRIL-198)
  Use default value in ATDM_SET_ATDM_VAR_FROM_ENV_AND_DEFAULT() (TRIL-171)
  ...
bartlettroscoe added a commit that referenced this issue Apr 11, 2018
To make this happen, I had to copy ATDMDevEnv.cmake to
ATDMDevEnvSettings.cmake and take out the include of ATDMDisables.cmake.  I
then made the file ATDMDevEnv.cmake include ATDMDevEnvSettings.cmake and
ATDMDisables.cmake.

I then made the ctest -S driver respond to the cache var
ATDM_CONFIG_ENABLE_ALL_PACKAGE which results in allowing all of the Primary
Tested packages to be enabled.

The new build I added Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once
enables all Primary Tested packages in Trilinos, including the ones that
EMPIRE did not enable.  However, the configure of Trilinos for this build
currently fails with the error:

Processing enabled package: ShyLU_Node (Tacho, Tests, Examples)
CMake Error at packages/shylu/shylu_node/tacho/CMakeLists.txt:8 (MESSAGE):
  ShyLu/Tacho requires CUDA relocatable device code to be enabled if CUDA is
  enabled.  Set: Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON

This will need to be fixed after this starts submitting to CDash.
@bartlettroscoe
Member Author

bartlettroscoe commented Apr 11, 2018

FYI: I set up an all-at-once cuda-debug build for Trilinos for all 53 Primary Tested packages on 'white' and 'ride' submitting to CDash as the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once as shown at:

This currently fails the configure of ShyLU_Node as shown at:

showing:

Processing enabled package: ShyLU_Node (Tacho, Tests, Examples)
CMake Error at packages/shylu/shylu_node/tacho/CMakeLists.txt:8 (MESSAGE):
  ShyLu/Tacho requires CUDA relocatable device code to be enabled if CUDA is
  enabled.  Set: Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON

The current ATDM build of Trilinos does not enable ShyLU so those builds are not showing that configure failure.

But note that SPARC does use ShyLU_Node (or at least some of its subpackages get enabled).

Therefore, for now, I would recommend that we disable ShyLU_Node in this initial PT CUDA build of Trilinos targeted for PR testing (but not disable ShyLU_Node or anything else in the other auto PR builds). Getting something up is better than nothing for auto PR testing.

@william76 and @jwillenbring, do you agree?

@srajama1
Contributor

srajama1 commented Apr 11, 2018

@bartlettroscoe : Nope. Tacho is the primary test case for Kokkos tasking. It has exposed a lot of subtle issues before. I wouldn't disable it, but I would set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON, which is a requirement set by Kokkos. I believe we need tasking for 1 out of the 2 ATDM applications.

@ibaned
Contributor

ibaned commented Apr 11, 2018

@srajama1 there is a bug in the NVIDIA linker which prevents many Trilinos packages from compiling with relocatable device code on (with tests enabled). This will probably limit how much of Trilinos we are able to test in that configuration.

@srajama1
Contributor

@ibaned : I didn't know about this linker bug. Do you know which feature of Kokkos tasking requires relocatable device code?

@ibaned
Contributor

ibaned commented Apr 11, 2018

@srajama1 I have been told the entire system known as "Kokkos tasking" (e.g. the spawn-based system) requires relocatable device code. I don't know in detail what parts would break if we don't have it enabled. This is a difficult trade-off, but at the moment that is the situation. I think @rppawlo has built reasonable subsets of Trilinos with relocatable device code, but maybe not with tests enabled.

@srajama1
Contributor

Ah, I believe most of the use cases that are needed by this PR (PR testing on GPUs for ATDM apps) would have been covered by @rppawlo . I believe we should be able to use relocatable device code in that case, assuming tests work.

@bartlettroscoe
Member Author

As of now, the EMPIRE configuration of Trilinos (which this current ATDM Trilinos configuration is matching) does NOT set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON (hence the error shown above).

But it looks like some of the SPARC configurations of Trilinos do set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON as shown by:

$ cd <sparc-tpl-base-dir>/
$ grep -nH Kokkos_ENABLE_Cuda_Relocatable_Device_Code *.sh
do-cmake_trilinos_cee-gpu_cuda_gcc_openmpi.sh:158:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=OFF \
do-cmake_trilinos_ride-gpu_gcc_cuda_openmpi.sh:147:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON \
do-cmake_trilinos_shiller-gpu_gcc_cuda_openmpi.sh:157:   -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON \

@micahahoward,

Is SPARC really using relocatable device code with Kokkos on 'ride'? Does this work with all of the Trilinos packages currently used by SPARC?

@rppawlo and @nmhamster,

Is it worth trying to set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON in this full CUDA configuration of Trilinos which includes Phalanx, Panzer and other packages (that are not used by SPARC) on 'white'/'ride'?

@rppawlo
Contributor

rppawlo commented Apr 11, 2018

I have only built up to phalanx testing with relocatable device code enabled. It was for experimenting with the device DAG support for assembly in phalanx. I did not test panzer, Tpetra or the linear solver stack as this only involved assembly. We only came across one real build issue in sacado due to a static variable for the kokkos memory pool. We needed an ifdef to change the static declaration depending on whether RDC was enabled. Everything else seemed to work fine. I would give it a shot.

@micahahoward

Short answer: no on using RDC with SPARC.

We have issues with RDC in SPARC. I've backed this off in our Trilinos config but haven't pushed those changes to our sparc/Trilinos repo.

@bartlettroscoe
Member Author

CC: @srajama1

@trilinos/framework, please make sure the new CUDA PR build on 'ride' enables the ShyLU_DD package. As of today, it should be 100% clean in that CUDA 9.2 build on 'ride' (see #3541). We need the CUDA PR build to protect the ShyLU_DD package's CUDA build.

@bartlettroscoe
Member Author

FYI: While reviewing PR #4332, I just noticed that there is now a Trilinos_pullrequest_cuda_9.2-83 build as shown in #4332 (comment). This is a PR that changes Teuchos so it should test everything downstream in Trilinos. Looking at the configure output on CDash here, we can see the following explicit disables:

Explicitly disabled packages on input (by user or by default):  Claps SEACAS Trios Komplex TriKota Moertel PyTrilinos NewPackage 8

This shows that SEACAS is being disabled. Since SEACAS plays a critical role in ATDM, we need to get SEACAS enabled ASAP to protect ATDM and other important customers. This build should 100% match the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt which builds and tests SEACAS just fine as shown, for example, today here.

As I told @jwillenbring, I will look into what the problem with this CUDA PR build is and get it to match the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt and then create a PR to enable SEACAS again.

@ZUUL42 ZUUL42 mentioned this issue Feb 14, 2019
9 tasks
@bartlettroscoe
Member Author

FYI: I am working on fixing the problems in the Trilinos CUDA PR build. A few things I am seeing right away:

  • The KOKKOS_ARCH was wrong. It was setting 'Power9' when the correct arch is 'Power8,Kepler37'. (The KOKKOS_ARCH can be seen on CDash in configure output.) This likely explained the set of the Tpetra tests that get disabled in the PR build and perhaps why STK tests got disabled (also since BoostLib was disabled). (There are no failing Tpetra tests in the matching ATDM Trilinos PR build).

  • The wrong set of TPLs were enabled. (The correct set can be seen in the configure output on CDash.)

  • The libraries for Netcdf were using "${<TPL_NAME>_ROOT}" instead of "$ENV{<TPL_NAME>_ROOT}". The former does not read env vars in CMake, so they were not pointing to any real libraries on the system.
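
A quick shell check before configuring can catch the last problem early. This is only a sketch under assumptions: the function name check_tpl_roots and the TPL names passed to it are hypothetical, and it assumes each TPL location is exported as a <TPL_NAME>_ROOT environment variable as described in the bullet above.

```shell
# Sketch: verify that <TPL_NAME>_ROOT environment variables are set and point
# to real directories before running cmake. The function name and the TPL
# names given by the caller are hypothetical, not part of Trilinos.
check_tpl_roots() {
  status=0
  for tpl in "$@"; do
    var="${tpl}_ROOT"
    # Only accept plain identifier names before handing them to eval.
    case "$var" in
      [0-9]*|*[!A-Za-z0-9_]*)
        echo "BAD NAME: $var"
        status=1
        continue
        ;;
    esac
    eval "dir=\${$var:-}"   # read the value of the variable named by $var
    if [ -z "$dir" ]; then
      echo "MISSING: $var is not set"
      status=1
    elif [ ! -d "$dir" ]; then
      echo "BAD: $var='$dir' is not a directory"
      status=1
    else
      echo "OK: $var='$dir'"
    fi
  done
  return "$status"
}

# Example (hypothetical TPL names):
#   check_tpl_roots NETCDF HDF5 || echo "fix the environment before configuring"
```

The function returns non-zero if any variable is unset or does not name a directory, so it can gate a configure script.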

I will run the configures and builds and get these to match up. Once I have everything matched up and the correct set of test disables added, I will post a PR to merge in this configuration. I will also provide detailed instructions on how I did this so others can copy this process in the future for future PR builds.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 8, 2019
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 9, 2019
…ilinos config (trilinos#2464)

Several issues were fixed:

* The correct KOKKOS_ARCH is now set
* The correct set of TPLs is now enabled
* The TPL includes and libs are now set correctly
* Several other critical options were set
* Disables for already known failing tests were set

This was done by simply diffing the cmake STDOUT and CMakeCache.txt files.
Details will be provided in a comment in trilinos#2464.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 9, 2019
…ilinos#4551, trilinos#2464)

Has to be disabled for the CUDA PR build.  Note, before this, no STK tests
were being enabled at all.
@bartlettroscoe
Member Author

bartlettroscoe commented Mar 9, 2019

I just posted PR #4592 which fixes the CUDA PR build. It enables all of SEACAS and STK (and all their tests) and it enables all 160 Panzer BASIC tests. And they all pass (except for one recent known STK test failure with CUDA described in #4551). The process I used to fix the build was pretty simple and is described in detail below (so that others can follow a similar process in the future). With all the configure iterations, it took about 4 hours to complete this matching (mostly because the configure is slow on the NFS-mounted drive on 'ride'). Once I got the configure diffs to match up, the build and tests ran right out of the box (with the one expected failing STK test).


Details on the process to fix the CUDA PR build to match the ATDM Trilinos build

First, I set up the build dir:

/home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/cuda-9.2-gnu-7.2.0-release-debug-pt/

with the files load-env.sh:

source /home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh \
  cuda-9.2-gnu-7.2.0-release-debug-pt

and do-configure:

cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DKokkos_ENABLE_Profiling=ON \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_TEST_CATEGORIES=BASIC \
-DTrilinos_TRACE_ADD_TEST=ON \
-DDART_TESTING_TIMEOUT:STRING=600.0 \
-DTrilinos_ENABLE_CONFIGURE_TIMING=ON \
"$@" \
/home/rabartl/Trilinos.base/Trilinos

(NOTE: I set some of these options to better match the Trilinos CUDA PR build settings.)

I ran the base configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/cuda-9.2-gnu-7.2.0-release-debug-pt/

$ . load-env.sh
Hostname 'ride6' matches known ATDM host 'ride' and system 'ride'
Setting compiler and build options for buld name 'cuda-9.2-gnu-7.2.0-release-debug-pt'
Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37

$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON \
  &> configure.out

real    6m34.854s
user    2m56.180s
sys     1m19.912s

(NOTE: The Trilinos CUDA PR build always sets Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON, which is actually not desirable if you just want to reproduce your one-package build.)

I then set up a configure and build directory for the Trilinos CUDA PR configuration:

/home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

with files load-env.sh:

module purge
export WORKSPACE=/home/rabartl/Trilinos.base
source /home/rabartl/Trilinos.base/Trilinos/cmake/std/sems/PullRequestCuda9.2TestingEnv.sh

and do-configure:

cmake \
-GNinja \
-C /home/rabartl/Trilinos.base/Trilinos/cmake/std/PullRequestLinuxCuda9.2TestingSettings.cmake \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DTrilinos_TRACE_ADD_TEST=ON \
-DDART_TESTING_TIMEOUT:STRING=600.0 \
"$@" \
/home/rabartl/Trilinos.base/Trilinos

I ran the configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

real    4m58.255s
user    2m35.155s
sys     0m26.741s

I created the script create_normalized_cmake_output_files.sh:

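#!/bin/bash
#
# Create normalized copies of CMakeCache.txt and configure.out so that two
# build dirs can be diffed. Run from inside the build dir.

# Get build-dir name from argument
BUILD_DIR_NAME=$1
echo "BUILD_DIR_NAME='${BUILD_DIR_NAME}'"

set -x

# Strip blank lines and comment lines, then sort the cache entries
grep -v "^$" CMakeCache.txt | grep -v "^//" | grep -v "^#" | sort \
  > CMakeCache.normalized.txt

# Replace the build-dir name with a generic token in both files
~/Trilinos.base/Trilinos/commonTools/refactoring/token-replace.pl \
  "${BUILD_DIR_NAME}" GENERIC_BUILD_DIR \
   CMakeCache.normalized.txt CMakeCache.normalized.txt

~/Trilinos.base/Trilinos/commonTools/refactoring/token-replace.pl \
  "${BUILD_DIR_NAME}" GENERIC_BUILD_DIR \
   configure.out configure.normalized.out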
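
For anyone without token-replace.pl at hand, the same normalization idea can be sketched with only grep, sort, and sed. This is an assumption-laden stand-in, not the script used above: the function name normalize_cmake_cache is hypothetical, it reads the cache on stdin, and sed replaces the Perl tool.

```shell
# Sketch: normalize CMakeCache.txt content read from stdin.
# Usage: normalize_cmake_cache <build-dir-name> < CMakeCache.txt
# (hypothetical helper; sed stands in for token-replace.pl)
normalize_cmake_cache() {
  build_dir_name=$1
  # Escape characters that are special in a sed basic regular expression so
  # the build-dir name (which contains '.') is matched literally.
  escaped=$(printf '%s' "$build_dir_name" | sed 's/[][\\.*^$/]/\\&/g')
  # Drop blank lines and comment lines, sort, then replace the build-dir name.
  grep -v -e '^$' -e '^//' -e '^#' \
    | sort \
    | sed "s/$escaped/GENERIC_BUILD_DIR/g"
}
```

Escaping the name matters here because a build-dir name such as cuda-9.2-gnu-7.2.0-release-debug-pt contains dots, which an unescaped sed pattern would treat as wildcards.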

I then ran it as:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ cd cuda-9.2-gnu-7.2.0-release-debug-pt/

$ ../../../create_normalized_cmake_output_files.sh cuda-9.2-gnu-7.2.0-release-debug-pt

$ cd ..

$ cd pull-request-cuda-9.2/

$ ../../../create_normalized_cmake_output_files.sh pull-request-cuda-9.2

$ cd ..

I then compared the two sets of files with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ diff \
    cuda-9.2-gnu-7.2.0-release-debug-pt/configure.normalized.out \
    pull-request-cuda-9.2/configure.normalized.out \
  | less

$ diff \
    cuda-9.2-gnu-7.2.0-release-debug-pt/CMakeCache.normalized.txt \
    pull-request-cuda-9.2/CMakeCache.normalized.txt \
  | less

One difference I noted was:

74c50
< Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota NewPackage 4
---
> Explicitly disabled packages on input (by user or by default):  Claps Trios TriKota PyTrilinos NewPackage 5

It is not necessary to explicitly disable PyTrilinos in this context because it is not a Primary Tested package (so it would not get enabled). But that should be harmless as it will not impact the final set of enabled and non-enabled SE Packages and TPLs.
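
To pull out just this kind of difference without paging through the whole diff, something like the following could be used. It is a sketch only: the function names are hypothetical, and it assumes the "Explicitly disabled packages on input" line has the layout shown above, with the package count as the last field.

```shell
# Sketch: list packages that are explicitly disabled in the second configure
# output but not in the first. Function names are hypothetical.

# Print the explicitly disabled packages from a configure output file, one per
# line and sorted. The trailing field on that line is the count, so drop it.
disabled_packages() {
  grep 'Explicitly disabled packages on input' "$1" \
    | head -n 1 \
    | sed 's/^.*:[[:space:]]*//' \
    | tr -s ' ' '\n' \
    | sed '$d' \
    | LC_ALL=C sort
}

# Usage: only_disabled_in_second <first-configure.out> <second-configure.out>
only_disabled_in_second() {
  first=$(mktemp)
  second=$(mktemp)
  disabled_packages "$1" > "$first"
  disabled_packages "$2" > "$second"
  LC_ALL=C comm -13 "$first" "$second"   # lines unique to the second file
  rm -f "$first" "$second"
}
```

Run against the two configure.normalized.out files from the diff above, this would print only PyTrilinos.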

Using the script configure-pr-build-and-diff.sh:

#!/bin/bash

cd pull-request-cuda-9.2/

. load-env.sh

rm -r CMake*
time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

../../../create_normalized_cmake_output_files.sh pull-request-cuda-9.2

cd ..

diff \
  cuda-9.2-gnu-7.2.0-release-debug-pt/configure.normalized.out \
  pull-request-cuda-9.2/configure.normalized.out \
  | less

diff \
  cuda-9.2-gnu-7.2.0-release-debug-pt/CMakeCache.normalized.txt \
  pull-request-cuda-9.2/CMakeCache.normalized.txt \
  | less

I did several iterations of modifying the file PullRequestLinuxCuda9.2TestingSettings.cmake and running:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/

$ configure-pr-build-and-diff.sh

and carefully inspecting the diffs until I got them pretty close. The final diffs are shown in :

The diffs that remained should not impact what builds and what passes and fails.

I then did a full build and ran the test suite on 'ride' using the script run_all.sh:

#!/bin/bash -e
. load-env.sh
rm -r CMake* || echo "no CMake files to remove!"
time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out
time ninja -j64 &> make.out
time ctest -j8 &> ctest.out

and I ran this with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/

$ bsub -x -Is -q rhel7F -n 16 ./run_all.sh

***Forced exclusive execution
Job <854522> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on ride12>>
rm: cannot remove ‘CMake*’: No such file or directory
no CMake files to remove!

real    4m19.775s
user    2m36.281s
sys     0m36.791s

real    172m17.763s
user    8615m25.206s
sys     865m48.777s

That gave the test results:

99% tests passed, 1 tests failed out of 2936

Subproject Time Summary:
Amesos                    =  27.54 sec*proc (13 tests)
Amesos2                   =  35.31 sec*proc (8 tests)
Anasazi                   = 332.37 sec*proc (74 tests)
AztecOO                   =  26.23 sec*proc (17 tests)
Belos                     = 415.65 sec*proc (100 tests)
Domi                      = 232.67 sec*proc (125 tests)
Epetra                    =  85.45 sec*proc (63 tests)
EpetraExt                 =  26.28 sec*proc (10 tests)
FEI                       =  43.59 sec*proc (43 tests)
Galeri                    =  12.38 sec*proc (9 tests)
GlobiPack                 =   2.83 sec*proc (6 tests)
Ifpack                    =  99.66 sec*proc (48 tests)
Ifpack2                   = 363.30 sec*proc (45 tests)
Intrepid                  = 383.97 sec*proc (143 tests)
Intrepid2                 = 544.01 sec*proc (267 tests)
Isorropia                 =  13.20 sec*proc (6 tests)
Kokkos                    = 170.39 sec*proc (27 tests)
KokkosKernels             = 167.29 sec*proc (8 tests)
ML                        =  75.86 sec*proc (34 tests)
MiniTensor                =   3.52 sec*proc (2 tests)
MueLu                     = 2782.78 sec*proc (105 tests)
NOX                       = 290.93 sec*proc (106 tests)
OptiPack                  =   7.93 sec*proc (5 tests)
Panzer                    = 8737.40 sec*proc (163 tests)
Phalanx                   =  19.35 sec*proc (27 tests)
Pike                      =   3.77 sec*proc (7 tests)
Piro                      =  47.12 sec*proc (13 tests)
ROL                       = 1306.07 sec*proc (164 tests)
RTOp                      =  19.57 sec*proc (24 tests)
Rythmos                   =  68.87 sec*proc (83 tests)
SEACAS                    =  22.87 sec*proc (23 tests)
STK                       =  95.71 sec*proc (15 tests)
Sacado                    = 169.57 sec*proc (300 tests)
Shards                    =   1.42 sec*proc (4 tests)
ShyLU_DD                  = 300.03 sec*proc (37 tests)
Stokhos                   = 156.84 sec*proc (84 tests)
Stratimikos               =  37.45 sec*proc (39 tests)
Teko                      = 592.89 sec*proc (18 tests)
Tempus                    = 402.99 sec*proc (80 tests)
Teuchos                   = 161.02 sec*proc (137 tests)
Thyra                     = 102.25 sec*proc (82 tests)
Tpetra                    = 1158.23 sec*proc (201 tests)
TrilinosCouplings         =  32.43 sec*proc (22 tests)
TrilinosFrameworkTests    =   5.50 sec*proc (4 tests)
Triutils                  =   3.83 sec*proc (2 tests)
Xpetra                    = 262.52 sec*proc (18 tests)
Zoltan                    = 345.53 sec*proc (14 tests)
Zoltan2                   = 553.05 sec*proc (111 tests)

Total Test time (real) = 2654.55 sec

The following tests FAILED:
	2036 - STKUnit_tests_stk_ngp_test_utest_MPI_4 (Failed)
Errors while running CTest

See, we now have 23 SEACAS tests, 15 STK tests and 163 Panzer tests! Before there were only 60 Panzer tests as shown, for example, in this recent CUDA PR build and no SEACAS or STK tests.
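
Per-package test counts like these can be pulled straight out of ctest.out. The helper below is a sketch with a hypothetical name; it assumes the "Subproject Time Summary" line layout shown above.

```shell
# Sketch: print the number of tests ctest reported for one subproject.
# Usage: subproject_test_count <ctest.out> <SubprojectName>
# (hypothetical helper; relies on the "Name = <time> sec*proc (<N> tests)" layout)
subproject_test_count() {
  awk -v name="$2" '
    $1 == name && $2 == "=" {
      n = $(NF - 1)        # second-to-last field looks like "(163"
      sub(/^\(/, "", n)    # strip the leading parenthesis
      print n
      exit
    }' "$1"
}

# Example:
#   for p in SEACAS STK Panzer; do
#     echo "$p: $(subproject_test_count ctest.out "$p")"
#   done
```

Comparing these counts between the ATDM build and the PR build is a quick way to spot a package whose tests were silently not enabled.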

The only failing test was STKUnit_tests_stk_ngp_test_utest_MPI_4 which is already known to be failing as described in #4551. Therefore, I added a disable for that test as well. To check that disable I did a new configure with:

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CUDA/pull-request-cuda-9.2/
$ . load-env.sh
$ cmake . &> configure.reconfig.out

which showed:

$ grep STKUnit_tests_stk_ngp_test_utest_MPI_4 configure.reconfig.out 
-- STKUnit_tests_stk_ngp_test_utest_MPI_4: Added test (BASIC, NUM_MPI_PROCS=4, PROCESSORS=4)!

I then cleaned up the commits and created the PR #4592.

@bartlettroscoe bartlettroscoe added the ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams label Mar 9, 2019
@alanw0
Contributor

alanw0 commented Mar 9, 2019

As I said on the other issue, we will try to get a stk update in, to fix the failing test, asap.

@bartlettroscoe
Member Author

bartlettroscoe commented Mar 9, 2019

@trilinos/framework,

With more code being enabled in the CUDA PR build due to the changes in PR #4592, the build times have gone up a lot, as shown in the PR build #4592 on CDash. That shows a build time of 4h 49s! But the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt, which this PR build is supposed to duplicate, has shown build times of just under 3 hours over the last 3 weeks.

This must mean that the number of build processes is not correct. If you look at the Jenkins build for these at:

and look at the output for example at:

you can see:

03:03:07 -- CTEST_BUILD_FLAGS='-j64 -k 999999'
...
03:03:07 -- CTEST_PARALLEL_LEVEL='8'

So you want to set 64 build processes and 8 parallel ctest MPI processes (i.e. ctest -j8).
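
One rough way to check whether a build is really getting its 64 processes is to divide CPU time by wall-clock time from the `time` output. The sketch below uses hypothetical helper names and the ninja -j64 timings reported earlier in this thread (real 172m17.763s, user 8615m25.206s, sys 865m48.777s); the ratio is only an average number of busy cores, not a direct measurement of the -j level.

```shell
# Sketch: estimate average busy cores from bash `time` output.
# Helper names are hypothetical.

# Convert a value like "172m17.763s" to seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Usage: effective_parallelism <real> <user> <sys>
effective_parallelism() {
  real=$(to_seconds "$1")
  user=$(to_seconds "$2")
  sys=$(to_seconds "$3")
  awk -v r="$real" -v u="$user" -v s="$sys" \
    'BEGIN { printf "%.1f\n", (u + s) / r }'
}

# Timings from the manual run on 'ride' with ninja -j64:
effective_parallelism 172m17.763s 8615m25.206s 865m48.777s   # prints 55.0
```

A value well below the requested -j level on a comparable build would be consistent with the build process count not being applied.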

@bartlettroscoe
Member Author

CC: @dridzal, @rppawlo

@trilinos/framework,

The updated CUDA PR build in PR #4592 failed with 6 failing tests: 1 timing-out ROL test, 3 timing-out Panzer tests, and 2 failing Panzer tests that show CUDA allocation "out of memory" failures. This is likely due to using too high a parallel level with ctest -j<N>. From looking at the Jenkins output for this build here it showed:

Parallel level           = 29

If that means that it is using ctest -j29, that is way too high. This needs to be lowered to ctest -j8 as described above. That will increase the test wallclock time a little but it will result in all passing tests.

Can someone on the @trilinos/framework team please update this CUDA PR build to use 64 parallel build processes and only 8 parallel ctest processes on 'ride'? That should result in a total wall-clock time of a little over 4 hours in the worst case, whereas the CUDA PR build now looks like it takes almost 5 hours and results in failing and timing-out tests.
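
As an illustration of the requested change, a driver script could clamp whatever parallel level it computes to a per-machine cap before calling ctest. This is a sketch only: the function name is hypothetical and the thread does not show how the PR driver actually chooses its parallel level.

```shell
# Sketch: clamp a requested ctest parallel level to a per-machine cap.
# Usage: capped_parallel_level <requested> <cap>
# (hypothetical helper; not how the PR driver is known to work)
capped_parallel_level() {
  requested=$1
  cap=$2
  if [ "$requested" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$requested"
  fi
}

# Example with the values from this thread (29 requested, 8 wanted on 'ride'):
#   ctest -j"$(capped_parallel_level 29 8)"
```

Keeping the cap as a per-machine setting means a GPU node with limited device memory can use a low test level while the build level stays at 64.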

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 10, 2019
…igh (trilinos#2464)

Currently the Trilinos CUDA PR build is running on 'ride' with `ctest -j29`.
This causes tests to timeout and crash due to running out of CUDA memory.
This job needs to be reduced to only use `ctest -j8`.
trilinos-autotester added a commit that referenced this issue Mar 11, 2019
…r-build-config

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Fix CUDA PR build to enable SEACAS, STK, and 103 extra Panzer tests (#2464)
PR Author: bartlettroscoe
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Mar 11, 2019
…s:develop' (625e220).

* trilinos-develop:
  Temp disable some tests failing becuase ctest parallel level is too high (trilinos#2464)
  Ifpack2 - use KOKKOS_RESTRICT
  Ifpack2 - remove shadow warning
  Ifpack2 - add static inline to remove multiple definition of functions
  Disable known failing test STKUnit_tests_stk_ngp_test_utest_MPI_4 (trilinos#4551, trilinos#2464)
  WIP: Update CUDA PR build settings to correctly match working ATDM Trilinos config (trilinos#2464)
  No need to set new AAO features after cmake 3.10.0 upgrade (trilinos#1761)
  Ifpack2 - fix a typo trilinos#4388
  Ifpack2 - change vector loop
  Ifpack2 - check point for debugging
  Ifpack2 - put profilier stop at the beginning of test
  Ifpack2 - little bit of improvement on extract part
  Ifpack2 - remove unused impl
  KokkosBatched - ifpack2 need some new functions from updated kokkoskernels
  Ifpack2 - improvement on block spmv
  Ifpack2 - for jacobi solver, invert diagonals and solve with gemv
  Ifpack2 - improvement by using large team size
@bartlettroscoe
Member Author

bartlettroscoe commented Mar 29, 2019

@trilinos/framework, is this done done? Has the parallel test level been reduced to 8 (or so) and have all the temp disables been removed?

@bartlettroscoe
Member Author

This has been done for a while. The CUDA PR build looks to be one of the most robust PR builds being used. Closing as complete.

@nmhamster
Contributor

Hooray!
