Set up a CUDA build for an auto PR build #2464
Comments
FYI: We asked @nmhamster about using the Note that the target for this build should be the Other than setting up the Jenkins job and cleaning up any Trilinos failures with this setup, my biggest concern is the stability of the Jenkins jobs on 'white'. For example, if you look at the history of the nightly build only gets through all 25 of the packages (using the package-by-package method) about half the time. That is not great reliability for an auto PR build. Perhaps the all-at-once configure, build, test and submit will be more robust? Our nightly build will tell that story. |
I would like a CUDA build mainly to prevent people from breaking CUDA and walking away. On the other hand, I would rather have PR testing without CUDA, than no PR testing :-) . |
As we discussed at the meeting last Thursday, getting the CUDA build set up will not block moving to the new auto PR system as the way to push to Trilinos. The only build that has to be right is the GCC 4.8.4 to replace and protect the current CI build (see #2462). I may work to set up this CUDA build as a post-push CI build until running jobs on 'white' is stable enough for an auto PR build. But the problem is, when the post-push CUDA CI build breaks, who is going to make sure it gets cleaned up ASAP? Right now, that looks to be me so I am super motivated to get a CUDA build running as part of auto PR testing. We need to write up the transition plan for moving to the auto PR system so there is no confusion about things like this. |
I will also invest time in Tpetra-related CUDA issues and other issues that my ATDM and Sierra customers care about. |
NOTE: The |
Note that with #2471 now resolved, the only impediment to using the ATDM Trilinos CUDA-debug build on 'white' as an auto PR build is to get the |
…2464) All three of these builds were 100% clean on the CDash "Specialized" Group today. Note that the build Trilinos-atdm-white-ride-cuda-debug being clean now means that it can be used for automated PR testing for Trilinos.
This will show how long it takes to do an all-at-once build and we can see if 'bsub' crashes in the middle of a long build or not (or if it just crashes in the middle of running tests). This is to see if this CUDA build on 'white' is a viable option for an auto PR test build of Trilinos (see #2464).
Status update ... The I set up an all-at-once version of this build in: and I fired it off to submit to CDash. We will see how long an all-at-once build for this cuda-debug build takes. |
…evelop * 'develop' of https:/trilinos/Trilinos: (28 commits) MueLu: possible fix for issue trilinos#2340 MueLu: fix warnings MueLu ParameterListInterpreter test: Check return value of sed call MueLu ParameterListInterpreter test: Update gold files for SC=complex ShyLU: doc/intro rm unneeded file ShyLU: Add template for intro guide MiniTensor: Minor cosmetic changes. Add all-at-once cuda-debug build for white (trilinos#2464) Allow Trilinos_TRACK to already be set in env (trilinos#2511, TRIL-171) Fixes issue with Teuchos::SerialDenseMatrix randomization in parallel Sacado: Add explicit include for Cuda/Vectorization.hpp MueLu: And even working... MiniTensor: Added warning features, acceptable residual criteria, and stagnation checks MueLu: Additions compile now Promote newly passing builds to "ATDM" CDash Track/Group (TRIL-171, trilinos#2464) Set mpiexec --mca orte_abort_on_non_zero_status 0 ... (TRIL-198) Add ATDM_CONFIG_MPI_PRE_FLAGS (TRIL-198) Stokhos: Add specialization of Belos::MultiVecTraits for Xpetra. Allow MPI_EXEC to be specialized by platform/build setup (TRIL-198) Use default value in ATDM_SET_ATDM_VAR_FROM_ENV_AND_DEFAULT() (TRIL-171) ...
To make this happen, I had to copy ATDMDevEnv.cmake to ATDMDevEnvSettings.cmake and take out the include of ATDMDisables.cmake. I then made the file ATDMDevEnv.cmake include ATDMDevEnvSettings.cmake and ATDMDisables.cmake. I then made the ctest -S driver respond to the cache var ATDM_CONFIG_ENABLE_ALL_PACKAGE, which allows all of the Primary Tested packages to be enabled.

The new build I added, Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once, enables all Primary Tested packages in Trilinos, including the ones that EMPIRE did not enable. However, the configure of Trilinos for this build currently fails with the error:

    Processing enabled package: ShyLU_Node (Tacho, Tests, Examples)
    CMake Error at packages/shylu/shylu_node/tacho/CMakeLists.txt:8 (MESSAGE):
      ShyLu/Tacho requires CUDA relocatable device code to be enabled if CUDA is enabled. Set: Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON

This will need to be fixed after this starts submitting to CDash.
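The configure error above is a consistency rule between options: Tacho refuses to configure when CUDA is on but relocatable device code is off. A minimal sketch of a pre-flight check for that rule follows. It is not part of Trilinos. The function names are hypothetical, and the use of `TPL_ENABLE_CUDA` and `Trilinos_ENABLE_ShyLU_Node` as the keys that signal "CUDA is enabled" and "ShyLU_Node is enabled" is an assumption made for illustration. Only `Kokkos_ENABLE_Cuda_Relocatable_Device_Code` comes from the error message itself.

```python
# Hypothetical pre-flight check mirroring the Tacho configure error.
# Assumption: options are given as a dict of cache-variable name -> string value.

_TRUE_VALUES = {"ON", "YES", "TRUE", "Y", "1"}  # common CMake "true" constants


def is_on(options, name):
    """Return True if the named option is set to a CMake 'true' constant."""
    return str(options.get(name, "OFF")).strip().upper() in _TRUE_VALUES


def tacho_rdc_problem(options):
    """Return a message if the options would trip the Tacho check, else None."""
    cuda = is_on(options, "TPL_ENABLE_CUDA")  # assumed key name
    shylu_node = is_on(options, "Trilinos_ENABLE_ShyLU_Node")  # assumed key name
    rdc = is_on(options, "Kokkos_ENABLE_Cuda_Relocatable_Device_Code")
    if cuda and shylu_node and not rdc:
        return ("ShyLU_Node is enabled with CUDA but "
                "Kokkos_ENABLE_Cuda_Relocatable_Device_Code is not ON")
    return None


if __name__ == "__main__":
    example = {"TPL_ENABLE_CUDA": "ON", "Trilinos_ENABLE_ShyLU_Node": "ON"}
    print(tacho_rdc_problem(example))
```

A check like this only reports the conflict. Whether to resolve it by turning relocatable device code on or by disabling ShyLU_Node is the trade-off discussed in the comments that follow.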
FYI: I set up an all-at-once cuda-debug build for Trilinos for all 53 Primary Tested packages on 'white' and 'ride' submitting to CDash as the build This currently fails the configure of ShyLU_Node as shown at: showing:
The current ATDM build of Trilinos does not enable ShyLU so those builds are not showing that configure failure. But note that SPARC does use ShyLU_Node (or at least some of its subpackages get enabled). Therefore, for now, I would recommend that we disable ShyLU_Node in this initial PT CUDA build of Trilinos targeted for PR testing (but not disable ShyLU_Node or anything else in the other auto PR builds). Getting something up is better than nothing for auto PR testing. @william76 and @jwillenbring, do you agree? |
@bartlettroscoe : Nope. Tacho is the primary test case for Kokkos tasking. It has exposed a lot of subtle issues before. I wouldn't disable it, but I would set Kokkos_ENABLE_Cuda_Relocatable_Device_Code=ON which is a requirement set by Kokkos. I believe we need tasking for 1 out of the 2 ATDM applications. |
@srajama1 there is a bug in the NVIDIA linker which prevents many Trilinos packages from compiling with relocatable device code on (with tests enabled). This will probably limit how much of Trilinos we are able to test in that configuration. |
@ibaned : Didn't know this linker bug. Do you know which feature of Kokkos tasking requires relocatable device code ? |
@srajama1 I have been told the entire system known as "Kokkos tasking" (e.g. the spawn-based system) requires relocatable device code. I don't know in detail what parts would break if we don't have it enabled. This is a difficult trade-off, but at the moment that is the situation. I think @rppawlo has built reasonable subsets of Trilinos with relocatable device code, but maybe not with tests enabled. |
Ah, I believe most of the use cases that are needed by this PR (PR testing on GPUs for ATDM apps) would have been covered by @rppawlo . I believe we should be able to use relocatable device code in that case, assuming tests work. |
As of now, the EMPIRE configuration of Trilinos (which this current ATDM Trilinos configuration is matching) does NOT setting But it looks like some of the SPARC configurations of Trilinos do set
Is SPARC really using relocatable device code with Kokkos on 'ride'? Does this work with all of the Trilinos packages currently used by SPARC? @rppawlo and @nmhamster, Is it worth trying to set |
I have only built up to phalanx testing with relocatable device code enabled. It was for experimenting with the device DAG support for assembly in phalanx. I did not test panzer, Tpetra or the linear solver stack as this only involved assembly. We only came across one real build issue in sacado due to a static variable for the kokkos memory pool. We needed an ifdef to change the static declaration depending on whether RDC was enabled. Everything else seemed to work fine. I would give it a shot. |
Short answer: no on using RDC with SPARC. We have issues with RDC in SPARC. I've backed this off in our Trilinos config but haven't pushed those changes to our sparc/Trilinos repo. |
FYI: While reviewing PR #4332, I just noticed that there is now a
This shows that SEACAS is being disabled. Since SEACAS plays a critical role in ATDM, we need to get SEACAS enabled ASAP to protect ATDM and other important customers. This build should 100% match the build As I told @jwillenbring, I will look into what the problem with this CUDA PR build is and get it to match the build |
FYI: I am working on fixing the problems in the Trilinos CUDA PR build. A few things I am seeing right away:
I will run the configures and builds and get these to match up. Once I have everything matched up and the correct set of test disables added, I will post a PR to merge in this configuration. I will also provide detailed instructions on how I did this so others can copy this process in the future for future PR builds. |
…ilinos config (trilinos#2464) Several issues were fixed:
* The correct KOKKOS_ARCH is now set
* The correct set of TPLs is now enabled
* The TPL includes and libs are now set correctly
* Several other critical options were set
* Disables for already known failing tests were set

This was done by simply diffing the cmake STDOUT and CMakeCache.txt files. Details will be provided in a comment in trilinos#2464.
…ilinos#4551, trilinos#2464) Has to be disabled for the CUDA PR build. Note, before this, no STK tests were being enabled at all.
I just posted PR #4592 which fixes the CUDA PR build. It enables all of SEACAS and STK (and all their tests) and it enables all 160 Panzer BASIC tests. And they all pass (except for one recent known STK test failure with CUDA described in #4551). The process I used to fix the build was pretty simple and is described in detail below (so that others can follow a similar process in the future). With all the configure iterations, it took about 4 hours to complete this matching (mostly because the configure is slow in the NFS mounted drive on 'ride'). Once I got the configure diffs to match up, the build and tests ran right out of the box (with the one expected failing STK test).

Details on the process to fix the CUDA PR build to match the ATDM Trilinos build

First, I set up the build dir:
with the files
and
(NOTE: I set some of these options to better match the Trilinos CUDA PR build settings.) I ran the base configure with:
(NOTE: The Trilinos CUDA PR build always set I then set up a configure and build directory for the Trilinos CUDA PR configuration:
with files
and
I ran the configure with:
I created the script
I then ran it as:
I then compared the two sets of files with:
One difference I noted was:
It is not necessary to explicitly disable PyTrilinos in this context because it is not a Primary Tested package (so it would not get enabled). But that should be harmless as it will not impact the final set of enabled and non-enabled SE Packages and TPLs. Using the script
I did several iterations of modifying the file
and carefully inspecting the diffs until I got them pretty close. The final diffs are shown in : The diffs that remained should not impact what builds and what passes and fails. I then did a full build and ran the test suite on 'ride' using the script
and I ran this with:
That gave the test results:
See, we now have 23 SEACAS tests, 15 STK tests and 163 Panzer tests! Before there were only 60 Panzer tests as shown, for example, in this recent CUDA PR build and no SEACAS or STK tests. The only failing test was
which showed:
I then cleaned up the commits and created the PR #4592. |
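The matching process described in this comment comes down to comparing the CMakeCache.txt of a known-good configuration against the one under repair and inspecting the differences. The scripts actually used are not reproduced in this thread, so what follows is only a minimal sketch of the idea. The function names are hypothetical and the variable values in the example are made up for illustration.

```python
import re

# Matches CMakeCache.txt entries of the form NAME:TYPE=VALUE.
_ENTRY = re.compile(r"^([^#/:=][^:=]*):([A-Za-z]+)=(.*)$")


def parse_cmake_cache(text):
    """Return {variable_name: value} for each entry in CMakeCache.txt text."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' and '//' comment lines CMake writes.
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        match = _ENTRY.match(line)
        if match:
            name, _type, value = match.groups()
            entries[name] = value
    return entries


def diff_caches(reference, candidate):
    """Return sorted (name, reference_value, candidate_value) tuples for every
    variable whose value differs; a variable missing on one side is None."""
    names = set(reference) | set(candidate)
    return [(name, reference.get(name), candidate.get(name))
            for name in sorted(names)
            if reference.get(name) != candidate.get(name)]


if __name__ == "__main__":
    # Illustrative values only, not taken from the real builds.
    good = parse_cmake_cache("Trilinos_ENABLE_SEACAS:BOOL=ON\nKOKKOS_ARCH:STRING=ARCH_A\n")
    broken = parse_cmake_cache("Trilinos_ENABLE_SEACAS:BOOL=OFF\nKOKKOS_ARCH:STRING=ARCH_B\n")
    for name, ref_value, cand_value in diff_caches(good, broken):
        print(f"{name}: {ref_value!r} -> {cand_value!r}")
```

Comparing parsed name/value pairs rather than raw text keeps comment lines and entry ordering out of the diff, which makes each iteration of the "modify, reconfigure, re-diff" loop quicker to read. Differences in paths such as build directories would still show up and need to be ignored by eye.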
As I said on the other issue, we will try to get a stk update in, to fix the failing test, asap. |
@trilinos/framework, With more code being enabled in the CUDA PR build due to the changes in PR #4592 the build times have gone up a lot as shown in the PR build #4592 on CDash. That shows a build time of This must mean that the number of build processes is not correct. If you look at the Jenkins build for these at: and look at the output for example at: you can see:
So you want to set 64 build processes and 8 parallel ctest MPI processes (i.e. |
@trilinos/framework, The updated CUDA PR build in PR #4592 failed with 6 failing tests, 1 timing-out ROL test, 3 timing out Panzer tests and 2 failing Panzer tests that show CUDA allocation "out of memory" failures. This is likely due to using too high of a parallel level with
If that means that it is using Can someone on the @trilinos/framework team please update this CUDA PR build to use 64 parallel processes and only 8 parallel ctest processes on 'ride'? That will result in a total wall-clock time of a little over 4 hours in the worst case, whereas now the CUDA PR build looks like it takes almost 5 hours and results in failing and timing out tests. |
…igh (trilinos#2464) Currently the Trilinos CUDA PR build is running on 'ride' with `ctest -j29`. This causes tests to timeout and crash due to running out of CUDA memory. This job needs to be reduced to only use `ctest -j8`.
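The fix requested here is a cap on the ctest parallel level, independent of how many build processes are used. How the PR driver actually computes and passes that level is not shown in this thread, so the sketch below only illustrates the capping step. The function name is hypothetical, and the default cap of 8 is the value proposed above for 'ride'.

```python
def capped_ctest_command(requested_jobs, max_jobs=8):
    """Return a ctest argument list whose -j level never exceeds max_jobs.

    requested_jobs is the parallel level the driver would otherwise use,
    for example one derived from the number of cores on the node.
    """
    if requested_jobs < 1 or max_jobs < 1:
        raise ValueError("parallel levels must be at least 1")
    jobs = min(requested_jobs, max_jobs)
    return ["ctest", f"-j{jobs}"]


if __name__ == "__main__":
    # The level reported in this thread (29) would be reduced to the cap.
    print(" ".join(capped_ctest_command(29)))
```

Keeping the test cap separate from the build parallel level matters on GPU nodes: compilation scales with CPU cores, but each running CUDA test also needs device memory, so the test level has to be bounded by the GPUs rather than by the core count.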
…r-build-config Automatically Merged using Trilinos Pull Request AutoTester PR Title: Fix CUDA PR build to enable SEACAS, STK, and 103 extra Panzer tests (#2464) PR Author: bartlettroscoe
…s:develop' (625e220). * trilinos-develop: Temp disable some tests failing becuase ctest parallel level is too high (trilinos#2464) Ifpack2 - use KOKKOS_RESTRICT Ifpack2 - remove shadow warning Ifpack2 - add static inline to remove multiple definition of functions Disable known failing test STKUnit_tests_stk_ngp_test_utest_MPI_4 (trilinos#4551, trilinos#2464) WIP: Update CUDA PR build settings to correctly match working ATDM Trilinos config (trilinos#2464) No need to set new AAO features after cmake 3.10.0 upgrade (trilinos#1761) Ifpack2 - fix a typo trilinos#4388 Ifpack2 - change vector loop Ifpack2 - check point for debugging Ifpack2 - put profilier stop at the beginning of test Ifpack2 - little bit of improvement on extract part Ifpack2 - remove unused impl KokkosBatched - ifpack2 need some new functions from updated kokkoskernels Ifpack2 - improvement on block spmv Ifpack2 - for jacobi solver, invert diagonals and solve with gemv Ifpack2 - improvement by using large team size
@trilinos/framework, is this done done? Has the parallel test level been reduced to 8 (or so) and have all the temp disables been removed? |
This has been done for a while. The CUDA PR build looks to be one of the most robust PR builds being used. Closing as complete. |
Hooray! |
CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott, @nmhamster
Description
This Issue is to scope out and track efforts to set up a CUDA build of Trilinos to be used as an auto PR build as described in #2317 (comment).
For this build it was agreed to use the ATDM build on `white` that is currently running and submitting to CDash. Questions about how to extend this build to be used as an auto PR build include:

- … `bsub` command robust enough to be a reliable PR build?

Tasks:

- … `white` until it is 100% clean [Done]

Related Issues: