New failing tests in ATDM debug builds of Trilinos due to KOKKOS_ENABLE_DEBUG=ON being set #2471

Closed
bartlettroscoe opened this issue Mar 28, 2018 · 17 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Amesos2 pkg: Anasazi pkg: Kokkos pkg: KokkosKernels pkg: Panzer type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 28, 2018

CC: @trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/amesos2 , @trilinos/anasazi, @trilinos/panzer

Next Action Status

PR #2476 fixed two of the tests on 3/30/2018, and PR #2494 disabled a single unit test on 4/3/2018 that is not appropriate to run on GPUs.

Description

As shown in the query:

several tests are timing out today and failing in the ATDM -debug builds of Trilinos:

  • Amesos2_KLU2_UnitTests_MPI_2
  • Anasazi_Epetra_ModalSolversTester_MPI_4
  • KokkosCore_UnitTest_Cuda_MPI_1
  • KokkosKernels_sparse_cuda_MPI_1
  • PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4

The failing tests, and the platforms on which they fail according to the above query, are listed in the table below:

Table of failing tests (click to expand)
| Site | Build Name | Test Name | Status | Time | Details |
|---|---|---|---|---|---|
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.09 | Completed (Timeout) |
| hansen | Trilinos-atdm-hansen-shiller-gnu-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 8.51 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-gnu-debug-serial | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.51 | Completed (Timeout) |
| hansen | Trilinos-atdm-hansen-shiller-intel-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 2.39 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-intel-debug-serial | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.1 | Completed (Timeout) |
| ride | Trilinos-atdm-white-ride-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.05 | Completed (Timeout) |
| white | Trilinos-atdm-white-ride-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.04 | Completed (Timeout) |
| white | Trilinos-atdm-white-ride-gnu-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 1.31 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | Anasazi_Epetra_ModalSolversTester_MPI_4 | Failed | 0.84 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 0.71 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 103.19 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 213.56 | Completed (Failed) |
| white | Trilinos-atdm-white-ride-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 213.74 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 16.49 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 2.43 | Completed (Failed) |
| white | Trilinos-atdm-white-ride-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 2.39 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 21.95 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 24.31 | Completed (Failed) |
| white | Trilinos-atdm-white-ride-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 24.9 | Completed (Failed) |

Except for the failing Anasazi tests in the build Trilinos-atdm-white-ride-cuda-debug (which I will write another GitHub issue for), all of these tests (even the timeouts) seem to be failing because the debug-mode checks enabled by KOKKOS_ENABLE_DEBUG=ON (see #2439) are failing and throwing exceptions. For example, the failing test Amesos2_KLU2_UnitTests_MPI_2 shows:

5. KLU2_double_int_int_NonContgGID_UnitTest ... 
 
 p=0: *** Caught standard std::exception of type 'std::runtime_error' :
 
  View bounds error of view MV::DualView ( -1 < 6 , 0 < 1 )
  Traceback functionality not available
  
 [FAILED]  (0.0885 sec) KLU2_double_int_int_NonContgGID_UnitTest
 Location: /home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/amesos2/test/solvers/KLU2_UnitTests.cpp:383

This exception causes a hang and a timeout in some cases, and a quick failure and abort in others. (So much for assuming that one MPI process throwing an exception will bring down an MPI job in all cases.)
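
For readers unfamiliar with this kind of check, here is a minimal sketch of how a debug-mode bounds check can produce a message shaped like the one in the log above. This is not Kokkos code: `CheckedView2D` and `try_access` are hypothetical names made up for illustration, and the real checks in Kokkos are implemented differently. The sketch only shows why an index of -1 into a view of extent 6 x 1 is rejected with a `std::runtime_error` when checking is on.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a labeled 2-D view with debug-mode bounds checks.
class CheckedView2D {
 public:
  CheckedView2D(std::string label, std::size_t n0, std::size_t n1)
      : label_(std::move(label)), n0_(n0), n1_(n1), data_(n0 * n1, 0.0) {
    assert(n0 > 0 && n1 > 0);  // illustrative precondition only
  }

  // Signed indices on purpose: a failed lookup that returns -1 is caught here
  // instead of silently reading memory before the start of the allocation.
  double& operator()(long i, long j) {
    const bool in_bounds = i >= 0 && j >= 0 &&
                           static_cast<std::size_t>(i) < n0_ &&
                           static_cast<std::size_t>(j) < n1_;
    if (!in_bounds) {
      std::ostringstream msg;
      msg << "View bounds error of view " << label_ << " ( " << i << " < "
          << n0_ << " , " << j << " < " << n1_ << " )";
      throw std::runtime_error(msg.str());
    }
    return data_[static_cast<std::size_t>(i) * n1_ +
                 static_cast<std::size_t>(j)];
  }

 private:
  std::string label_;
  std::size_t n0_, n1_;
  std::vector<double> data_;
};

// Hypothetical helper: returns "ok" or the text of the bounds error.
std::string try_access(CheckedView2D& v, long i, long j) {
  try {
    v(i, j) = 1.0;
    return "ok";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

With a view constructed as `CheckedView2D v("MV::DualView", 6, 1)`, calling `try_access(v, -1, 0)` returns the same text as the log line above. Without the check, the same access would be undefined behavior, which would explain why these tests appeared to pass before debug checking was enabled.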

Many of these builds have been promoted to the "ATDM" CDash group/track and therefore triggered CDash error emails today. Therefore, this must get fixed quickly if possible (or we will need to demote these builds again).

Steps to Reproduce

One can log onto white (SON) or ride (SRN) and then reproduce the build and tests as described at:

I just reproduced many of these failures on 'white' using

$ ssh white

$ cd ~/rilinos.base/BUILD/WHITE/CHECKIN/

$ bsub -x -I -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-debug --enable-packages=Kokkos,KokkosKernels,Amesos2,Panzer --local-do-all

...

FAILED (NOT READY TO PUSH): Trilinos: white22

Wed Mar 28 11:03:05 MDT 2018

Enabled Packages: Kokkos, KokkosKernels, Amesos2, Panzer

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) cuda-debug => FAILED: passed=189,notpassed=5 => Not ready to push! (120.13 min)


REQUESTED ACTIONS: FAILED

This showed the test results:

  97% tests passed, 5 tests failed out of 194
  
  Subproject Time Summary:
  Amesos2          = 1232.97 sec*proc (8 tests)
  Kokkos           = 954.94 sec*proc (26 tests)
  KokkosKernels    = 870.46 sec*proc (8 tests)
  Panzer           = 7490.79 sec*proc (152 tests)
  
  Total Test time (real) = 1518.22 sec
  
  The following tests FAILED:
  	  2 - KokkosCore_UnitTest_Cuda_MPI_1 (Failed)
  	 28 - KokkosKernels_sparse_cuda_MPI_1 (Failed)
  	 35 - Amesos2_KLU2_UnitTests_MPI_2 (Timeout)
  	174 - PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 (Timeout)
  	192 - PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 (Failed)
  Errors while running CTest
  
  Total time for cuda-debug = 120.13 min

The timeout of the test PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 was also seen in #2446. I am not sure why that test timed out when run locally but not in the driver jobs. Otherwise, this one build reproduced all of the failing tests shown on CDash except for the test Anasazi_Epetra_ModalSolversTester_MPI_4 (which does not look to be related to KOKKOS_ENABLE_DEBUG=ON).

Related Issues

@mhoemmen
Contributor

I'm OK with the Anasazi test getting disabled on this platform. As far as I know, ATDM customers don't even need Anasazi.

@ndellingwood
Contributor

I'll disable the failing Amesos2 KLU tests when KOKKOS_ENABLE_DEBUG is on, for now, until a fix is ready. Configuring/building right now.

@bartlettroscoe
Member Author

The details of the failing tests:

  • Amesos2_KLU2_UnitTests_MPI_2
  • KokkosCore_UnitTest_Cuda_MPI_1
  • KokkosKernels_sparse_cuda_MPI_1
  • PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4

are shown in the details below. All of these failures look to be caused by enabling KOKKOS_ENABLE_DEBUG=ON, and these are the only builds of Trilinos shown on CDash today that seem to trigger these failures.

Looking at these failures, and at the new commits pulled as shown at:

I don't see any commits to Kokkos itself that would account for these new failures. Therefore, I think that the option Kokkos_ENABLE_Debug_Bounds_Check:BOOL=ON that the EMPIRE build of Trilinos is setting does not actually enable any Kokkos debug-mode checking. Perhaps it did in the past, but the refactoring of the Kokkos CMake setup as part of #1400 likely changed the name of this variable.

@trilinos/amesos2, @trilinos/kokkos, @trilinos/kokkos-kernels, and @trilinos/panzer developers,

Can we get these failures cleaned up pretty quickly? If we don't, we are going to spam developers with CDash error emails every day (which can't happen). If we can't get these cleaned up by, say, tomorrow night, I can demote these debug builds back to the "Specialized" track for now. Then we can work to get these cleaned up offline. Let me know.

DETAILS (click to expand)

Now to dig into these failing tests to see why they fail and to confirm that enabling KOKKOS_ENABLE_DEBUG=ON is what causes them to fail.

A) Amesos2_KLU2_UnitTests_MPI_2:

https://testing.sandia.gov/cdash/testDetails.php?test=45969470&build=3469152

shows the failure:

5. KLU2_double_int_int_NonContgGID_UnitTest ... 
 
 p=0: *** Caught standard std::exception of type 'std::runtime_error' :
 
  View bounds error of view MV::DualView ( -1 < 6 , 0 < 1 )
  Traceback functionality not available
  
 [FAILED]  (0.465 sec) KLU2_double_int_int_NonContgGID_UnitTest
 Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/amesos2/test/solvers/KLU2_UnitTests.cpp:383

Looking at:

the only other build in which this test is failing is Linux-gcc-5.3.0-OPENMPI-1.8.7_RELEASE_KOKKOS-REFACTOR_EXPERIMENTAL_CUDA-8.0.44.

B) KokkosCore_UnitTest_Cuda_MPI_1:

https://testing.sandia.gov/cdash/testDetails.php?test=45967252&build=3469077

shows the failure:

[ RUN      ] cuda.triple_nested_parallelism
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelReduce< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.

Looking at the query:

this test only runs in the ATDM builds of Trilinos and only fails in the cuda-debug builds of Trilinos.

C) KokkosKernels_sparse_cuda_MPI_1:

https://testing.sandia.gov/cdash/testDetails.php?test=45967928&build=3469089

shows the failure:

[ RUN      ] cuda.sparse_gauss_seidel_double_int_int_TestExecSpace
:0: : block: [2,0,0], thread: [0,58,0] Assertion `View bounds error of view PermutationVector` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorAssert): device-side assert triggered /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

Looking at the query:

this test only runs in the ATDM builds of Trilinos and only fails in the cuda-debug builds of Trilinos.

D) PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4:

https://testing.sandia.gov/cdash/testDetails.php?test=45973239&build=3469238

shows the failure:

******* WARNING *******
Hierarchy::ReplaceCoordinateMap: matrix and coordinates maps are same, skipping...
Using default factory (MueLu::AmalgamationFactory) for building 'UnAmalgamationInfo'.
Level 0
 Setup Smoother (MueLu::Ifpack2Smoother{type = RELAXATION})
:0: : block: [0,0,0], thread: [0,51,0] Assertion `View bounds error of view PermutationVector` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorAssert): device-side assert triggered /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

Looking at the query:

this test runs in many builds of Trilinos but only fails in the ATDM cuda-debug builds of Trilinos.

ndellingwood added a commit to ndellingwood/Trilinos that referenced this issue Mar 28, 2018
Temporary fix to address issue trilinos#2471 by disabling failing tests when
KOKKOS_ENABLE_DEBUG=ON until failure is triaged and fixed.
ndellingwood added a commit that referenced this issue Mar 28, 2018
Temporary fix to address issue #2471 by disabling failing tests when
KOKKOS_ENABLE_DEBUG=ON until failure is triaged and fixed.
@ndellingwood
Contributor

@bartlettroscoe PR #2472 merged - disables the failing Amesos2 KLU tests when debugging is enabled.

@ibaned
Contributor

ibaned commented Mar 28, 2018

The KokkosCore failure is probably due to a kernel using more registers than available on a K80. It doesn't fail on a Titan X. The KokkosKernels failure is due to a bug in Kokkos::Sort. I've got a fix for it, working on testing and pushing.

@bartlettroscoe
Member Author

The KokkosCore failure is probably due to a kernel using more registers than available on a K80. It doesn't fail on a Titan X.

How do we fix that? Note that when you turn the debug check off, it seems to run just fine. Is this a false check for this system?

The KokkosKernels failure is due to a bug in Kokkos::Sort. I've got a fix for it, working on testing and pushing.

Thanks!

@bartlettroscoe
Member Author

So that just leaves the failing Panzer test PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 shown at:

showing the debug checking failure:

 Setup Smoother (MueLu::Ifpack2Smoother{type = RELAXATION})
:0: : block: [0,0,0], thread: [0,51,0] Assertion `View bounds error of view PermutationVector` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorAssert): device-side assert triggered /home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

@rppawlo or @jmgate, is there some Panzer developer that can look into fixing this debug-mode check failure?

@ibaned
Contributor

ibaned commented Mar 28, 2018

@bartlettroscoe I suspect the Panzer failure has the same root cause as the KokkosKernels failure

@bartlettroscoe
Member Author

I suspect the Panzer failure has the same root cause as the KokkosKernels failure

@ibaned, so that means that your fix and push will likely fix that Panzer test too then? Great! Thanks!

That means that all of these failures will likely get cleaned up pretty soon then.

@ibaned
Contributor

ibaned commented Mar 29, 2018

@bartlettroscoe The fix is in pull request #2476

@bartlettroscoe
Member Author

Looks like the merge of PR #2476 fixed the failing tests KokkosKernels_sparse_cuda_MPI_1 and PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 for the build Trilinos-atdm-hansen-shiller-cuda-debug today, as shown at:

But it did not fix the failing test KokkosCore_UnitTest_Cuda_MPI_1, as shown at:

As a reminder, that test shows the failure:

[ RUN      ] cuda.triple_nested_parallelism
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelReduce< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[  FAILED  ] cuda.triple_nested_parallelism (29 ms)
[...]
[  PASSED  ] 116 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] cuda.triple_nested_parallelism

 1 FAILED TEST

@ibaned, any idea how to fix this last failing test? Why does this test pass just fine when KOKKOS_ENABLE_DEBUG=ON is not set such as in the companion build Trilinos-atdm-hansen-shiller-cuda-opt with results shown at:

?

@bartlettroscoe
Member Author

The KokkosCore failure is probably due to a kernel using more registers than available on a K80.

How do we get this Kokkos kernel to stop doing that?

Note that this test also fails in the same way on 'white' as shown at:

which shows:

[ RUN      ] cuda.triple_nested_parallelism
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelReduce< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[  FAILED  ] cuda.triple_nested_parallelism (41 ms)

Does the Kokkos team run this same unit test on other CUDA platforms and does it pass there? If so, where can we see the results for that?

@mhoemmen
Contributor

mhoemmen commented Apr 1, 2018

How do we get this Kokkos kernel to stop doing that?

We could disable the test for K80 architectures....

@ibaned
Contributor

ibaned commented Apr 2, 2018

@swbova just saw this fail on P100s as well with KOKKOS_DEBUG on.

ibaned added a commit that referenced this issue Apr 2, 2018
This test requests a hardcoded number of
32 CUDA threads per warp, but with debugging
enabled the CUDA kernel uses too many registers
and can only run on 16 threads per warp max.
[kokkos/kokkos#1514, kokkos/kokkos#1513, #2471]
@ibaned
Contributor

ibaned commented Apr 2, 2018

Pull request #2494 should fix the failing Kokkos unit test.
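
As a rough illustration of the register-pressure explanation in the commit message above, the following sketch shows how a fixed per-block register budget can turn a hardcoded team size into a "requested too large team size" error once a debug build makes the kernel use more registers per thread. Everything here is an assumption for illustration: the function names are hypothetical, the register counts (65536 per block, 64 versus 128 per thread) are made-up numbers rather than measurements from a K80 or P100, and the actual Kokkos check is implemented differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: largest block size (in threads) that fits in the
// per-block register budget, rounded down to a whole number of warps and
// capped at a hardware limit on threads per block.
std::size_t max_threads_per_block(std::size_t regs_per_block,
                                  std::size_t regs_per_thread,
                                  std::size_t warp_size = 32,
                                  std::size_t hw_limit = 1024) {
  assert(regs_per_thread > 0 && warp_size > 0);
  std::size_t by_regs = regs_per_block / regs_per_thread;
  by_regs -= by_regs % warp_size;  // keep only whole warps
  return std::min(by_regs, hw_limit);
}

// Hypothetical check in the spirit of the error seen in the test log:
// a team of team_size members, each vector_length lanes wide, must fit
// in one block.
void check_team_size(std::size_t team_size, std::size_t vector_length,
                     std::size_t max_threads) {
  if (team_size * vector_length > max_threads) {
    throw std::runtime_error("requested too large team size");
  }
}

// Convenience wrapper that reports whether the request would be accepted.
bool team_fits(std::size_t team_size, std::size_t vector_length,
               std::size_t max_threads) {
  try {
    check_team_size(team_size, vector_length, max_threads);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Under these assumed numbers, a kernel using 64 registers per thread allows 1024 threads per block, so a team size of 32 with vector length 32 fits. If debug checks double the usage to 128 registers per thread, the limit drops to 512 threads, and only a team size of 16 fits. A test that hardcodes 32 would then fail only in the debug build, which is consistent with the behavior reported in this issue.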

@bartlettroscoe
Member Author

bartlettroscoe commented Apr 3, 2018

The test KokkosCore_UnitTest_Cuda_MPI_1 is now shown newly passing in the builds:

Therefore, I believe that this issue is resolved.

I am not going to mark this issue with the new "Disabled Tests" label because this was a very targeted disable, and this one unit test just does not seem to be written in a way that works on GPUs with debug checking turned on. Therefore, I am just going to close this.

Now every Trilinos user and developer will have Kokkos debug-mode checking turned on by default when they configure with -DTrilinos_ENABLE_DEBUG=ON or -DCMAKE_BUILD_TYPE=DEBUG.

@ibaned, thanks for all of your help in getting these failures cleaned up!

Closing as complete!
