
New Kokkos, KokkosKernels, and Panzer test failures on CUDA 8.0 and CUDA 9.0 builds after Kokkos and KokkosKernels update #2827

Closed
bartlettroscoe opened this issue May 26, 2018 · 70 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Intrepid2 pkg: Kokkos pkg: KokkosKernels pkg: Panzer type: bug The primary issue is a bug in Trilinos code or tests



bartlettroscoe commented May 26, 2018

CC: @trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/panzer, @ndellingwood

Next Action Status

Kokkos, KokkosKernels, and Panzer failing and timing-out tests have been fixed by PRs #2863, #2874, #2927, and #2964. No Panzer, Kokkos, or KokkosKernels failures were observed on 6/19 or 6/20/2018.

Description

The Kokkos and KokkosKernels updates in the recent commits 51cb7c5 and 816e703:

51cb7c5:  Merge branch 'develop' into kokkos-promotion
Author: ndellingwood <[email protected]>
Date:   Thu May 24 23:55:26 2018 -0600

816e703:  Snapshot of kokkos-kernels.git from commit 1a7b524ba38fdfab6c1058065af06cbcb4a2ce6f
Author: Nathan Ellingwood <[email protected]>
Date:   Thu May 24 23:30:27 2018 -0600

seem to have triggered several new test failures and timeouts in the Kokkos, KokkosKernels, and Panzer packages.

The new failing and timing-out tests are:

Test Status Details
KokkosContainers_UnitTest_Serial_MPI_1 Failed Completed (Timeout)
KokkosCore_UnitTest_Cuda_MPI_1 Failed Completed (Failed)
KokkosKernels_sparse_serial_MPI_1 Failed Completed (Timeout)
PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-2 Failed Completed (Failed)
PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-3 Failed Completed (Failed)
PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4 Failed Completed (Failed)
PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2 Failed Completed (Failed)
PanzerAdaptersSTK_PoissonExample-ConvTest-Quad-Order-3 Failed Completed (Failed)
PanzerAdaptersSTK_PoissonExample-ConvTest-Quad-Order-4 Failed Completed (Failed)
PanzerAdaptersSTK_PoissonExample-ConvTest-Tri-Order-3 Failed Completed (Failed)
PanzerAdaptersSTK_PoissonExample-ConvTest-Tri-Order-4 Failed Completed (Failed)

which failed in one or more of the unique builds:

  • Trilinos-atdm-hansen-shiller-cuda-8.0-debug
  • Trilinos-atdm-hansen-shiller-cuda-8.0-opt
  • Trilinos-atdm-white-ride-cuda-debug
  • Trilinos-atdm-white-ride-cuda-opt

These are all basically CUDA 8.0 builds.

These commits were shown as pulled in on this testing day.

Steps to Reproduce

Most failures are produced by the Trilinos-atdm-white-ride-cuda-debug build on 'white' and 'ride', so that is likely the best bet to use to reproduce these failures. After logging into 'white' or 'ride', cloning the Trilinos Git repo (pointed to by TRILINOS_DIR), and checking out the 'develop' branch, one would do:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Kokkos=ON \
  -DTrilinos_ENABLE_KokkosKernels=ON \
  -DTrilinos_ENABLE_Panzer=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Kokkos pkg: KokkosKernels client: ATDM Any issue primarily impacting the ATDM project labels May 26, 2018
@bartlettroscoe

@trilinos/kokkos

The failing test KokkosCore_UnitTest_Cuda_MPI_1 shows the failure:

[ RUN      ] cuda.triple_nested_parallelism
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelReduce< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[  FAILED  ] cuda.triple_nested_parallelism (29 ms)

This test fails this way for the builds:

  • Trilinos-atdm-hansen-shiller-cuda-8.0-debug
  • Trilinos-atdm-white-ride-cuda-debug

on 'hansen', 'white', and 'ride'.

This sounds a lot like the prior failure reported and addressed in #2471. Does this one unit test just need to be disabled on these machines as well?
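
(For context, a test can query the maximum supported team size instead of hard-coding one; below is a minimal sketch with an illustrative functor, noting that the exact team_size_max signature has shifted across Kokkos versions.)

```cpp
#include <Kokkos_Core.hpp>

// Illustrative functor, not the real triple_nested_parallelism test.
struct DummyReduce {
  KOKKOS_INLINE_FUNCTION
  void operator()(const Kokkos::TeamPolicy<>::member_type& team,
                  double& sum) const {
    sum += team.league_rank();
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Query the largest team size this hardware/functor pair supports
    // instead of hard-coding one (instance-method form from newer Kokkos).
    Kokkos::TeamPolicy<> query_policy(32, 1);
    const int max_team =
        query_policy.team_size_max(DummyReduce(), Kokkos::ParallelReduceTag());

    double result = 0.0;
    Kokkos::parallel_reduce(Kokkos::TeamPolicy<>(32, max_team),
                            DummyReduce(), result);
  }
  Kokkos::finalize();
  return 0;
}
```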

@bartlettroscoe

@trilinos/panzer

The PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-2 test times out in the 'debug' builds but fails in the 'opt' builds with the output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-8.0-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

[hansen04:30492] *** Process received signal ***
[hansen04:30492] Signal: Aborted (6)
[hansen04:30492] Signal code:  (-6)

The same failure is shown for the other failing Panzer tests in the 'opt' builds.

Is this an error in Kokkos or an error in the way that Panzer is using Kokkos?

@bartlettroscoe

@trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/panzer,

Should this Kokkos and KokkosKernels update be backed out or can it be fixed pretty quickly?

@ndellingwood

@bartlettroscoe I'm rebuilding a cuda8 debug build with your config instructions. Cuda debug builds are especially slow, so I may not be able to dig too far into this until tomorrow; I'm not sure from the CDash output exactly what is causing the error in the Panzer examples.

As far as the Kokkos and KokkosKernels tests go, they should be disabled for the debug builds. The nested parallelism test runs into problems with GPU resources (I don't recall the specifics; I'll have to dig back through the issues where this was discussed for a reminder and reference), and the serial spgemm test is just really slow, and even worse in debug mode. I'm not sure how running the tests is wired into the testing harness or scripts, but adding gtest_filter=-cuda.triple_nested_parallelism for the KokkosCore_UnitTest_Cuda_MPI_1 test, gtest_filter=-serial.UnorderedMap_failed_insert for the KokkosContainers_UnitTest_Serial_MPI_1 test, and gtest_filter=-serial.sparse_spgemm_double_int_size_t_TestExecSpace for the KokkosKernels_sparse_serial_MPI_1 test will disable the appropriate tests within their exe files. If there is another issue where this was done for Kokkos I can try to pattern match, or if it is pretty easy for you to disable these, that would be very helpful.
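
For reference, a minimal sketch of what such a filter amounts to inside a gtest executable (illustrative only; in practice the filter is passed as a --gtest_filter argument on the test command line):

```cpp
#include <gtest/gtest.h>

// Running an executable with this main is equivalent to passing
// --gtest_filter on the command line.  A leading '-' excludes the
// matching tests; ':' would separate multiple patterns.
int main(int argc, char* argv[]) {
  ::testing::InitGoogleTest(&argc, argv);
  ::testing::GTEST_FLAG(filter) = "-cuda.triple_nested_parallelism";
  return RUN_ALL_TESTS();
}
```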

@bartlettroscoe

@ndellingwood, I will see about adding those disables to the KokkosCore and KokkosContainers tests.

The other issue is the failing Panzer tests showing:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-8.0-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

[hansen04:30492] *** Process received signal ***
[hansen04:30492] Signal: Aborted (6)
[hansen04:30492] Signal code:  (-6)

Who should debug those?

@ndellingwood

@bartlettroscoe I've been able to gather some info on one of the Panzer failures by enabling Panzer's examples in a cuda build I had on White (a release build). I have a separate debug build going based on the config info you provided. I'm not able to use xterm on White to run cuda-gdb for the 4-MPI-proc tests, which makes pinning down the issue (and thus who gets to help fix it) a bit of work.

Running this failing test PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-3:

  • First I started an interactive session
    bsub -Is -n 4 -q rhel7F bash
  • Ran the job like this:
    mpiexec -np 4 ./PanzerAdaptersSTK_CurlLaplacianExample.exe --use-epetra --use-twod --cell="Quad" --x-elements=4 --y-elements=4 --z-elements=4 --basis-order=3
    (I avoided running ctest with multiple jobs, and all the extra binding info that requires, to make it easier to run through cuda-gdb, which looks like it won't work after all.)

Looking at the debug stack trace and adding a few print statements in the Intrepid2 file packages/intrepid2/src/Orientation/Intrepid2_OrientationToolsDefCoeffMatrix_HCURL.hpp before/after line 297, it looks like this test is dying in the Kokkos::deep_copy call: print statements added before the deep_copy report the sizes used to create the subviews (nothing weird there), but a print statement indicating successful completion of the deep_copy never appears; instead, the std::runtime_error and cudaDeviceSynchronize() error (cudaErrorIllegalAddress) occur.

Here is the potential view_copy culprit in Kokkos (I added newlines to separate the function arguments):
void Kokkos::Impl::view_copy<
    Kokkos::View<double**, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Cuda, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> >,
    Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> > >(
  Kokkos::View<double**, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Cuda, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> > const&,
  Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> > const&)
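
To make that concrete, here is a rough, hypothetical reduction of the pattern (extents and names invented; the real call goes through Intrepid2's subviews, and the AnonymousSpace in the trace suggests the views lost their concrete space types along the way):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Device-side destination: a rank-2 slice of a rank-3 UVM view, which
    // has LayoutStride -- like the first view type in the trace above.
    Kokkos::View<double***, Kokkos::CudaUVMSpace> matData("matData", 4, 9, 9);
    auto dst = Kokkos::subview(matData, 0, Kokkos::ALL(), Kokkos::ALL());

    // Host-side source, LayoutLeft -- like the second view type.
    Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace> src("src", 9, 9);

    // A deep_copy like this dispatches to Kokkos::Impl::view_copy over the
    // two view types.  In the failing case, the trace suggests the copy was
    // run on the device against memory that was not device-accessible,
    // which would produce the cudaErrorIllegalAddress seen above.
    Kokkos::deep_copy(dst, src);
  }
  Kokkos::finalize();
  return 0;
}
```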

@crtrott could the recent ETI changes have resulted in different behavior than before? I started going through the changes but haven't made it far enough to reconcile them with the earlier code. In any case, it's not clear to me why this would be choking in the MDRangePolicy parallel_for now but not before the change.

I probably won't be able to do much more tonight; I can pick up tomorrow.


mperego commented May 27, 2018

@ndellingwood thanks for looking into this. Let me and Kyungjoo know if you need help with that.

@ndellingwood

Thanks @mperego ! The issue is manifesting in Intrepid2 through a deep_copy call, if it is an actual Intrepid2 issue and not Kokkos I will check with you and Kyungjoo.

I did not have much time to look further into this, but here are a couple of additional pieces of info I've gathered for reference:
This is dying in the ViewCopy call of Kokkos_CopyViews.hpp at line 684: DstExecCanAccessSrc is true and iterate is Kokkos::Iterate::Left.

I added some print statements to the operator() that this ends up calling, to report indices etc. during the element-wise copy, and they were never triggered. I did this to rule out the "an illegal memory access was encountered" message being caused by out-of-bounds indices during the copy. So the problem seems to be a memory space issue, even though fences are placed before and after the parallel_for call (see line 462 of Kokkos_CopyViews.hpp). A couple of things to check next: ETI for deep_copy with UVM and AnonymousSpace...
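
For reference, Kokkos's public SpaceAccessibility trait answers this kind of question; a standalone sketch (this is not the internal ViewCopy machinery, just the public equivalent of the DstExecCanAccessSrc check):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

// With UVM, the Cuda execution space reports that it can access the
// source memory, so the element-wise copy gets dispatched to the device.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  std::printf("Cuda -> CudaUVMSpace accessible: %d\n",
      (int)Kokkos::SpaceAccessibility<Kokkos::Cuda,
                                      Kokkos::CudaUVMSpace>::accessible);
  std::printf("Cuda -> HostSpace accessible:    %d\n",
      (int)Kokkos::SpaceAccessibility<Kokkos::Cuda,
                                      Kokkos::HostSpace>::accessible);
  Kokkos::finalize();
  return 0;
}
```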


crtrott commented May 29, 2018

Why did this stuff not trigger in our integration builds? Do we have any idea?

@ndellingwood

Panzer examples are not enabled in the integration builds.

@ndellingwood

We can add this atdm cuda-dbg build as part of integration testing from here on.


ibaned commented May 29, 2018

Actually, I recommend we replace our current CUDA integration configuration with one or more of these ATDM configurations.

@ndellingwood

Since shepard will be taken out of service soon, we should use the ATDM configurations for those builds as well.

@bartlettroscoe

What is interesting is that the CUDA 9.0 build on 'hansen' does not show any failing tests for Kokkos or KokkosKernels.

However, we don't have results for Panzer tests because the Jenkins builds timed out before they could run. (I am addressing that in #2832.)

Are Kokkos and KokkosKernels tested with CUDA 8.0?

@ndellingwood

Kokkos and KokkosKernels are both tested with Cuda 8.0 and 9.0 as individual packages, but only 8.0 was used during the Trilinos Integration testing.


ibaned commented May 29, 2018

One cause of timeouts is running CUDA tests in parallel with CTest: you can have multiple tests competing for GPU resources and running out of memory where running the tests individually would not. UVM will exacerbate this.


crtrott commented May 29, 2018

Ok, another issue is that some of those DEBUG failures only happen on Kepler, while our debug builds of Kokkos proper happen on Pascal or Volta. I guess we need more testing...

@bartlettroscoe

@ibaned,

One cause of timeouts is running CUDA tests in parallel with CTest: you can have multiple tests competing for GPU resources and running out of memory where running the tests individually would not. UVM will exacerbate this.

I think @rppawlo confirmed that that is happening. But these tests passed in the debug build before this update. I will try to disable the specific unit tests as recommended by @ndellingwood above. But since these pass with CUDA 9.0 I will try to disable them only for the CUDA 8.0 builds.


ibaned commented May 29, 2018

I think it would be more appropriate not to use CTest parallelism for CUDA builds, until the parallel test system is sophisticated enough to associate each test with a different GPU. This will entirely prevent non-deterministic failures due to resource exhaustion.


bartlettroscoe commented May 29, 2018

I think it would be more appropriate not to use CTest parallelism for CUDA builds, until the parallel test system is sophisticated enough to associate each test with a different GPU. This will entirely prevent non-deterministic failures due to resource exhaustion.

We could experiment to see, but I think the increase in wall-clock time would be pretty substantial. Also, not every Trilinos test uses CUDA.

@mhoemmen

@bartlettroscoe CUDA builds are special; I'm OK if tests take longer. I want deterministic tests :D

@eric-c-cyr

I've been following this conversation with some interest. I believe the solution of turning off certain tests in select builds is fine (I would prefer smaller tests, but...). I feel that running with "-O0 -g" is important given the design choices being made in Kokkos, Tpetra, and up: with the heavy use of templates and inlining, things are hard to follow unless nothing is optimized out. Being able to walk through a complete stack is useful (also, some people use C-style asserts...).

@mhoemmen

I'm with @eric-c-cyr -- I think it could help to have at least one Dashboard build with -O0 -g. It's helpful to know if a test fails with -O3 but not with -O0. I think we could do with fewer -O0 builds, though -- maybe just one for each supported computer architecture, using its preferred compiler (e.g., Intel for KNL, xlC for IBM POWER).

@srajama1

Ok, we will take a look at the sparse tests in Kokkos Kernels.

bartlettroscoe added a commit that referenced this issue Jun 12, 2018
…os-kernels-unit-tests

Selective disables of a few individual unit tests in Kokkos and KokkosKernels for the cuda-debug build on 'white' and 'ride'.  Should address the remaining failures for #2827.
@bartlettroscoe

FYI: PR #2927 was just merged. That should address the last of the Kokkos and KokkosKernels failures/timeouts. We should get confirmation for the ATDM Trilinos builds run on 6/13/2018 and then we can close this issue.

@bartlettroscoe

There were some more timed-out Kokkos and KokkosKernels tests shown here, which have not timed out in the last two days after my commit e6a1b58 was merged in PR #2927 on 6/12/2018.

I went back and did a more detailed analysis of the Kokkos and KokkosKernels test suites for debug builds over the last week, looking for timeouts or near-timeouts. From that detailed analysis of CDash data (which took me about 3 hours to complete), I would like to suggest we disable the following additional unit tests:

  • KokkosContainers_UnitTest_Serial_MPI_1: Add disable of unit test serial.scatterview for build Trilinos-atdm-white-ride-gnu-debug-openmp ...

  • KokkosContainers_UnitTest_OpenMP_MPI_1: Add disable of unit test openmp.UnorderedMap_failed_insert for the build Trilinos-atdm-white-ride-gnu-debug-openmp ...

  • KokkosKernels_graph_serial_MPI_1: Add disable of unit test serial.graph_graph_color_d2_double_int_int_TestExecSpace (but not the size_t unit test) for the builds Trilinos-atdm-white-ride-gnu-debug-openmp and Trilinos-atdm-white-ride-cuda-debug ...

  • KokkosKernels_sparse_openmp_MPI_1: Add disables of unit tests openmp.sparse_block_gauss_seidel_double_int_int_TestExecSpace and openmp.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace (but not the size_t unit tests) for build Trilinos-atdm-white-ride-gnu-debug-openmp ...

  • KokkosKernels_sparse_serial_MPI_1: Add disables of individual unit tests serial.sparse_block_gauss_seidel_double_int_int_TestExecSpace and serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace (but not the size_t unit tests) for the builds Trilinos-atdm-hansen-shiller-gnu-debug-serial, Trilinos-atdm-hansen-shiller-intel-debug-serial, Trilinos-atdm-hansen-shiller-intel-debug-openmp, and Trilinos-atdm-hansen-shiller-cuda-8.0-debug. Also, add back the unit test serial.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace and remove serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace for better test coverage ...

What is important to note is that this would only disable the int individual unit test of each int/size_t pair of KokkosKernels unit tests; that way we are still testing the algorithm for the size_t type. And these disables are just for the debug builds, not the 'opt' builds, where these tests run plenty fast. Therefore the 'opt' builds test everything.

I will create a PR with these very targeted individual unit test disables.

DETAILED NOTES:

Starting today, 6/15/2018, we are seeing some new Kokkos and KokkosKernels test timeouts in debug (i.e. -DCMAKE_BUILD_TYPE=DEBUG -DTrilinos_ENABLE_DEBUG=ON) ATDM Trilinos builds, as shown in this query. For the build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white', we are seeing the most timeouts:

  • KokkosContainers_UnitTest_Serial_MPI_1
  • KokkosKernels_graph_serial_MPI_1
  • KokkosKernels_sparse_openmp_MPI_1

As I write this, full test results are not in for all of the Trilinos ATDM builds, but we are also seeing the test KokkosKernels_sparse_openmp_MPI_1 timing out in the builds:

  • Trilinos-atdm-hansen-shiller-gnu-debug-serial on 'hansen'
  • Trilinos-atdm-white-ride-gnu-debug-openmp

These same tests were timing out in the Trilinos-atdm-white-ride-cuda-debug build on 'white' and I disabled some individual unit tests that were causing that in the commit e6a1b58 but only for the build Trilinos-atdm-white-ride-cuda-debug.

It looks like we are going to need to disable these more expensive individual unit tests in other 'debug' builds as well. But first let's look at the history for these tests over the last few days:

  • This query for KokkosContainers_UnitTest_Serial_MPI_1 in build Trilinos-atdm-white-ride-gnu-debug-openmp between 6/9/2018 and 6/16/2018 shows that this test timed out every time it was run on 'white' and 'ride', but it only ran a total of 4 times in the last 7 days (when it could have run 14 times between the two of them). This was likely due to the 'bsub' command crashing. These runs all timed out while running the unit test serial.scatterview. So this individual unit test just needs to be disabled for this full debug build.

  • This query for KokkosKernels_graph_serial_MPI_1 in build Trilinos-atdm-white-ride-gnu-debug-openmp between 6/9/2018 and 6/16/2018 shows that this test timed out all 4 times it was run on 'white' and 'ride' (it should have run 14 times but did not, due to 'bsub' crashes). The test output shown here shows that the individual unit test serial.graph_graph_color_d2_double_int_int_TestExecSpace took 332090 ms, which is 5.5 minutes, and that the test timed out while running the individual unit test serial.graph_graph_color_d2_double_int_size_t_TestExecSpace. Looking at other output for a debug build where this test passed but took 356s to run (shown here), the expensive individual unit tests were serial.graph_graph_color_d2_double_int_int_TestExecSpace at 142071 ms and serial.graph_graph_color_d2_double_int_size_t_TestExecSpace at 183708 ms. Therefore, I think we need to disable one of these to avoid this timeout. It seems like we should disable serial.graph_graph_color_d2_double_int_int_TestExecSpace, since size_t is a larger unsigned int and therefore the size_t case might be the more important one to keep running.

  • This query for KokkosKernels_sparse_openmp_MPI_1 in build Trilinos-atdm-white-ride-gnu-debug-openmp between 6/9/2018 and 6/16/2018 shows that this test timed out all 4 times it was run on 'white' and 'ride' over the last 7 days (where it should have run 14 times). The test output shown here shows that the test times out while running the individual unit test openmp.sparse_spgemm_double_int_int_TestExecSpace. But there were two expensive individual unit tests: openmp.sparse_block_gauss_seidel_double_int_int_TestExecSpace at 244744 ms (4.07 minutes) and openmp.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace at 215745 ms (3.6 minutes). Looking at another debug build, Trilinos-atdm-hansen-shiller-intel-debug-openmp on 'hansen', where this unit test executable takes 6m 15s to complete (shown here), we see that the most expensive individual unit tests are openmp.sparse_block_gauss_seidel_double_int_int_TestExecSpace at 94035 ms, openmp.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace at 94580 ms, openmp.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace at 23618 ms, and openmp.sparse_trsv_mv_double_int_size_t_LayoutLeft_TestExecSpace at 29083 ms. These are int and size_t versions of the same tests. So let's disable the int versions and run the size_t versions in this debug build.

  • This query for KokkosKernels_sparse_serial_MPI_1 in build Trilinos-atdm-white-ride-gnu-debug-openmp between 6/9/2018 and 6/16/2018 shows that this test timed out all 4 times it was run on 'white' and 'ride' over the last week (when it should have run 14 times). As shown in the test output here, the test timed out while running the individual unit test serial.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace, but the other unit test serial.sparse_block_gauss_seidel_double_int_int_TestExecSpace was also quite expensive at 426904 ms (7.12 min). Looking at other builds where that test takes a long time to run but does not time out, such as the Trilinos-atdm-hansen-shiller-intel-debug-openmp build shown here, which finished in 8m 14s, the most expensive individual unit tests were serial.sparse_block_gauss_seidel_double_int_int_TestExecSpace at 128795 ms, serial.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace at 133103 ms, serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace at 29002 ms, and serial.sparse_trsv_mv_double_int_size_t_LayoutLeft_TestExecSpace at 29028 ms.
    Therefore, let's disable the 'int' versions of these tests and keep the 'size_t' versions. That should let this finish in under 10 minutes easily.

  • This query for KokkosKernels_sparse_serial_MPI_1 in build Trilinos-atdm-hansen-shiller-gnu-debug-serial between 6/9/2018 and 6/16/2018 shows that the max test runtime was 7m 900ms over the last week. Therefore, it is not hard to believe that this might time out due to pinning to the same cores in an unfortunate way. As shown here, that test timed out while running the individual unit test serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace. But there were some pretty expensive individual unit tests: serial.sparse_block_gauss_seidel_double_int_int_TestExecSpace at 160393 ms (2.67 min) and serial.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace at 224691 ms (3.74 min). Looking at the test output from 6/9/2018 (here), where this unit test executable did not time out but took 7m 900ms to complete, we can see that the most expensive individual unit tests were serial.sparse_block_gauss_seidel_double_int_int_TestExecSpace at 105627 ms, serial.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace at 118046 ms, serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace at 19836 ms, and serial.sparse_trsv_mv_double_int_size_t_LayoutLeft_TestExecSpace at 19730 ms. Therefore, let's disable the 'int' versions of these tests and keep the 'size_t' versions. That should let this finish in under 10 minutes easily.

I don't think any changes to Trilinos would have impacted these builds on 'white' or 'ride'. What seems to have caused these timeouts to show up today, but not the last couple of days, is that we did not get test results on 'white' or 'ride' due to the 'bsub' command crashing (as it does about half of the time, and has for the last 4 months) and the fact that 'ride' was offline for a while.

As for the builds on 'hansen', what changed today is that these builds are now properly running on the 'hansen' compute nodes 'hansen02'-'hansen04' using srun instead of on the login node 'hansen01' using salloc (see TRIL-211 and commit 22ec935 for details). It must be that the compute nodes on 'hansen' run these tests slower than the login node 'hansen01'. (But we were getting other timeout problems from running on the login node, so we had to switch back to using srun instead of salloc.)

Now to do a more thorough search for which KokkosKernels tests might be in trouble over the last week, to try to make sure that I identify which tests in which builds are timing out (or are very close to timing out) and need to have these individual unit tests disabled.

This query shows all of the KokkosKernels tests for all of the 'debug' builds between 6/9/2018 and 6/15/2018. If you sort by test name then "Proc Time" you see the following:

  • KokkosKernels_blas_cuda_MPI_1 takes no longer than 1m 19s 180ms in any build => Not a problem

  • KokkosKernels_blas_openmp_MPI_1 takes no longer than 3m 46s 350ms in any build => Not a problem

  • KokkosKernels_blas_serial_MPI_1 takes up to 5+ minutes in some builds => Not a problem yet but worth watching

  • KokkosKernels_common_cuda_MPI_1 takes up to 7+ minutes in some builds => Not a problem yet but worth watching

  • KokkosKernels_common_openmp_MPI_1 takes up to 3+ minutes in some builds => Not a problem

  • KokkosKernels_common_serial_MPI_1 takes up to 3+ minutes in some builds => Not a problem

  • KokkosKernels_graph_cuda_MPI_1 takes up to 1+ minutes in some builds => Not a problem

  • KokkosKernels_graph_openmp_MPI_1 takes up to 8+ minutes in some builds => Not a problem yet but worth watching

  • KokkosKernels_graph_serial_MPI_1:

    • Times out every time in the build Trilinos-atdm-white-ride-gnu-debug-openmp => Will have serial.graph_graph_color_d2_double_int_int_TestExecSpace for test KokkosKernels_graph_serial_MPI_1 disabled.
    • Took 7+ minutes in the build Trilinos-atdm-white-ride-cuda-debug today => We should likely disable the expensive individual unit tests for this build as well.
  • KokkosKernels_sparse_cuda_MPI_1 takes up to 4.5+ minutes in some builds => Not a problem yet but worth watching

  • KokkosKernels_sparse_openmp_MPI_1:

    • Trilinos-atdm-white-ride-gnu-debug-openmp: Times out in every build => Will disable unit tests openmp.sparse_block_gauss_seidel_double_int_int_TestExecSpace and openmp.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace
    • Trilinos-atdm-hansen-shiller-intel-debug-openmp: Takes 6m40+s in some builds => Not a problem now but should watch
    • All other builds completed in under 3.5 minutes
  • KokkosKernels_sparse_serial_MPI_1:

    • Trilinos-atdm-white-ride-gnu-debug-openmp: Timed out several times => Will have some unit tests disabled.
    • Trilinos-atdm-hansen-shiller-gnu-debug-serial: Timed out once on 6/15/2018 => Will have some unit tests disabled
    • Trilinos-atdm-white-ride-cuda-debug: Timed out several times => Already had some unit tests disabled
    • Trilinos-atdm-hansen-shiller-intel-debug-serial: Took 9m 14s on 6/11/2018 => Should have the two unit tests disabled to be safe
    • Trilinos-atdm-hansen-shiller-intel-debug-openmp: Took 8m 58s on 6/14/2018 => Should have the two unit tests disabled to be safe
    • Trilinos-atdm-hansen-shiller-cuda-8.0-debug: Took 8m 36s on 6/15/2018 => Should have the two unit tests disabled to be safe
    • Trilinos-atdm-hansen-shiller-cuda-9.0-debug: Took 7m 58s on 6/14/2018 => This is close but let's leave it alone so that we have full testing on CUDA for at least one build (and CUDA 9.0 is more important than CUDA 8.0)

Now let's examine the most expensive Kokkos tests over this time period and look for trouble in this query:

  • KokkosContainers_UnitTest_Serial_MPI_1:

    • Trilinos-atdm-white-ride-cuda-debug: Times out several times => Already has the serial.bitset:serial.scatterview unit tests disabled
    • Trilinos-atdm-hansen-shiller-gnu-debug-serial: Takes 7m 28s to complete => Okay for now.
    • Every other build completes in 6m 9s or less.
  • KokkosContainers_UnitTest_OpenMP_MPI_1:

    • Trilinos-atdm-white-ride-gnu-debug-openmp: Takes upwards of 9m 30s to complete => Should disable the expensive unit test openmp.UnorderedMap_failed_insert (102188 ms) shown here
  • Every other Kokkos test completes in 5 minutes or less.

So with that analysis complete, I think that we should add the following additional individual unit test disables:

  • KokkosContainers_UnitTest_Serial_MPI_1: Add disable of unit test serial.scatterview for the build Trilinos-atdm-white-ride-gnu-debug-openmp ...

  • KokkosContainers_UnitTest_OpenMP_MPI_1: Add disable of unit test openmp.UnorderedMap_failed_insert for the build Trilinos-atdm-white-ride-gnu-debug-openmp ...

  • KokkosKernels_graph_serial_MPI_1: Add disable of unit test serial.graph_graph_color_d2_double_int_int_TestExecSpace for the builds Trilinos-atdm-white-ride-gnu-debug-openmp and Trilinos-atdm-white-ride-cuda-debug ...

  • KokkosKernels_sparse_openmp_MPI_1: Add disables of unit tests openmp.sparse_block_gauss_seidel_double_int_int_TestExecSpace and openmp.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace for build Trilinos-atdm-white-ride-gnu-debug-openmp ...

  • KokkosKernels_sparse_serial_MPI_1: Add disables of individual unit tests serial.sparse_block_gauss_seidel_double_int_int_TestExecSpace and serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace for the builds Trilinos-atdm-hansen-shiller-gnu-debug-serial, Trilinos-atdm-hansen-shiller-intel-debug-serial, Trilinos-atdm-hansen-shiller-intel-debug-openmp, and Trilinos-atdm-hansen-shiller-cuda-8.0-debug. Also, add back the unit test serial.sparse_block_gauss_seidel_double_int_size_t_TestExecSpace and remove serial.sparse_trsv_mv_double_int_int_LayoutLeft_TestExecSpace for better test coverage ...

@mhoemmen

I do wish we would get rid of the int and size_t instantiations, and just have a ptrdiff_t instantiation, but that's a different issue....

@bartlettroscoe

I do wish we would get rid of the int and size_t instantiations, and just have a ptrdiff_t instantiation, but that's a different issue....

@mhoemmen, Yes! Just one global ordinal type that is guaranteed to be 64-bit on a 64-bit machine and will be a signed type, so you don't need to worry about strange behavior from computing < 0. How do we make that happen?
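
(The "< 0" gotcha with unsigned types, for the record, in a trivial standalone example:)

```cpp
#include <cstddef>
#include <cstdio>

// With an unsigned ordinal such as size_t, "did we go below zero?" can
// never be detected after the fact: the subtraction wraps instead.
int main() {
  std::size_t i = 0;
  --i;            // wraps around to SIZE_MAX rather than becoming -1
  if (i < 0) {    // always false for an unsigned type (compilers warn here)
    std::printf("never reached\n");
  }
  std::printf("i = %zu\n", i);  // prints a huge number
  return 0;
}
```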


crtrott commented Jun 16, 2018

One word on the test run times: quite a few of the longer-running tests run that long because we have to catch non-deterministic error sources (e.g. race conditions). There is no 100% reliable way of doing this, but in most cases the likelihood of catching one is some kind of asymptotic thing (e.g. doubling the runtime catches the next 50% of errors). -O0 and debug both kill all the inlining and add bounds checking, so every data access gets exorbitantly expensive. So I think the only way of handling this is turning off tests.

@bartlettroscoe

@crtrott, race conditions and other non-deterministic behavior are not the only defects that can exist. There are also off-by-one errors, incorrect memory deallocation, and other invalid usage that can be caught with debug-mode runtime checking but may not be caught with a fully optimized build (with debug checking disabled). Therefore, it would be good if we could run the entire test suite in a full debug-mode build as well to catch those types of errors, but perhaps use smaller arrays and fewer iterations (i.e. not trying to catch non-deterministic failures, just trying to catch these other types of failures). Could these unit test executables be given some type of command-line argument that could be used to reduce the size of arrays or the number of iterations? That way, a full debug-mode build could run them at reduced cost. Could that be supported? There are just a few problematic tests that would need to be addressed. I think we are losing testing by just disabling tests, but at the same time, we can't have individual tests that take 20+ minutes to run.
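
As a rough sketch of what I mean (hypothetical helper and environment variable names, not a proposed final interface):

```cpp
#include <algorithm>
#include <cstdlib>

#include <gtest/gtest.h>

namespace {

// Hypothetical helper: full problem size in optimized builds, a reduced
// size in debug builds (further tunable via an environment variable).
int test_problem_size(int full_size) {
#ifdef NDEBUG
  return full_size;  // optimized build: keep the full-size test
#else
  const char* divisor_str = std::getenv("UNIT_TEST_SIZE_DIVISOR");
  const int divisor = divisor_str ? std::atoi(divisor_str) : 10;
  return std::max(1, full_size / std::max(1, divisor));
#endif
}

TEST(serial, sparse_block_gauss_seidel_scaled) {
  const int n = test_problem_size(10000);
  // ... run the existing Gauss-Seidel test on an n-row problem ...
  EXPECT_GT(n, 0);
}

}  // namespace
```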

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 16, 2018
…some debug builds (trilinos#2827)

These very targeted disables should allow these tests to all complete in well
under 10 minutes in all of these debug builds on all of these platforms.  See
the diffs to see exactly what unit tests are disabled in what unit test
executables in what builds on what platforms.  For details on why these are
being disabled, see trilinos#2827.
@mhoemmen

@bartlettroscoe wrote:

Yes! Just one global ordinal type that is guaranteed to be 64 bit on a 64 bit machine and and will be a signed type so you don't need to worry about strange behavior from computing < 0. How do we make that happen?

Tpetra has explicitly declared its intention to deprecate and remove support for all GlobalOrdinal types other than int64_t. This will imply changes to downstream packages, as well as to applications.

Tpetra uses the size_type typedef in a Kokkos::View to determine the offset type in sparse matrix-vector multiply. This typedef is int for CUDA Views, and size_t for all other Views. Sometimes people make an argument for using 32-bit offsets, but the sparse matrix-vector multiply kernel need not and does not do 64-bit integer arithmetic per matrix entry. (ptr[lclRow+1] - ptr[lclRow] should always fit in int, as long as the local number of entries in any row of the matrix fits in int.) Thus, there is no reason to require the offset type to be anything other than a 64-bit type.

The smarter thing for kokkos-kernels to do would be only to instantiate for the type that Tpetra uses. Currently, this is the default offset type, but in the future, we plan to change Tpetra's offset type to int64_t or ptrdiff_t. (The former is easier to understand, but the latter is more semantically appropriate. I'm not sure which to use.)
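
To illustrate that arithmetic point with a standalone sketch (names are made up, not Tpetra's actual kernel): even with a 64-bit offset array, the per-entry inner-loop arithmetic can stay 32-bit as long as each row's length fits in int.

```cpp
#include <cstdint>

// Hypothetical CRS row kernel: 64-bit row offsets, 32-bit per-entry work.
double spmv_row(const std::int64_t* ptr, const int* ind, const double* val,
                const double* x, int lclRow) {
  const std::int64_t beg = ptr[lclRow];
  // ptr[lclRow+1] - ptr[lclRow] fits in int as long as no single row has
  // more than INT_MAX entries.
  const int rowLen = static_cast<int>(ptr[lclRow + 1] - beg);
  const int* rowInd = ind + beg;    // one 64-bit offset per row, not per entry
  const double* rowVal = val + beg;
  double sum = 0.0;
  for (int k = 0; k < rowLen; ++k) {  // 32-bit arithmetic per matrix entry
    sum += rowVal[k] * x[rowInd[k]];
  }
  return sum;
}
```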

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 18, 2018
…some debug builds (trilinos#2827)

These very targeted disables should allow these tests to all complete in well
under 10 minutes in all of these debug builds on all of these platforms.  See
the diffs to see exactly what unit tests are disabled in what unit test
executables in what builds on what platforms.  For details on why these are
being disabled, see trilinos#2827.
@bartlettroscoe

@crtrott, @srajama1, and @ibaned,

What about the idea that I floated above to limit the matrix and array sizes and the number of iterations for these tests when CMAKE_BUILD_TYPE=DEBUG and Trilinos_ENABLE_DEBUG=ON? That way, we could still ensure that all of the tests run for all of the use cases in a full debug build, so that developers can debug through the code. This would allow the tests to run faster in a full debug build while still running the same way in optimized builds.

I could perhaps prototype this in a PR for a Kokkos and KokkosKernels unit test to show you what I mean. Does that sound reasonable?

@bartlettroscoe

There were no failing or timing-out Kokkos or KokkosKernels tests for the past two days as shown in this query. And we can see, for example, the test KokkosKernels_sparse_serial_MPI_1 newly passing two days ago in the Trilinos-atdm-hansen-shiller-gnu-debug-serial build shown here.

This issue is resolved. Closing as complete.

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Jun 21, 2018
@bartlettroscoe

FYI: it looks like the selective disables of some of the KokkosKernels unit tests in PR #2964, merged to 'develop' on 6/19/2018, did not eliminate all of the timeouts of these tests, as shown in this query which showed the timeout:

Site Build Name Test Name Status Time Details Build Time
white Trilinos-atdm-white-ride-gnu-debug-openmp KokkosKernels_sparse_openmp_MPI_1 Failed 10m 40ms Completed (Timeout) 2018-06-25T06:12:32 UTC

But since this is just one timeout, we should leave this for now and see if this is a recurring problem.

Again, random failures like that across all of the various packages and builds will add up and cause the automated processes that update Trilinos versions to 'master', or the ATDM APP Trilinos mirror repos, to fail more frequently. Therefore, we have to be on top of every randomly failing test in every package in every ATDM build.

fryeguy52 added a commit to fryeguy52/Trilinos that referenced this issue Oct 3, 2018
fryeguy52 added a commit to fryeguy52/Trilinos that referenced this issue Oct 8, 2018
@bartlettroscoe bartlettroscoe added the PA: Data Services Issues that fall under the Trilinos Data Services Product Area label Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018