TeuchosCore_testTeuchosTestForTermination_X_MPI_4 tests randomly timing out/hanging in new cee-rhel6 openmpi-4.0.1 builds starting 2020-01-04 #6532

Closed
bartlettroscoe opened this issue Jan 7, 2020 · 3 comments
Labels
  • ATDM Env Issue - Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos
  • ATDM Sev: Nonblocker - Problems with Trilinos that should not block ATDM APPs from getting updates
  • client: ATDM - Any issue primarily impacting the ATDM project
  • impacting: tests - The defect (bug) is primarily a test failure (vs. a build failure)
  • PA: Framework - Issues that fall under the Trilinos Framework Product Area
  • pkg: Teuchos - Issues primarily dealing with the Teuchos Package
  • type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Jan 7, 2020

CC: @trilinos/teuchos, @bartlettroscoe

Next Action Status

Likely caused by a defect in this OpenMPI 4.0.1 install, which does not correctly terminate the MPI job when one of the ranks terminates. Next: Let the tests run for a few weeks to gather more data, then decide what to do.

Description

As shown in this query the tests:

  • TeuchosCore_testTeuchosTestForTermination_0_MPI_4
  • TeuchosCore_testTeuchosTestForTermination_2_MPI_4
  • TeuchosCore_testTeuchosTestForTermination_3_MPI_4

in the new builds:

  • Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-4.0.1_serial_static_opt
  • Trilinos-atdm-cee-rhel6_gnu-7.2.0_openmpi-4.0.1_serial_shared_opt

are randomly timing out.

These tests are most likely hanging because one of the processes calls abort and, it seems, OpenMPI 4.0.1 does not then terminate the other processes.

Current Status on CDash

Steps to Reproduce

One should be able to reproduce this failure on any CEE RHEL6 or RHEL7 machine as described in:

More specifically, the commands given for the system 'cee-rhel6' are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teuchos=ON \
 $TRILINOS_DIR

$ make NP=16

$ ctest -j12
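
Because the failures are random, a single `ctest -j12` pass may well come back clean. A minimal sketch for rerunning only the affected tests several times from the same build directory is given here. The repetition count and timeout are hypothetical values, not taken from this issue, and the helper function names are made up for illustration. Only the standard `ctest` options `-R`, `--repeat-until-fail`, and `--timeout` are used.

```shell
# Regex for the tests named in the issue title (the "X" is the test index).
TEST_REGEX='TeuchosCore_testTeuchosTestForTermination_[0-9]+_MPI_4'

# Hypothetical settings, not from the issue.
REPEATS=20       # rerun each matching test up to this many times
TIMEOUT_SEC=60   # default timeout for tests that do not set their own

# Rerun only the matching tests, stopping at the first failure or timeout.
rerun_termination_tests() {
  ctest -R "$TEST_REGEX" --repeat-until-fail "$REPEATS" --timeout "$TIMEOUT_SEC"
}

# Check a single test name against the same regex,
# e.g. when filtering a saved list of failing tests.
is_termination_test() {
  printf '%s\n' "$1" | grep -Eq "^${TEST_REGEX}\$"
}
```

Calling `rerun_termination_tests` after the build step should report a hang as a timeout of the matching test. Note that `--timeout` only sets a default, so a test that carries its own timeout property keeps that value.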
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Teuchos Issues primarily dealing with the Teuchos Package client: ATDM Any issue primarily impacting the ATDM project ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates PA: Framework Issues that fall under the Trilinos Framework Product Area labels Jan 7, 2020
@bartlettroscoe bartlettroscoe added the ATDM Env Issue Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos label Jan 7, 2020
@bartlettroscoe
Member Author

I added the label ATDM Env Issue because I believe this is a defect in this OpenMPI 4.0.1 installation: it does not abort the other ranks when one of the ranks aborts.

I will let this test keep running in these builds so that we can get some more statistics.

@bartlettroscoe
Member Author

Also, I consider this to be a nonblocking issue: as long as your program does not terminate on one of its processes, all is good. If it does terminate on a process, the MPI job will just hang until the timeout is reached.
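
For runs outside of ctest, where no test timeout applies, the same bounded-hang behavior can be approximated with a wrapper. This is only a sketch, assuming GNU coreutils `timeout` is available (it exits with status 124 when the time limit is hit). The function name `run_with_timeout` is made up for illustration.

```shell
# Run a command under a wall-clock limit so that a hung MPI job
# turns into a reported failure instead of blocking forever.
# Usage: run_with_timeout <seconds> <command> [args...]
run_with_timeout() {
  limit_sec=$1
  shift
  status=0
  timeout "$limit_sec" "$@" || status=$?
  # GNU timeout returns 124 when it had to stop the command.
  if [ "$status" -eq 124 ]; then
    echo "TIMEOUT after ${limit_sec}s: $*" >&2
  fi
  return "$status"
}
```

An invocation might look like `run_with_timeout 600 mpirun -np 4 ./some_test.exe`, where the executable name and the 600-second limit are placeholders.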

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jan 7, 2020
I triaged the one randomly failing test in trilinos#6532.  I think it is appropriate to
promote these builds given the other builds in the ATDM group are not 100%
clean (and no one seems to care that much).
@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Feb 11, 2020
@bartlettroscoe
Member Author

As shown in this query, these tests started passing when the 'cee-rhel6' builds were switched from using 'openmpi-4.0.1' to 'openmpi-4.0.2' on testing day 2020-01-16. It looks like OpenMPI 4.0.2 fixed the behavior of MPI_Abort() :-)

Therefore, this is fixed. Closing as fixed :-)
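
For anyone who hits similar hangs in another environment, a quick check of the loaded OpenMPI version may help decide whether this issue applies. The sketch below assumes the first line of `mpirun --version` looks like `mpirun (Open MPI) 4.0.1`; that banner format is an assumption and is not taken from this issue. The function name is made up for illustration, and it flags only 4.0.1, the one version reported as affected here.

```shell
# Return success if the given version banner names OpenMPI 4.0.1.
# $1: first line of `mpirun --version`, assumed to look like
#     "mpirun (Open MPI) 4.0.1" (assumed format, see above).
is_affected_openmpi() {
  case "$1" in
    *"(Open MPI) 4.0.1") return 0 ;;
    *) return 1 ;;
  esac
}

# Example use after sourcing load-env.sh (left commented out here):
#   banner=$(mpirun --version 2>/dev/null | head -n 1)
#   is_affected_openmpi "$banner" && echo "OpenMPI 4.0.1 loaded, see #6532"
```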
