TeuchosCore_testTeuchosTestForTermination_X_MPI_4 tests randomly timing out/hanging in new cee-rhel6 openmpi-4.0.1 builds starting 2020-01-04 #6532
Labels
ATDM Env Issue
Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos
ATDM Sev: Nonblocker
Problems with Trilinos that should not block ATDM APPs from getting updates
client: ATDM
Any issue primarily impacting the ATDM project
impacting: tests
The defect (bug) is primarily a test failure (vs. a build failure)
PA: Framework
Issues that fall under the Trilinos Framework Product Area
pkg: Teuchos
Issues primarily dealing with the Teuchos Package
type: bug
The primary issue is a bug in Trilinos code or tests
CC: @trilinos/teuchos, @bartlettroscoe
Next Action Status
Likely due to a defect in this OpenMPI 4.0.1 install, which does not correctly terminate the MPI job when one of the ranks is terminated. Next: Let the builds run for a few weeks to gather more data, then decide what to do.
Description
As shown in this query, the tests:
TeuchosCore_testTeuchosTestForTermination_0_MPI_4
TeuchosCore_testTeuchosTestForTermination_2_MPI_4
TeuchosCore_testTeuchosTestForTermination_3_MPI_4
in the new builds:
Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-4.0.1_serial_static_opt
Trilinos-atdm-cee-rhel6_gnu-7.2.0_openmpi-4.0.1_serial_shared_opt
are randomly timing out.
These tests are most likely hanging after one of the processes calls abort: it seems that OpenMPI 4.0.1 is not properly terminating the other processes, so the job never exits and the test runs until it times out.
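To make the suspected failure mode concrete, here is a toy model in plain Python. It is only a sketch and does not use MPI or reflect how OpenMPI is implemented. The names `run_job`, `kill_peers`, `ABORTING_RANK`, and `BLOCKED_RANK` are hypothetical and exist only for this illustration. Four child processes stand in for the four ranks: "rank 0" calls abort and the others block. With `kill_peers=True` the launcher tears down the remaining ranks as soon as one dies abnormally, which is the behavior the tests rely on. With `kill_peers=False` the surviving ranks keep running until the timeout, which is what the hanging tests appear to be doing.

```python
import subprocess
import sys
import time

# Hypothetical stand-ins for MPI ranks: rank 0 aborts, the others block.
ABORTING_RANK = "import os; os.abort()"
BLOCKED_RANK = "import time; time.sleep(60)"


def run_job(num_ranks=4, kill_peers=True, timeout_s=10.0):
    """Launch toy ranks and return 'terminated' or 'timed out'.

    kill_peers=True models a launcher that tears down the whole job once
    any rank exits abnormally; kill_peers=False models the suspected
    faulty behavior where the surviving ranks are left running.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", ABORTING_RANK if rank == 0 else BLOCKED_RANK],
            stderr=subprocess.DEVNULL,
        )
        for rank in range(num_ranks)
    ]
    deadline = time.monotonic() + timeout_s
    outcome = "timed out"
    try:
        while time.monotonic() < deadline:
            codes = [p.poll() for p in procs]
            if all(c is not None for c in codes):
                outcome = "terminated"
                break
            # A non-zero (or signal) exit code from any rank triggers teardown.
            if kill_peers and any(c not in (None, 0) for c in codes):
                for p in procs:
                    if p.poll() is None:
                        p.kill()
            time.sleep(0.05)
    finally:
        # Always clean up so the sketch never leaves stray processes behind.
        for p in procs:
            if p.poll() is None:
                p.kill()
            p.wait()
    return outcome
```

Calling `run_job(kill_peers=True)` should return `"terminated"` shortly after the first rank aborts, while `run_job(kill_peers=False, timeout_s=1.0)` should return `"timed out"`, mirroring the difference between a passing run and a hung one.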
Current Status on CDash
Steps to Reproduce
One should be able to reproduce this failure on any CEE RHEL6 or RHEL7 machine as described in:
More specifically, the commands given for the system 'cee-rhel6' are provided at:
The exact commands to reproduce this issue should be: