PanzerAdaptersIOSS_tIOSSConnManager tests failing in ATDM cee-rhel6 builds #3632

Closed
fryeguy52 opened this issue Oct 15, 2018 · 31 comments
Labels
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • Disabled Tests (Issue has been partially addressed by disabling *all* of the failing tests related to the issue)
  • PA: Discretizations (Issues that fall under the Trilinos Discretizations Product Area)
  • pkg: Panzer
  • type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@fryeguy52
Contributor

fryeguy52 commented Oct 15, 2018

CC: @trilinos/panzer , @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe

Next Action Status

EMPIRE works just fine against these 'cee-rhel6' builds (see TRIL-242), so these failing tests are not indicative of any problems for EMPIRE. With the merge of PR #4079 to 'develop' on 12/19/2018, these tests are now disabled in the 'cee-rhel6' builds and were shown to be missing on 12/20/2018.

Description

As shown in this query the tests:

  • PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3
  • PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2

are failing in the builds:

  • Trilinos-atdm-cee-rhel6-gnu-opt-serial
  • Trilinos-atdm-cee-rhel6-intel-opt-serial
  • Trilinos-atdm-cee-rhel6-clang-opt-serial

Current Status on CDash

To see the current status of these tests on CDash, click on the below link:

NOTES:

  • Click on 'Status' twice to sort all of the currently 'Failed' tests to the top
  • Click 'Previous' to see status for prior days, etc.

Steps to Reproduce

One should be able to reproduce this failure on any CEE LAN RHEL6 SRN as described in:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cee-rhel6-gnu-opt-serial

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Panzer=ON \
  $TRILINOS_DIR

$ make NP=16

$ ctest -j16
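
Once the build above has finished, it may be quicker to re-run only the two failing tests rather than the whole suite. The following is a minimal sketch, assuming it is run from the same build directory. `-R` (regex filter) and `--output-on-failure` are standard `ctest` options; the variable names are purely illustrative.

```shell
# Names of the two failing tests, taken from the description above.
failing_tests="PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2
PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3"

# Join the names into one anchored regex of the form ^(name1|name2)$
test_regex="^($(printf '%s\n' "$failing_tests" | paste -sd'|' -))\$"
echo "$test_regex"

# Run only those tests if this looks like a configured build directory.
if [ -f CTestTestfile.cmake ]; then
  ctest -R "$test_regex" --output-on-failure
else
  echo "Not in a build directory; skipping ctest."
fi
```

Anchoring the regex avoids accidentally matching other tests whose names merely contain one of these strings.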
@fryeguy52 fryeguy52 added type: bug The primary issue is a bug in Trilinos code or tests pkg: Panzer client: ATDM Any issue primarily impacting the ATDM project labels Oct 15, 2018
@bartlettroscoe
Member

@gsjaardema, it looks like the behavior of SEACAS/Exodus changes when using the custom FindNetcdf.cmake module that you wrote. It is changing the nodal IDs. See above and here. I can provide more details.

I talked with @rppawlo and he said that if this is not something that you can help fix, then he is okay with just disabling these tests in ATDM Trilinos builds.

@bartlettroscoe
Member

@rppawlo, just to confirm with you, since none of the ATDM APP are using the code in PanzerAdaptersIOSS, can we just disable these failing tests in these 'cee-rhel6' builds?

But I am still concerned that the SPARC way of configuring Trilinos/SEACAS using the magic FindNetCDF.cmake module will break EMPIRE's usage of this Trilinos configuration.

@gsjaardema
Contributor

I will try to take a look on Monday.

@rppawlo
Contributor

rppawlo commented Oct 29, 2018

yes - fine to disable, though am hoping @gsjaardema can work this out today.

@bartlettroscoe
Member

@rppawlo,

yes - fine to disable, though am hoping @gsjaardema can work this out today.

Okay, let's wait to see if @gsjaardema can get to the bottom of this since I fear this might break EMPIRE.

@bartlettroscoe
Member

FYI: I passed info to @bathmatt to test out EMPIRE to see if it has any new failing tests related to Exodus with this different SEACAS/Exodus NetCDF configuration.

@gsjaardema
Contributor

Question for whoever knows: it looks like we are using a very old version of NetCDF here (4.4.0), even though SPARC has a newer version of NetCDF available (4.6.1). Is there a valid reason for using the old version? There have been many bugs fixed and enhancements added from 4.4.0 to 4.6.1.

@gsjaardema
Contributor

@bartlettroscoe what is meant by "magic FindNetCDF.cmake" ? What is the "non-magic" method and is one better than the other?

@bartlettroscoe
Member

What is the "non-magic" method and is one better than the other?

@gsjaardema, the "non-magic" method is just a raw listing of header files and libraries as shown in:

which is what the EMPIRE Trilinos configuration does.

Note that if you switch to using that approach, these Panzer tests pass but some SPARC tests fail.

As for the version of NetCDF, we need to consult with @micahahoward and @sebrowne.

@gsjaardema
Contributor

@rppawlo It looks like the failing tests are using a pamgen-generated mesh with no exodus input/output. If that truly is the case, then I am confused as to why a different NetCDF configuration process would affect the testing results since there should be no NetCDF functions being called at all during the testing.

I have verified that nc_open and nc_create (and their parallel counterparts) are not being called and no Exodus-related functions are being called.

Not sure what is the issue yet, but just making sure I was not missing something on the tests that were being run.

@gsjaardema
Contributor

@bartlettroscoe Question -- in the configuration section above, we use cee-rhel6-gnu-opt-serial. What does the serial in that string represent? It looks like parallel tests are being run, so I am confused about what the serial means.

@bartlettroscoe
Member

What does the serial in that string represent?

@gsjaardema, as explained at:

it means to use the Kokkos serial threading model.
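
To make the distinction concrete: the keyword in the build name selects the on-node Kokkos threading model, and is separate from the number of MPI processes a test runs with (the `_MPI_2` and `_MPI_3` suffixes on the test names). The sketch below is a hypothetical helper and not the actual parsing done by `load-env.sh`; the `openmp` and `cuda` keywords are assumptions based on other ATDM build names.

```shell
# Hypothetical helper (not the actual ATDM load-env.sh logic): report the
# Kokkos node type implied by a keyword in the build-name string.
kokkos_node_type() {
  case "$1" in
    *cuda*)   echo "CUDA" ;;
    *openmp*) echo "OpenMP" ;;
    *serial*) echo "Serial" ;;
    *)        echo "unspecified" ;;
  esac
}

kokkos_node_type cee-rhel6-gnu-opt-serial
```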

@gsjaardema
Contributor

@bartlettroscoe RE: non-magic building.

How do I do a build on a cee-rhel6 machine using the "non-magic" build configuration?

@rppawlo
Contributor

rppawlo commented Oct 29, 2018

@gsjaardema - that's part of our confusion. A change to the detection of netcdf should not change the numbering of this test. I suspect that the FindNetcdf module is defining a cmake flag that may change a define in how ioss does numbering. Can you point me to where the FindNetcdf code exists?

@gsjaardema
Contributor

The FindNetcdf.cmake module is in cmake/tribits/common_tpls/find_modules/FindNetCDF.cmake. It determines how the NetCDF library was built and defines a few symbols:

  • NetCDF_NEEDS_HDF5
  • NetCDF_NEEDS_PNetCDF
  • NetCDF_PARALLEL
  • NetCDF_INCLUDE_DIRS
  • NetCDF_LIBRARIES
  • NetCDF_BINARIES

My hypothesis so far is that the differences have something to do with the NetCDF_PARALLEL setting, and I am looking into that possibility currently...
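
One way to cross-check what the find module detects, independently of CMake, might be to ask the NetCDF install itself. This is only a sketch: it assumes the `nc-config` utility from the same NetCDF install is on `PATH`, and that it supports the `--has-hdf5`, `--has-pnetcdf`, and `--has-parallel` queries (older releases may not support all of them). The function name is illustrative.

```shell
# Print how a NetCDF install was built, using nc-config if it is available.
netcdf_build_summary() {
  if ! command -v nc-config >/dev/null 2>&1; then
    echo "nc-config not found on PATH"
    return 0
  fi
  echo "version: $(nc-config --version)"
  # Each --has-* query prints yes or no; options that this nc-config
  # does not support are reported as "unknown".
  for feature in hdf5 pnetcdf parallel; do
    echo "$feature: $(nc-config --has-$feature 2>/dev/null || echo unknown)"
  done
}

netcdf_build_summary
```

Comparing this output between the SPARC and EMPIRE environments would show whether the two configurations really link against differently built NetCDF libraries.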

@bartlettroscoe
Member

How do I do a build on a cee-rhel6 machine using the "non-magic" build configuration?

@gsjaardema, you would just have to edit your local copy of Trilinos and change the file cmake/std/atdm/ATDMDevEnvSettings.cmake to use the non-SPARC way of pulling in NetCDF. I can create a topic branch with a cache var that allows you to toggle that if it would help.

@gsjaardema
Contributor

@bartlettroscoe I thought I could handle the non-magic build, but am unable to get it to pass the tests, so I must be doing it wrong. If you could create a topic branch with a cache var for me to use, that would be appreciated.

@bartlettroscoe
Member

I thought I could handle the non-magic build, but am unable to get it to pass the tests, so I must be doing it wrong. If you could create a topic branch with a cache var for me to use, that would be appreciated.

@gsjaardema, okay, let me create the topic branch and test to make sure it is doing the right thing then I will push and point to it.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 30, 2018
This allows you to switch to the EMPIRE way of pulling in the HDF5 and Netcdf
TPLs.  This was added to aid in the debugging of apparent changing of behavior
of SEACAS when using the SPARC way vs. the EMPIRE way of specifying the HDF5
and Netcdf TPLs (see trilinos#3632).
@bartlettroscoe
Member

@gsjaardema, I created the PR #3632 that provides the toggle ATDM_CONFIG_USE_SPARC_TPL_FIND_SETTINGS. To build with the EMPIRE way of pulling in HDF5 and Netcdf, set in the env or the CMake cache as ATDM_CONFIG_USE_SPARC_TPL_FIND_SETTINGS=OFF.

Interestingly, these tests failed with that configuration as well. That is not my memory but I may be mistaken. Looking at the history for the tests PanzerAdaptersIOSS_tIOSSConnManagerXXX shown here, these tests run and pass in PR builds and various other nightly builds.

What is it about this SPARC env and TPLs that is causing these tests to fail? It seems it is not the SPARC way of using the custom FindNetCDF.cmake module you wrote after all.
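
For anyone repeating the experiment, the toggle described above could be applied roughly as follows. This is a sketch under the assumption that the variable is exported before the environment is loaded and Trilinos is reconfigured; the `load-env.sh` path is the one from the reproduction steps, and the `load_env` variable is illustrative.

```shell
# Switch to the EMPIRE way of pulling in HDF5 and NetCDF (per the comment above).
export ATDM_CONFIG_USE_SPARC_TPL_FIND_SETTINGS=OFF

# Assumed location of load-env.sh, as in the reproduction steps; skipped
# if TRILINOS_DIR does not point at a Trilinos checkout.
load_env="${TRILINOS_DIR:-}/cmake/std/atdm/load-env.sh"
if [ -f "$load_env" ]; then
  . "$load_env" cee-rhel6-gnu-opt-serial
else
  echo "load-env.sh not found; set TRILINOS_DIR first."
fi
```

After this, the same `cmake` command from the reproduction steps would be re-run in a clean build directory. Setting `-DATDM_CONFIG_USE_SPARC_TPL_FIND_SETTINGS=OFF` on the `cmake` command line is the CMake-cache alternative mentioned above.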

trilinos-autotester added a commit that referenced this issue Oct 30, 2018
…empire-netcdf-hdf5-config

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Add env var ATDM_CONFIG_USE_SPARC_TPL_FIND_SETTINGS (#3632)
PR Author: bartlettroscoe
@bartlettroscoe
Member

@gsjaardema, this issue about the PanzerAdaptersIOSS tests may not be related to the SEACASIoss_Utst_structured_decomp test in #3891, but it would be good to figure out why we are seeing different behavior depending on how we pull in the NetCDF and HDF5 TPLs, as it impacts this test. If we could figure that out, then we could switch back to explicitly setting the include directories and libraries for all of these TPLs and eliminate the tricky find module behavior.

@jbcarleton
Contributor

Does using different versions of pnetcdf produce different mesh decompositions? If so, these tests should fail, since they are tied to a particular decomposition.

@gsjaardema
Contributor

@jbcarleton, the version of PnetCDF should have no effect on the decomposition.

@bartlettroscoe
Member

@gsjaardema, any idea what could be causing the different behavior of SEACAS with these TPLs? How can we go about debugging this? NOTE: We should hopefully find out if this will also impact EMPIRE in the next few days.

@bartlettroscoe bartlettroscoe added the PA: Discretizations Issues that fall under the Trilinos Discretizations Product Area label Nov 29, 2018
@mperego
Contributor

mperego commented Dec 10, 2018

@rppawlo It seems there is not enough momentum on this issue. Should we disable the test on the 'cee-rhel6' builds?

@bartlettroscoe
Member

@mperego said

@rppawlo It seems there is not enough momentum on this issue. Should we disable the test on the 'cee-rhel6' builds?

FYI: I have been waiting to run the EMPIRE builds against this 'cee-rhel6' configuration to see if the failure in this test might indicate a change in behavior of SEACAS that would break SPARC.

@rppawlo
Contributor

rppawlo commented Dec 10, 2018

It's fine to disable.

@bartlettroscoe
Member

FYI: As documented in TRIL-242, I verified that once the tweak to the 'cee-rhel6' SPARC ATDM Trilinos configuration in PR #4054 is merged to 'develop', EMPIRE builds and runs all of its tests just fine.

Therefore, it seems that these failing PanzerAdaptersIOSS_tIOSSConnManagerXXX tests don't indicate a problem with these 'cee-rhel6' configurations for EMPIRE. Therefore, we can safely disable these tests in the cee-rhel6 builds.

@bartlettroscoe
Member

With the merge of PR #4079 to 'develop' on 12/19/2018, these tests should now be disabled in the 'cee-rhel6' builds.

In fact, we already can see that these tests are missing in some 'cee-rhel6' builds as shown, for example, in the build Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt today.

Unfortunately, due to the crashes of the Trilinos autotester, PR #4079 did not merge until after the first 'cee-rhel6' build ran so these tests still failed today as shown here.
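
For reference, individual tests in a TriBITS-based build can typically be turned off at configure time by setting a `<fullTestName>_DISABLE=ON` cache variable. The sketch below only builds and prints the corresponding `cmake` arguments for the two tests in this issue; it is an assumption about one way to do this and may not match how PR #4079 actually disabled them.

```shell
# Build -D<fullTestName>_DISABLE=ON arguments for the two failing tests.
# Assumes the TriBITS per-test disable cache variable; PR #4079 may have
# implemented the disable differently.
disable_args=""
for t in PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 \
         PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3; do
  disable_args="$disable_args -D${t}_DISABLE=ON"
done

# Print the resulting command line rather than running it.
echo "cmake$disable_args \$TRILINOS_DIR"
```

A test disabled this way is not added to the CTest suite at all, which is consistent with the tests showing up as "Missing" on CDash rather than as failed or not run.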

@bartlettroscoe bartlettroscoe added stage: in review Primary work is completed and now is just waiting for human review and/or test feedback Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue labels Dec 19, 2018
@bartlettroscoe
Member

Looks like these have all been disabled as shown in the table below (with data taken from CDash)

Adding the "Disabled Tests" label to filter out of our main queries.

@rppawlo, do you want to keep this issue open with the "Disabled Tests" label or just close it? If there are no plans to try to fix this anytime soon, we might as well close this in my opinion. We need to leave the "Disabled Tests" label on this so we can find it if we want to but otherwise could close.


Tests with issue trackers Missing: twim=16 (On 2018-12-20<)

| Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Tracker |
|---|---|---|---|---|---|---|---|---|
| cee-rhel6 | Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 | Missing | Missing | 1 | 29 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 | Missing | Missing | 1 | 29 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 | Missing | Missing | 2 | 28 | 0 | #3632 |
| cee-rhel6 | Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt | PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 | Missing | Missing | 2 | 28 | 0 | #3632 |

@rppawlo
Contributor

rppawlo commented Dec 20, 2018

Fine with closing. Its priority was dropped and we will not address it anytime soon.

@bartlettroscoe
Member

Fine with closing. Its priority was dropped and we will not address it anytime soon.

Closing. Thanks!

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Dec 20, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 10, 2019
Don't know why the trigger of turning on extra stuff causes these tests to
fail but it was determined that fixing these is not worth it so we disable
them.  See trilinos#3632.
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Apr 13, 2019
…s:develop' (7db7806).

* trilinos-develop: (23 commits)
  Fix cmake-file error in stk_balance that was making the m2n exe be a test.
  tpetra:  minor fix; return the values
  Fix incorrect line length in copy_string change
  Automatic snapshot commit from seacas at f9bf59a
  SEACAS: cgns - support self-looping models
  Disable failing ROL test already known to fail in CUA builds (trilinos#3543)
  Disable known failing Panzer tests (trilinos#3632)
  Small formatting change to comment (trilinos#3939)
  Enable SPARC TPLs and packages on 'waterman' (ATDV-151)
  ShyLU/FROSch: Correct use of booleans for interface components
  Don't allow Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt to run on 'ride7' (ATDV-155)
  tpetra:  minor additional deprecations trilinos#4839
  MiniEM: Fix discrete gradient
  tpetra:  changes to address Mark's comments on trilinos#4839
  Xpetra: MueLu: fix issue 4038
  ShyLU/FROSch: Use insertGlobalValues instead of insertLocalValues for GlobalCoarseMatrix
  stokhos:  fix compilation error due to tpetra deprecation changes
  Thyra:  fixed compilation error due to deprecation changes
  tpetra:  More deprecations of function arguments involving Node. create*MapWithNode generate_miniFM_*
  Tpetra:  removing Node from argument lists of functions Completed MatrixMarket_Tpetra functions (readSparse, readDense, etc.) Also removed a few compiler warnings reported in clang
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Apr 14, 2019
…s:develop' (7db7806).

* trilinos-develop: (30 commits)
  Fix cmake-file error in stk_balance that was making the m2n exe be a test.
  Tpetra: Global Ordinal validation
  tpetra:  minor fix; return the values
  Fix incorrect line length in copy_string change
  Tpetra: Moved GORDS logic to right file this time, really.
  Tpetra: GORDS Deprecation Cleanup
  Tpetra: Relocated # GORDS validation logic to packages/tpetra/core/CMakeLists.txt
  Tpetra: clean up deprecation WIP tags
  Tpetra: Add deprecations for global ordinal types
  Automatic snapshot commit from seacas at f9bf59a
  SEACAS: cgns - support self-looping models
  Disable failing ROL test already known to fail in CUA builds (trilinos#3543)
  Disable known failing Panzer tests (trilinos#3632)
  Small formatting change to comment (trilinos#3939)
  Enable SPARC TPLs and packages on 'waterman' (ATDV-151)
  ShyLU/FROSch: Correct use of booleans for interface components
  Don't allow Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt to run on 'ride7' (ATDV-155)
  tpetra:  minor additional deprecations trilinos#4839
  Ifpack2 - fix issue 4858
  MiniEM: Fix discrete gradient
  ...