Link failure for test SEACASIoss_Utst_structured_decomp.exe in Trilinos-atdm-cee-rhel6-intel-opt-serial starting 11/3/2018 #3891

Closed
bartlettroscoe opened this issue Nov 17, 2018 · 16 comments
Labels
ATDM Sev: Nonblocker (Problems with Trilinos that should not block ATDM APPs from getting updates)
client: ATDM (Any issue primarily impacting the ATDM project)
client: SPARC (Issues related to or needed more specifically by the ATDM SPARC code)
Disabled Tests (Issue has been partially addressed by disabling *all* of the failing tests related to the issue)
PA: Data Services (Issues that fall under the Trilinos Data Services Product Area)
pkg: seacas
type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Nov 17, 2018

CC: @trilinos/seacas , @kddevin (Trilinos Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

Decide what to do with this failing test.

Description

As shown in this query, the executable SEACASIoss_Utst_structured_decomp.exe started failing to link in the build Trilinos-atdm-cee-rhel6-intel on 11/3/2018. This in turn caused the test defined using this executable, SEACASIoss_Utst_structured_decomp_MPI_1, not to be run.

The link failure, shown here, reports:

/projects/sparc/tpls/cee-rhel6-new/cgns-develop/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/libcgns.a(ADFH.c.o): In function `children_ids':
ADFH.c:(.text+0x18b): undefined reference to `H5Gopen2'
ADFH.c:(.text+0x1c3): undefined reference to `H5Gclose'
/projects/sparc/tpls/cee-rhel6-new/cgns-develop/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/libcgns.a(ADFH.c.o): In function `compare_children':
ADFH.c:(.text+0x1f4): undefined reference to `H5Gget_objinfo'
/projects/sparc/tpls/cee-rhel6-new/cgns-develop/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/libcgns.a(ADFH.c.o): In function `get_str_att':
ADFH.c:(.text+0x26c): undefined reference to `H5Aopen_name'
ADFH.c:(.text+0x28c): undefined reference to `H5Aiterate2'
ADFH.c:(.text+0x2e8): undefined reference to `H5Aget_type'
ADFH.c:(.text+0x301): undefined reference to `H5Aread'
ADFH.c:(.text+0x30b): undefined reference to `H5Tclose'
ADFH.c:(.text+0x312): undefined reference to `H5Aclose'
ADFH.c:(.text+0x4b7): undefined reference to `H5Aclose'
...
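
Every undefined symbol above carries the `H5` prefix used by the HDF5 C API, which suggests that an HDF5 library is missing from the link line rather than anything in CGNS itself. A minimal sketch for reducing such linker output to a unique symbol list follows. The helper name `list_undefined_symbols` is hypothetical, and the sketch assumes GNU ld style messages like the ones above.

```shell
# Hypothetical helper: read linker output on stdin and print each symbol
# named in an "undefined reference to `sym'" message, once, sorted.
list_undefined_symbols() {
  sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" | LC_ALL=C sort -u
}

# Sample input: a few lines copied from the link failure above.
list_undefined_symbols <<'EOF'
ADFH.c:(.text+0x18b): undefined reference to `H5Gopen2'
ADFH.c:(.text+0x1c3): undefined reference to `H5Gclose'
ADFH.c:(.text+0x312): undefined reference to `H5Aclose'
ADFH.c:(.text+0x4b7): undefined reference to `H5Aclose'
EOF
```

For the sample lines this prints H5Aclose, H5Gclose and H5Gopen2, one per line.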

The new commits that were pulled the day that these failures started are shown, for example, here. Looking over those commits, there does not seem to be any that could impact either that ATDM Trilinos configuration or the SEACAS package itself. And there does not seem to have been an env change in the HDF5 libs that could have triggered this link failure (more on that in a later comment).

Current Status on CDash

As shown in this query, the build Trilinos-atdm-cee-rhel6-intel was (prematurely) disabled on 11/11/2018, and therefore this failure cannot be seen on the current CDash site (but I did reproduce this failure locally while working on #3871, so this build error still exists).

Steps to Reproduce

One should be able to reproduce this failure on any CEE RHEL6 machine using the 'cee-rhel6' env as described in:

More specifically, the commands given for the 'cee-rhel6' env are provided at:

The exact commands to reproduce this build error should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cee-rhel6-intel-opt-serial

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_SEACAS=ON \
 -DSEACASIoss_Utst_structured_decomp_EXE_DISABLE=OFF \
 -DSEACASIoss_Utst_structured_decomp_DISABLE=OFF \
 $TRILINOS_DIR

$ make NP=16

@bartlettroscoe added the labels type: bug, pkg: seacas, client: ATDM, and client: SPARC on Nov 17, 2018

@bartlettroscoe
Member Author

FYI: Note that, as shown in this CDash query, this test SEACASIoss_Utst_structured_decomp_MPI_1 only ever gets built and run as part of the ATDM Trilinos 'cee-rhel6' builds. (That suggests that it must depend on CGNS, which is not enabled in any other Trilinos build submitting results to the Trilinos CDash site.) And as you can see in that CDash query, that test is currently only running in the build Trilinos-atdm-cee-rhel6-gnu-4.9.3-opt-serial, where it is passing in 4 sec.

@bartlettroscoe
Member Author

Looking at what could have triggered this build failure, the new commits that were pulled on 11/3/2018, the day these failures started, are shown, for example, here. Looking over those commits, there does not seem to be any that could impact either that ATDM Trilinos configuration or the SEACAS package itself. The only commit with any hope of impacting SEACAS was d20b710:

d20b71032e:  Fix nvidia compile issue in stk_search
Author: Alan Williams
Date:   Fri Nov 2 20:08:10 2018 -0600

M	packages/stk/stk_search/stk_search/Point.hpp

but those changes look completely localized to the function implementations in that file and unable to cause HDF5 link failures.

Looking for env changes, the HDF5 libs used on 11/2/2018, shown here, were:

-- HDF5 Version: 1.8.20
-- 	HDF5_INCLUDE_DIRS      =/projects/sparc/tpls/cee-rhel6-new/hdf5-1.8.20/cee-cpu_intel-17.0.1_intelmpi-5.1.2/include

and the HDF5 libs used on 11/3/2018, shown here, were:

-- HDF5 Version: 1.8.20
-- 	HDF5_INCLUDE_DIRS      =/projects/sparc/tpls/cee-rhel6-new/hdf5-1.8.20/cee-cpu_intel-17.0.1_intelmpi-5.1.2/include

So the directory is the same. And the libraries seem to have been last touched on 9/28/2018, as shown by:

$ ls -l /projects/sparc/tpls/cee-rhel6-new/hdf5-1.8.20/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/
total 8800
-rw-r----- 1 sparc wg-aero-usr 8153214 Sep 28 17:34 libhdf5.a
-rw-r----- 1 sparc wg-aero-usr  420468 Sep 28 17:34 libhdf5_fortran.a
-rwxr-x--- 1 sparc wg-aero-usr    1072 Sep 28 17:34 libhdf5_fortran.la
-rw-r----- 1 sparc wg-aero-usr  246200 Sep 28 17:34 libhdf5_hl.a
-rw-r----- 1 sparc wg-aero-usr  112122 Sep 28 17:34 libhdf5hl_fortran.a
-rwxr-x--- 1 sparc wg-aero-usr    1285 Sep 28 17:34 libhdf5hl_fortran.la
-rwxr-x--- 1 sparc wg-aero-usr    1057 Sep 28 17:34 libhdf5_hl.la
-rwxr-x--- 1 sparc wg-aero-usr     950 Sep 28 17:34 libhdf5.la
-rw-r----- 1 sparc wg-aero-usr    2556 Sep 28 17:34 libhdf5.settings

And as shown in this query, this test was running and passing just fine in the Trilinos-atdm-cee-rhel6-intel-opt-serial build from 10/12/2018 through 11/2/2018 so it is not like this test just started running on 11/3/2018.

I am stumped as to what could be causing this test executable to stop linking.

@bartlettroscoe
Member Author

@gsjaardema, given that this test is building and passing in the other 'cee-rhel6' builds, I would like to disable this one test for now so that I can get the 'cee-rhel6' builds updated as part of #3871. I will provide fresh instructions for re-enabling this test locally in case you or someone else wants to try to fix this for the 'cee-rhel6' Intel builds.

@gsjaardema
Contributor

My guess is that the CGNS library is not correctly adding an HDF5 dependency. The test is only enabled if TPL_ENABLE_CGNS is set. It also doesn't have any undefined CGNS symbols, so it must be linking with libcgns OK, but somehow the CGNS dependency on HDF5 is not being captured...

I think that there may be a TriBits PR that I submitted awhile ago that improved the CGNS find library, but not sure on that... I think SEACAS has a different find cgns that correctly finds the dependency...

But, fine to disable for now.

@gsjaardema
Contributor

Don't see the TriBits PR, so I guess I never submitted it. However, it seems like the normal CGNS find library should get the dependency...

@bartlettroscoe
Member Author

@gsjaardema said:

> I think that there may be a TriBits PR that I submitted awhile ago that improved the CGNS find library, but not sure on that... I think SEACAS has a different find cgns that correctly finds the dependency...

We can look into that, but does that explain how it went from building to not building? Also, the other non-Intel builds have this building just fine. Very strange.

> But, fine to disable for now.

Thanks. The other builds should protect this functionality fairly well.

Does this test represent a capability that SPARC uses that is not covered in other SEACAS tests?

@gsjaardema
Contributor

> Does this test represent a capability that SPARC uses that is not covered in other SEACAS tests?

Yes, it is a test that should be run, but since it is run on all SEACAS builds, it should be OK to disable in Trilinos for now. Not sure why other builds are succeeding with the CGNS library but not this one...

@bartlettroscoe
Member Author

FYI: I don't see that the NetCDF lib was updated around the time this link failure started to occur either. The NetCDF libs seem to have been static since 9/28/2018, as shown by:

$ ls -l /projects/sparc/tpls/cee-rhel6-new/netcdf-4.6.1/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/
total 1600
-rwxr-x--- 1 sparc wg-aero-usr    1482 Sep 28 17:47 libbzip2.la
-rwxr-x--- 1 sparc wg-aero-usr    1478 Sep 28 17:47 libmisc.la
-rw-r----- 1 sparc wg-aero-usr 1608316 Sep 28 17:47 libnetcdf.a
-rwxr-x--- 1 sparc wg-aero-usr    1513 Sep 28 17:47 libnetcdf.la
-rw-r----- 1 sparc wg-aero-usr    1340 Sep 28 17:47 libnetcdf.settings
drwxr-s--- 2 sparc wg-aero-usr    4096 Sep 28 17:47 pkgconfig

And the CGNS lib seems to have been static since 10/19/2018 as shown by:

$ ls -l /projects/sparc/tpls/cee-rhel6-new/cgns-develop/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/
total 2328
-rw-r----- 1 sparc wg-aero-usr 2367558 Oct 19 09:20 libcgns.a

So, I can't find any changes in TriBITS or in the ATDM Trilinos configuration or in the Trilinos packages impacting SEACAS or in installed libs that impact SEACAS. I can't understand what could cause this except for things moving around in memory somehow and changing the behavior of the linker.

I guess that is the next thing to do ... carefully examine the link line and examine the object files and the libs with 'nm' to see if the right symbols should be found.
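
A sketch of that `nm` check, assuming GNU binutils `nm` is available. The helper name `undefined_with_prefix` is hypothetical. The library path is the one quoted earlier in this thread, and the usage is guarded so that the snippet does nothing on machines where the file is not readable.

```shell
# Hypothetical helper: read `nm` output on stdin and print the unique
# undefined (type "U") symbols whose names start with the given prefix.
undefined_with_prefix() {
  awk -v p="$1" 'NF >= 2 && $(NF-1) == "U" && index($NF, p) == 1 { print $NF }' |
    LC_ALL=C sort -u
}

# List the HDF5 symbols that libcgns.a expects some other library to provide.
cgns_lib=/projects/sparc/tpls/cee-rhel6-new/cgns-develop/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/libcgns.a
if [ -r "$cgns_lib" ]; then
  nm "$cgns_lib" | undefined_with_prefix H5
fi
```

Running `nm --defined-only` on libhdf5.a and searching for the same names would then show whether that library provides them, and comparing both against the actual link line would show whether libhdf5.a appears after libcgns.a.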

@gsjaardema
Contributor

I think the reason we don't see the missing HDF5 symbols in other executables using CGNS is that those executables probably also use NetCDF, which has an (optional) HDF5 dependency depending on how it is built. The Utst_structured_decomp test only depends on CGNS, so it is the anomaly that catches the missing dependency.
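
The hypothesis can be illustrated with a toy model of how per-TPL library lists combine into a link line. This is not TriBITS code: the function names are hypothetical, and the library lists are assumptions taken from this comment (CGNS contributes only libcgns.a, while NetCDF is assumed to bring libhdf5.a along in this build).

```shell
# Illustrative model only: libraries each TPL is assumed to contribute.
tpl_libs() {
  case "$1" in
    CGNS)   echo "libcgns.a" ;;               # HDF5 dependency not recorded
    NetCDF) echo "libnetcdf.a libhdf5.a" ;;   # assumed to list HDF5
  esac
}

# Print the libraries that would end up on the link line for a set of TPLs.
link_line() {
  libs=""
  for tpl in "$@"; do
    libs="$libs $(tpl_libs "$tpl")"
  done
  echo "${libs# }"
}

link_line CGNS NetCDF   # includes libhdf5.a, so the H5* references resolve
link_line CGNS          # no libhdf5.a, so libcgns.a's H5* references stay undefined
```

Under these assumptions, an executable that uses both TPLs links cleanly by accident, and only an executable that uses CGNS alone exposes the gap.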

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Nov 17, 2018
…or cee-rhel6 intel builds (trilinos#3891)

There is currently a strange link error.  Only this one test executable is
impacted.

This test links and runs just fine in the other 'cee-rhel6' builds so
disabling it for now in these Intel builds is not so terrible and this can
still be fixed offline.  By disabling it for now, we remove a lot of red spam
from the CDash site and the ATDM Trilinos summary emails (see trilinos#2933).
@bartlettroscoe
Member Author

FYI: I disabled this test in commit fe5fb4a as part of PR #3871 merged to 'develop' on 11/18/2018. Therefore, it is not shown failing in the updated 'cee-rhel6' builds noted here.

I am putting on the label "Disabled Tests" to get this off of our main board.

@gsjaardema, please let me know if this is something you want to look into fixing. I can provide the exact commands needed to reproduce this on any CEE RHEL6 machine.

@bartlettroscoe added the Disabled Tests label on Nov 20, 2018

searhein pushed a commit to searhein/Trilinos that referenced this issue Nov 26, 2018
…ix-rgdsw

* 'develop' of https:/searhein/Trilinos: (108 commits)
  Panzer: Adapted CurlLaplacianExample into a mixed version that uses both HCurl and HDiv elements in 3D or HCurl and HVol elements in 2D.
  Ctest: Try enabling Fortran on rocketman.
  Ctest: Gemina build fixes?
  Issue 3832: Added lines for GCC 7.3.0
  Testing: specify blas/lapack in enigma scripts
  Ctest: Geminga Tpetra Experimental fix
  Ctest: Fixing AMGx build Galeri configure error
  Intrepid2: Increased tolerance for test InterpolationProjection_HEX to address Issue trilinos#3879
  Ifpack2::ILUT::setParameters: Fix trilinos#3903
  MueLu: testing: revive enigma testing
  Xpetra: Removing code that nobody understands, but isn't right
  Tpetra: Fix trilinos#3898 (unused typedefs)
  Xpetra: Having MatrixFactory2::BuildCopy() copy strided map status (and adding test)
  Removing test that fails repeatedly on all platforms
  MueLu/HHG: form composite coarse operator (trilinos#2798)
  Add safety check, fix typo
  Disable test SEACASIoss_Utst_structured_decomp_MPI_1 and exec build for cee-rhel6 intel builds (trilinos#3891)
  Add support for ctest-s-local-test-driver.sh (TRIL-212)
  Add support for <system_name>/custom_builds.sh, update cee-rhel6 builds (TRIL-212)
  Fix running srun on shiller (TRIL-212)
  ...
@gsjaardema
Contributor

@bartlettroscoe Yes, I would like to look into fixing this. Please let me know how to reproduce on a CEE RHEL6 machine.

@bartlettroscoe
Member Author

@gsjaardema said:

> Yes, I would like to look into fixing this. Please let me know how to reproduce on a CEE RHEL6 machine

I updated the "Steps to Reproduce" above to compensate for the disables I added in PR #3871 in commit fe5fb4a. Once this is fixed, we can just revert that one commit in a new PR.

@gsjaardema
Contributor

As I suspected above, the issue is that the CMake code that is finding the CGNS library (cmake/tribits/common_tpls/FindTPLCGNS.cmake) is not correctly setting the dependency of libcgns.a on libhdf5.a. If I edit the CMakeCache.txt and add the libhdf5.a dependency to TPL_CGNS_LIBRARIES:

TPL_CGNS_LIBRARIES:FILEPATH=/projects/sparc/tpls/cee-rhel6-new/cgns-develop/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/libcgns.a;/projects/sparc/tpls/cee-rhel6-new/hdf5-1.8.20/cee-cpu_intel-17.0.1_intelmpi-5.1.2/lib/libhdf5.a

Then everything builds correctly with no unresolved symbols. The CGNS HDF5 dependency is optional, but I know of no one who uses it without HDF5 these days. I can provide the module that SEACAS uses for the CGNS library, which correctly detects the dependency, if that would be useful, or it can be hard-wired into the existing module.

Bottom line though is that the cgns->hdf5 dependency is missing.
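
Instead of hand-editing CMakeCache.txt, the same value could in principle be passed at configure time. The sketch below assumes that TriBITS uses a user-supplied `TPL_CGNS_LIBRARIES` in place of its own library search. The shell variable names are hypothetical, and the paths are the ones from the cache entry above.

```shell
tpl_root=/projects/sparc/tpls/cee-rhel6-new
tpl_arch=cee-cpu_intel-17.0.1_intelmpi-5.1.2

# libcgns.a first, then the libhdf5.a that it depends on.
cgns_libs="$tpl_root/cgns-develop/$tpl_arch/lib/libcgns.a;$tpl_root/hdf5-1.8.20/$tpl_arch/lib/libhdf5.a"

# Extra argument for the cmake command in "Steps to Reproduce".
# Keep it quoted when passing it on, since the value contains a ';'.
cmake_arg="-DTPL_CGNS_LIBRARIES:FILEPATH=$cgns_libs"
echo "$cmake_arg"
```

The order matters for static archives: with a traditional single-pass linker, libhdf5.a has to come after libcgns.a so that the `H5*` references are already pending when the HDF5 archive is scanned.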

@bartlettroscoe added the labels PA: Data Services and ATDM Sev: Nonblocker on Nov 30, 2018

tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
…or cee-rhel6 intel builds (trilinos#3891)

@bartlettroscoe
Member Author

> Bottom line though is that the cgns->hdf5 dependency is missing.

@gsjaardema, can we just fix this by adding the set of HDF5 libs to the set of CGNS libs manually? That is what we do for some of the other TPLs. (NOTE: The right way to fix this is to extend TriBITS to track dependencies between TPLs as per the larger needed refactoring TriBITSPub/TriBITS#63).

@gsjaardema
Contributor

@bartlettroscoe Yes, that would be a reasonable temporary workaround. CGNS only depends on -lhdf5, which in turn depends on -lz and -ldl.

@gsjaardema
Contributor

I think this has been fixed. If not, please reopen.
