Amesos2::SuperLU_DIST 4.3 segfault #124

Closed
jdbooth opened this issue Feb 2, 2016 · 15 comments
Labels
pkg: Amesos2 type: bug The primary issue is a bug in Trilinos code or tests

Comments

@jdbooth
Contributor

jdbooth commented Feb 2, 2016

Tested with SuperLU_DIST 4.3 (the newest release as of December 31, 2015).
Built with GCC 4.9, with both Trilinos DEBUG + -O0 and SuperLU_DIST debuglvl = 2.
It segfaults when using more than 1 MPI rank.
The memory error is most likely on the Amesos2 side.

@trilinos/amesos2

jdbooth added the type: bug and pkg: Amesos2 labels Feb 2, 2016
jdbooth self-assigned this Feb 2, 2016
@jdbooth
Contributor Author

jdbooth commented Feb 9, 2016

Note: This was seen only with superlu_dist debug > 1, which requires modifying the superlu_dist source before it will build. I changed #ifdef(>) to #ifdef debug.
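For readers unfamiliar with the distinction, here is a minimal sketch of how a level-based preprocessor guard differs from a plain definedness check. It is not the superlu_dist source: the macro name `DEBUGlevel` is borrowed from the wording used elsewhere in this thread, and the function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: a build would normally set this with -DDEBUGlevel=<n>.
   Level 0 stands in for "debugging off" in this sketch. */
#ifndef DEBUGlevel
#define DEBUGlevel 0
#endif

/* Level-based guard: active only when the level reaches the threshold. */
int level_guard_active(void)
{
#if (DEBUGlevel >= 2)
    return 1;
#else
    return 0;
#endif
}

/* Definedness guard: active whenever the macro exists, even at level 0. */
int defined_guard_active(void)
{
#ifdef DEBUGlevel
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper that reports which guards fired in this build. */
void print_guards(void)
{
    /* DEBUGlevel is always defined past the #ifndef block above. */
    assert(defined_guard_active() == 1);
    printf("DEBUGlevel=%d level_guard=%d defined_guard=%d\n",
           DEBUGlevel, level_guard_active(), defined_guard_active());
}
```

Calling `print_guards()` from a small `main` shows the point: swapping a level comparison for `#ifdef` changes which debug code gets compiled in, so a build patched this way may exercise different code paths than the intended debug > 1 configuration.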

@jwillenbring
Member

@amklinv I wanted to make sure you see the latest comment on this. Please see the full issue body.

@amklinv

amklinv commented Feb 9, 2016

To clarify Josh's comment, when you try to build superlu_dist with debug > 1, it won't even build. Josh had to modify the superlu_dist code to get it to build at all. I contacted Sherry on Friday to let her know about this issue, but I have not heard back from her yet.

I compiled using a slightly different version of gcc 4.9 and openmpi 1.8 with debugging turned off, and the test passed with that configuration.

@jdbooth
Contributor Author

jdbooth commented Feb 9, 2016

@jwillenbring For Jim's benefit, I will also post what I told @amklinv. Valgrind reports numerous errors for this build. I started trying to track them down to see whether they come from how we free the superlu structure or from superlu_dist itself. As a first step, I turned on debug for both Trilinos and superlu_dist to see where these warnings/errors were occurring. However, superlu_dist 4.3 will not compile in debug unless you change some of the ifdefs yourself (this is not true for 4.2). If you do change these and run it, you will get a random segfault (again, not true of 4.2).
I plan to work on tracking this down later this month, but I got sidetracked this week with something I have to deliver. At this time, I cannot confirm where the memory errors are coming from. Since 4.2 is working (and 4.3 has only been out for a month), this message stands as a warning that Amesos2 is not 100% confident in its testing of superlu_dist 4.3 until I get to the bottom of all the issues.

@srajama1
Contributor

srajama1 commented Feb 9, 2016

Jim, if it is important for xSDK, can you ask Sherry to check the usage in Amesos2 to confirm everything is OK? We will be happy to help.

@amklinv

amklinv commented Feb 9, 2016

I got a response from Sherry.

"A lot of warnings are related to printing format of (long long int). I thought I had those fixed, but apparently still a lot. It will take me some time to clean up the warnings. Meanwhile, you can ignore the warnings, and run the code, see what happens. [I told her when I compiled the code without debugging enabled, I got a whole bunch of warnings.]

The complex version with high DEBUGlevel / PRNTlevel are not fully tested. I will fix those errors in next release. [This is referring to the build errors Josh was seeing.]"

@jdbooth
Contributor Author

jdbooth commented Feb 9, 2016

In response to Sherry's reply:

  1. Warnings. The warnings I refer to are those from valgrind. They are memory errors, not type-format warnings.
  2. Yes, however, I was never using the complex version. I was only using double as my entry type.

@srajama1
Contributor

Is the correct course for this issue to wait for SuperLU_dist 4.4? We can close the issue and document this error, which occurs with debug turned on in SuperLU_Dist.

Josh: How can you tell that this memory error is in Amesos2?

@ambrad
Contributor

ambrad commented Feb 14, 2016

I'd like to clarify a point re: the valgrind errors.

My valgrind traces show that essentially the same valgrind errors are reported for a build against 4.2 as against 4.3. This is a claim that might be wrong; it would be good if someone else would confirm or contradict that there are meaningful valgrind errors in builds against 4.2. Helpfully, I find that a valgrind trace of the unit test shows essentially the same valgrind messages as a trace of a practical problem, so I believe we can focus on an analysis of just the unit test.

@jdbooth
Contributor Author

jdbooth commented Feb 14, 2016

In regard to @srajama1, I marked it as Amesos2 because we are the ones supporting integration of SuperLU_Dist with Trilinos. Yes, I planned to close this issue once I had time to investigate all of it and write up some notes to document it.

In regard to @ambrad, the valgrind warnings are a different issue. Currently we are working with superlu_dist to try to get rid of them.

@srajama1
Contributor

@jdbooth : It is hard to separate out memory errors and segfaults. It is easier to fix the memory errors and see if we have the same segfault.

@ambrad: If it is reproducible with 4.2, that makes the job easier; as you can see from the comments above, the debug options are not working with 4.3. Either Josh or I will try this out.

@amklinv

amklinv commented Feb 15, 2016

Not sure whether this belongs here or in a separate issue, but the SuperLU_Dist test times out for me with this configuration script. Sometimes, it times out during the first test; sometimes the first test completes quickly and it times out during the second.

do-configure.txt

@ambrad
Contributor

ambrad commented Feb 16, 2016

@amklinv, that is essentially the behavior I noticed that started this thread. A test in an application's test suite intermittently times out on some platforms (timing out at 1500s when a successful run of this test takes ~30s). Intermittent failures are often associated with uninitialized memory or worse, so I ran valgrind on that ctest. Then I wanted to see if essentially the same valgrind messages would show up in the Trilinos ctests; they do. Etc.
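As a generic illustration of why uninitialized memory tends to produce intermittent behavior, here is a minimal sketch. It is not taken from Amesos2 or superlu_dist, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zero-filled workspace of n doubles.
   calloc defines every byte; on IEEE 754 platforms all-zero bytes read
   back as 0.0. With malloc and no explicit fill, the contents would be
   indeterminate instead. */
double *make_workspace(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(double));
}

/* Branches on every entry. If the buffer had come from malloc without a
   fill, each comparison here would depend on leftover heap contents, and
   valgrind's memcheck would report "Conditional jump or move depends on
   uninitialised value(s)". */
size_t count_nonzero(const double *w, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (w[i] != 0.0) {
            ++count;
        }
    }
    return count;
}
```

With the zero-filled allocation the count is always 0. With an unfilled buffer the result could differ from run to run and platform to platform, which is the pattern that makes a failure look intermittent.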

jdbooth closed this as completed Feb 22, 2016
@srajama1
Contributor

Josh: Can I ask why this is closed if the same errors occur even in 4.2?

@amklinv

amklinv commented Feb 23, 2016

@ambrad, were you seeing this timeout behavior with 4.2 as well, or is that new to 4.3?

crtrott added a commit to crtrott/Trilinos that referenced this issue Dec 17, 2020
With these changes, MPI off, and hacking the Kokkos::Experimental::HIP::memory_space
typedef to be Kokkos::Experimental::HIPHostPinnedSpace we get these failures:

124/124 Test #124: TpetraCore_TsqrAdaptor ..........................................   Passed    0.55 sec
95% tests passed, 5 tests failed out of 124
Label Time Summary:
Tpetra    = 147.76 sec*proc (124 tests)
Total Test time (real) = 147.96 sec
The following tests FAILED:
         19 - TpetraCore_idot (Subprocess aborted)
         83 - TpetraCore_MatrixMatrix_UnitTests (NUMERICAL)
        121 - TpetraCore_RowMatrixTransposer_test (Subprocess aborted)
        122 - TpetraCore_RowMatrixTransposer_UnitTests (Failed)
        123 - TpetraCore_CrsMatrix_transpose_sortedRows (Failed)

This IS running on the AMD GPU ...
brian-kelley pushed a commit that referenced this issue Feb 17, 2021
brian-kelley pushed a commit that referenced this issue Mar 30, 2021
5 participants