Update of Trilinos Develop results in failed NaluCFD/Nalu verification runs #2886
Trilinos is kind of in flux at the moment due to some issues with the recent Kokkos promotion. Here's some reading material, but the short answer is that it may pay to wait a bit. FYI, Trilinos' master branch now gets updated on average once every other day, and every day if the Dashboard is clean. @william76 can tell you more. This suggests that best practice would be to depend on the master branch, rather than the develop branch. #2390 #2874 (comment)
@mhoemmen I don't know if this will "ring a bell" or not, but the error produced when running @spdomin's verification case with the updated Trilinos is as follows:

```
terminate called after throwing an instance of 'std::runtime_error'
Throw number = 1
Throw test that evaluated to true: STS::isnaninf (computedLambdaMax)
Ifpack2::Chebyshev::compute: Estimation of the max eigenvalue of D^{-1} A failed, by producing Inf or NaN. This probably means that the matrix contains Inf or NaN values, or that it is badly scaled.
```

I thought that maybe this was related to my recent change of setting the parameter "compute local triangular constants" -> false, but the error still occurs if I set it back to true. I will dump the matrix and see if there are any NaNs or Infs.
Also, if I run the case five times, it will pass on some runs and fail with NaNs on others.
@spdomin Just curious -- are you running with multiple OpenMP threads?
@mhoemmen I don't currently know what multigrid level, but I'm doing a debug build and will be going in with totalview, so I'll let you know if I discover new information.
My intent is to be running with NUM_THREADS = 1 and to be running in light SIMD mode off of my blade. |
Also, this case produces NaNs on the first solve (momentum 3x3, GMRES/SGS), before AMG is hit.
Hmm. My debug build doesn't hit the throw/NaN. I confirmed that the debug and release builds are using the same version of Trilinos. That makes it hard to debug; I hate it when a bug manifests in release and not debug. Also, in my build the execution space is Kokkos::Serial, no threads. (We have a unit test that prints out the execution space.)
@alanw0 Tpetra::MultiVector's constructor fills the multivector with NaNs in debug mode. You can turn this (as well as other debug-mode things) off, even in a debug build, by setting the environment variable TPETRA_DEBUG=OFF.
@mhoemmen I tried setting TPETRA_DEBUG=OFF, no effect. But that makes sense, since I'm only seeing failures in release mode. I'm seeing the same intermittent behavior as @spdomin, i.e., it fails on some runs and runs successfully on others. When it fails, it is failing during MueLu setup for the continuity equation; the momentum system is solving successfully. I'm dumping the matrices: the momentum matrix contains no NaNs, but the continuity matrix contains NaNs when it fails.
@mhoemmen I'm running out of ideas, but here's one last piece of information: |
@alanw0 |
That sounds like a good clue, @mhoemmen. I am trying to sort out whether this is a Nalu issue or a Trilinos issue. Might it simply be that our interface to Trilinos is off and a newer version is now sensitive to that? I am not sure whether the bisect on Trilinos will provide any good evidence, but we are still running it to see if it sheds some clues.
I will try specifying the range and domain maps, and let you know if that makes a difference. |
I think in our case, owned-rows-map == range-map == domain-map |
@alanw0 Right, this is just about the sharedNotOwnedMatrix. Its row Map is (in fact, had better be!) not the same as the range Map.
@mhoemmen Yes, that's right: the sharedNotOwnedMatrix has a sharedNotOwnedRowsMap, which is not the same as ownedRowsMap. Unfortunately, specifying range and domain maps for fillComplete doesn't solve the intermittent failure. I'm currently pursuing some valgrind issues which so far have not proven to be related, but I'll let you know if anything gets resolved or if any new evidence surfaces.
Yes, we think that this is actually a memory nuance in Nalu... I am looking into that. |
Robert created a patch that altered the wedge face-grad-op function, and running with that patch seems to fix this issue, indicating that it is a memory issue in Nalu. It is still very puzzling that the problem was also seemingly solved by reverting the Trilinos version...
@alanw0 That's weird. Did y'all change some solver settings that maybe tested out a bit of Trilinos that y'all haven't tried before? I don't want to miss this chance to catch a bug :)
@mhoemmen, @mbarone81 is running a bisect and will report back on his findings. |
I attempted to bisect Trilinos to see when the issue with Nalu first became apparent. The final bisect step was unsuccessful because Trilinos did not build. An example of the compile errors:

```
Trilinos/packages/kokkos-kernels/src/common/KokkosKernels_Utils.hpp:247:54: error: no matching function for call to 'atomic_fetch_add(long unsigned int*, int)'
```

This was for commit dee20c5. The last 'good' Trilinos identified by the bisect was 6b95d93. The first 'bad' Trilinos identified by the bisect was 35f8c36. However, the subsequent commit 279052e was again 'good', while all sampled later commits were 'bad'. So the bisect does not identify a single commit where Nalu behavior changed on the test problem. This seems consistent with the evidence gathered so far that points to a Nalu memory bug as the culprit.
Despite the initial data that suggested a nuance with the Trilinos update, with Matt's bisect and Robert's revamping of Wedge::face_grad_op, we seem to have resolved the issue. As such, I am closing this with thanks to everyone involved.
A recent change in NaluCFD by @alanw0 required a new update from Trilinos. The new update causes several verification tests to fail.
We are starting a bisect and will update this ticket when the results are in.
Bad: commit 1f9b8c5
Good: commit 7fa1543