
Update of Trilinos Develop results in failed NaluCFD/Nalu verification runs #2886

Closed
spdomin opened this issue Jun 5, 2018 · 24 comments

@spdomin
Contributor

spdomin commented Jun 5, 2018

A recent change in NaluCFD by @alanw0 required an update of Trilinos. The new update causes several verification tests to fail.

We have started a bisect and will update this ticket when the results are complete.

Bad: commit 1f9b8c5
Good: commit 7fa1543

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

Trilinos is kind of in flux at the moment due to some issues with the recent Kokkos promotion. Here's some reading material, but the short answer is that it may pay to wait a bit.

FYI, Trilinos' master branch now gets updated on average once every other day, and every day if the Dashboard is clean. @william76 can tell you more. This suggests that best practice would be to depend on the master branch, rather than the develop branch.

#2390
kokkos/kokkos#1652
#2863
#2827

#2874 (comment)
kokkos/kokkos#1653 (comment)
#2879 (comment)

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I don't know if this will "ring a bell" or not, but the error produced when running @spdomin's verification case with the updated Trilinos is as follows:

```
terminate called after throwing an instance of 'std::runtime_error'
what(): /home/spdomin/gitHubWork/scratch_build/packages/Trilinos/packages/ifpack2/src/Ifpack2_Details_Chebyshev_def.hpp:880:

Throw number = 1

Throw test that evaluated to true: STS::isnaninf (computedLambdaMax)

Ifpack2::Chebyshev::compute: Estimation of the max eigenvalue of D^{-1} A failed, by producing Inf or NaN. This probably means that the matrix contains Inf or NaN values, or that it is badly scaled.
```

I thought this might be related to my recent change setting the parameter "compute local triangular constants" -> false, but the error still occurs if I set it back to true. I will dump the matrix and see if there are any NaNs or Infs.
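
For reference, a minimal sketch of one way to dump a Tpetra matrix for offline inspection, using the MatrixMarket writer (the variable `A` and the output filename are assumptions; Nalu's actual dump code may differ):

```cpp
#include <MatrixMarket_Tpetra.hpp>
#include <Tpetra_CrsMatrix.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<double>;

// 'A' is assumed to be a Teuchos::RCP<const crs_matrix_type> holding the
// assembled matrix. NaN/Inf entries appear as literal "nan"/"inf" tokens in
// the output file, which makes them easy to grep for.
Tpetra::MatrixMarket::Writer<crs_matrix_type>::writeSparseFile(
    "dumped_matrix.mm", A);
```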

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

Also, if I run the case five times, some runs pass and some fail with NaNs.

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@spdomin Just curious -- are you running with multiple OpenMP threads?

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 Hm, we haven't changed Chebyshev recently as far as I know. I'm guessing that this is related to @spdomin's observation, and is in turn possibly related to recent sparse matrix-matrix multiply changes. Would you happen to know at what multigrid level this happens?

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I don't currently know what multigrid level, but I'm doing a debug build and will be going in with TotalView, so I'll let you know if I discover new information.

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

My intent is to run with NUM_THREADS = 1, in light SIMD mode, on my blade.

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

Also, this case produces NaNs on the first solve (momentum 3x3), before AMG is hit (GMRES/SGS).

@alanw0
Contributor

alanw0 commented Jun 5, 2018

Hmm. My debug build doesn't hit the throw/NaN. I confirmed that the debug and release builds are using the same version of Trilinos. That makes it hard to debug; I hate it when a bug manifests in release and not debug.

Also, in my build, the execution space is Kokkos::Serial, no threads. (We have a unit-test that prints out the execution space.)

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 Tpetra::MultiVector's constructor fills the multivector with NaNs in debug mode. You can turn this (as well as other debug-mode things) off, even in a debug build, by setting the environment variable TPETRA_DEBUG=OFF. However, code should never be reading those NaNs; something else must be going on.
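
If it's more convenient than exporting the variable in the shell, one could also set it programmatically (a sketch; setenv is POSIX, and the placement before any Tpetra call that consults the flag is the important part):

```cpp
#include <cstdlib>

int main(int argc, char* argv[]) {
  // Must happen before the first Tpetra call that reads TPETRA_DEBUG.
  setenv("TPETRA_DEBUG", "OFF", /*overwrite=*/1);
  // ... initialize Kokkos/Tpetra and run the solve as usual ...
  return 0;
}
```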

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I tried setting TPETRA_DEBUG=OFF; no effect. But that makes sense, since I'm only seeing failures in release mode. I'm seeing the same intermittent behavior as @spdomin, i.e., it fails on some runs and runs successfully on others. When it fails, it is failing during MueLu setup for the continuity equation. The momentum system is solving successfully. I'm dumping the matrices: the momentum matrix contains no NaNs, but the continuity matrix contains NaNs when it fails.
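
A self-contained sketch of that kind of check, scanning the locally owned entries of a Tpetra::CrsMatrix for NaN/Inf (the helper name is hypothetical, and it assumes the Teuchos::ArrayView-based getLocalRowView signature of Tpetra from this era, applied to a fillComplete'd, locally indexed matrix):

```cpp
#include <Tpetra_CrsMatrix.hpp>
#include <Teuchos_ArrayView.hpp>
#include <cmath>
#include <iostream>

using crs_matrix_type = Tpetra::CrsMatrix<double>;
using LO = crs_matrix_type::local_ordinal_type;

// Return true if any locally owned entry of A is NaN or Inf.
bool localMatrixHasNanOrInf(const crs_matrix_type& A)
{
  const LO numLocalRows =
      static_cast<LO>(A.getRowMap()->getNodeNumElements());
  for (LO lrow = 0; lrow < numLocalRows; ++lrow) {
    Teuchos::ArrayView<const LO> cols;
    Teuchos::ArrayView<const double> vals;
    A.getLocalRowView(lrow, cols, vals);
    for (const double v : vals) {
      if (std::isnan(v) || std::isinf(v)) {
        std::cout << "Bad entry in local row " << lrow << ": " << v << "\n";
        return true;
      }
    }
  }
  return false;
}
```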

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I'm running out of ideas, but here's one last piece of information. In our loadComplete function we do this:

```cpp
sharedNotOwnedMatrix_->fillComplete();
ownedMatrix_->doExport(sharedNotOwnedMatrix_, exporter_, ADD);
ownedMatrix_->fillComplete();
```

I'm dumping ownedMatrix_ after this, and I see the NaNs in it. If I also dump ownedMatrix_ before the doExport call, then it never fails and never has NaNs. So this makes it seem like there's some race condition in doExport that doesn't show up if I put the writeSparseFile call immediately before it.
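
One way to narrow that down (a sketch reusing the hypothetical localMatrixHasNanOrInf helper from above, and assuming sharedNotOwnedMatrix_ is an RCP): swap the writeSparseFile call for a traversal that does no file I/O, to see whether merely reading the matrix, rather than the I/O delay, is what hides the failure:

```cpp
sharedNotOwnedMatrix_->fillComplete();
// Cheap stand-in for writeSparseFile: touches every local entry but writes
// no file. If this alone makes the NaNs vanish, the bug is timing/memory
// sensitive rather than anything specific to the MatrixMarket writer.
(void) localMatrixHasNanOrInf(*sharedNotOwnedMatrix_);
ownedMatrix_->doExport(sharedNotOwnedMatrix_, exporter_, ADD);
ownedMatrix_->fillComplete();
```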

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 sharedNotOwnedMatrix has a possibly overlapping row Map that certainly does not equal the range Map. This means that its fillComplete call needs to specify the domain and range Maps explicitly.
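
A minimal sketch of that change, reusing the names from the snippet above and assuming ownedRowsMap (mentioned below) is the one-to-one Map serving as both domain and range Map:

```cpp
// The overlapping matrix's row Map is not one-to-one, so tell fillComplete
// which nonoverlapping Maps define the domain and range of the operator.
sharedNotOwnedMatrix_->fillComplete(ownedRowsMap /* domain Map */,
                                    ownedRowsMap /* range Map */);
ownedMatrix_->doExport(sharedNotOwnedMatrix_, exporter_, ADD);
ownedMatrix_->fillComplete();
```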

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

That sounds like a good clue, @mhoemmen.

I am trying to sort out whether this is on Nalu or Trilinos. Might it simply be that our interface to Trilinos is off, and a newer version is now sensitive to that? I am not sure the bisect on Trilinos will provide any good evidence, but we are still running it to see if it sheds some clues.

@alanw0
Contributor

alanw0 commented Jun 5, 2018

I will try specifying the range and domain maps, and let you know if that makes a difference.

@alanw0
Contributor

alanw0 commented Jun 5, 2018

I think in our case, owned-rows-map == range-map == domain-map

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 Right, this is just about the sharedNotOwnedMatrix. Its row Map is (in fact, had better be!) not the same as the range Map.

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen yes, that's right: the sharedNotOwnedMatrix has a sharedNotOwnedRowsMap which is not the same as ownedRowsMap. Unfortunately, specifying range and domain Maps for fillComplete doesn't solve the intermittent failure. I'm currently pursuing some valgrind issues, which so far have not proven to be related, but I'll let you know if anything gets resolved or if any new evidence surfaces.

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

Yes, we think that this is actually a memory nuance in Nalu... I am looking into that.

@alanw0
Contributor

alanw0 commented Jun 6, 2018

Robert created a patch which altered the wedge face-grad-op function, and running with it seems to fix this issue, indicating that it is a memory issue in Nalu. Still very puzzling that it was also seemingly solved by reverting the Trilinos version...

@mhoemmen
Contributor

mhoemmen commented Jun 6, 2018

@alanw0 That's weird. Did y'all change some solver settings that maybe tested out a bit of Trilinos that y'all haven't tried before? I don't want to miss this chance to catch a bug :)

@spdomin
Contributor Author

spdomin commented Jun 6, 2018

@mhoemmen, @mbarone81 is running a bisect and will report back on his findings.

@mbarone81

I attempted to bisect Trilinos to see when the issue with Nalu first became apparent. The final bisect step was unsuccessful because Trilinos did not build. Example of the compile errors:

```
Trilinos/packages/kokkos-kernels/src/common/KokkosKernels_Utils.hpp:247:54: error: no matching function for call to 'atomic_fetch_add(long unsigned int*, int)'
```

This was for commit dee20c5.

The last 'good Trilinos' identified by the bisect was 6b95d93. The first 'bad Trilinos' identified by the bisect was 35f8c36. However, subsequent commit 279052e was again 'good', while all sampled later commits were 'bad'.

So the bisect does not identify a single commit where Nalu behavior changed on the test problem. This seems consistent with the evidence gathered so far that points to a Nalu memory bug as the culprit.

@spdomin
Contributor Author

spdomin commented Jun 8, 2018

Despite the initial data that suggested a nuance with the Trilinos update, with Matt's bisect and Robert's revamping of the Wedge::face_grad_op, we seem to have resolved the issue. As such, I am closing this with thanks to everyone involved.

spdomin closed this as completed Jun 8, 2018