
Update of Trilinos Develop results in failed NaluCFD/Nalu verification runs #2886

Closed
spdomin opened this issue Jun 5, 2018 · 24 comments

@spdomin
Contributor

spdomin commented Jun 5, 2018

A recent change in NaluCFD by @alanw0 required an update of Trilinos. The new update causes several verification tests to fail.

We have started a bisect and will update this ticket when the results are complete.

Bad: commit 1f9b8c5
Good: commit 7fa1543

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

Trilinos is kind of in flux at the moment due to some issues with the recent Kokkos promotion. Here's some reading material, but the short answer is that it may pay to wait a bit.

FYI, Trilinos' master branch now gets updated on average once every other day, and every day if the Dashboard is clean. @william76 can tell you more. This suggests that best practice would be to depend on the master branch, rather than the develop branch.

#2390
kokkos/kokkos#1652
#2863
#2827

#2874 (comment)
kokkos/kokkos#1653 (comment)
#2879 (comment)

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I don't know if this will "ring a bell" or not, but the error produced when running @spdomin's verification case with the updated Trilinos is as follows:

```
terminate called after throwing an instance of 'std::runtime_error'
what(): /home/spdomin/gitHubWork/scratch_build/packages/Trilinos/packages/ifpack2/src/Ifpack2_Details_Chebyshev_def.hpp:880:

Throw number = 1

Throw test that evaluated to true: STS::isnaninf (computedLambdaMax)

Ifpack2::Chebyshev::compute: Estimation of the max eigenvalue of D^{-1} A failed, by producing Inf or NaN. This probably means that the matrix contains Inf or NaN values, or that it is badly scaled.
```

I thought this might be related to my recent change setting the parameter "compute local triangular constants" -> false, but the error still occurs if I set it back to true. I will dump the matrix and see if there are any NaNs or Infs.
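
For reference, a minimal sketch of one way to dump a Tpetra matrix for offline inspection, using the MatrixMarket writer (the variable `A` and the output filename are assumptions; Nalu's actual dump code may differ):

```cpp
#include <MatrixMarket_Tpetra.hpp>
#include <Tpetra_CrsMatrix.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<double>;

// 'A' is assumed to be a Teuchos::RCP<const crs_matrix_type> holding the
// assembled matrix. NaN/Inf entries appear as literal "nan"/"inf" tokens in
// the output file, which makes them easy to grep for.
Tpetra::MatrixMarket::Writer<crs_matrix_type>::writeSparseFile(
    "dumped_matrix.mm", A);
```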

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

Also, if I run the case five times, some runs pass and some fail with NaNs.

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@spdomin Just curious -- are you running with multiple OpenMP threads?

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 Hm, we haven't changed Chebyshev recently as far as I know. I'm guessing that this is related to @spdomin's observation, and is in turn possibly related to recent sparse matrix-matrix multiply changes. Would you happen to know at what multigrid level this happens?

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I don't currently know what multigrid level, but I'm doing a debug build and will be going in with TotalView, so I'll let you know if I discover new information.

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

My intent is to run with NUM_THREADS = 1, in light SIMD mode, on my blade.

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

Also, this case produces NaNs on the first solve (momentum 3x3), before AMG is hit (GMRES/SGS).

@alanw0
Contributor

alanw0 commented Jun 5, 2018

Hmm. My debug build doesn't hit the throw/NaN. I confirmed that the debug and release builds are using the same version of Trilinos. That makes it hard to debug; I hate it when a bug manifests in release and not debug.

Also, in my build, the execution space is Kokkos::Serial, no threads. (We have a unit-test that prints out the execution space.)

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 Tpetra::MultiVector's constructor fills the multivector with NaNs in debug mode. You can turn this (as well as other debug-mode things) off, even in a debug build, by setting the environment variable TPETRA_DEBUG=OFF. However, code should never be reading those NaNs; something else must be going on.
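
If it's more convenient than exporting the variable in the shell, one could also set it programmatically (a sketch; setenv is POSIX, and the placement before any Tpetra call that consults the flag is the important part):

```cpp
#include <cstdlib>

int main(int argc, char* argv[]) {
  // Must happen before the first Tpetra call that reads TPETRA_DEBUG.
  setenv("TPETRA_DEBUG", "OFF", /*overwrite=*/1);
  // ... initialize Kokkos/Tpetra and run the solve as usual ...
  return 0;
}
```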

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I tried setting TPETRA_DEBUG=OFF; no effect. But that makes sense, since I'm only seeing failures in release mode. I'm seeing the same intermittent behavior as @spdomin, i.e., it fails on some runs and runs successfully on others. When it fails, it is failing during MueLu setup for the continuity equation. The momentum system is solving successfully. I'm dumping the matrices: the momentum matrix contains no NaNs, but the continuity matrix contains NaNs when it fails.
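
A self-contained sketch of that kind of check, scanning the locally owned entries of a Tpetra::CrsMatrix for NaN/Inf (the helper name is hypothetical, and it assumes the Teuchos::ArrayView-based getLocalRowView signature of Tpetra from this era, applied to a fillComplete'd, locally indexed matrix):

```cpp
#include <Tpetra_CrsMatrix.hpp>
#include <Teuchos_ArrayView.hpp>
#include <cmath>
#include <iostream>

using crs_matrix_type = Tpetra::CrsMatrix<double>;
using LO = crs_matrix_type::local_ordinal_type;

// Return true if any locally owned entry of A is NaN or Inf.
bool localMatrixHasNanOrInf(const crs_matrix_type& A)
{
  const LO numLocalRows =
      static_cast<LO>(A.getRowMap()->getNodeNumElements());
  for (LO lrow = 0; lrow < numLocalRows; ++lrow) {
    Teuchos::ArrayView<const LO> cols;
    Teuchos::ArrayView<const double> vals;
    A.getLocalRowView(lrow, cols, vals);
    for (const double v : vals) {
      if (std::isnan(v) || std::isinf(v)) {
        std::cout << "Bad entry in local row " << lrow << ": " << v << "\n";
        return true;
      }
    }
  }
  return false;
}
```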

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen I'm running out of ideas, but here's one last piece of information. In our loadComplete function we do this:

```cpp
sharedNotOwnedMatrix_->fillComplete();
ownedMatrix_->doExport(sharedNotOwnedMatrix_, exporter_, ADD);
ownedMatrix_->fillComplete();
```

I'm dumping ownedMatrix_ after this, and I see the NaNs in it. If I also dump ownedMatrix_ before the doExport call, then it never fails and never has NaNs. So this makes it seem like there's some race condition in doExport that doesn't show up if I put the writeSparseFile call immediately before it.
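
One way to narrow that down (a sketch reusing the hypothetical localMatrixHasNanOrInf helper from above, and assuming sharedNotOwnedMatrix_ is an RCP): swap the writeSparseFile call for a traversal that does no file I/O, to see whether merely reading the matrix, rather than the I/O delay, is what hides the failure:

```cpp
sharedNotOwnedMatrix_->fillComplete();
// Cheap stand-in for writeSparseFile: touches every local entry but writes
// no file. If this alone makes the NaNs vanish, the bug is timing/memory
// sensitive rather than anything specific to the MatrixMarket writer.
(void) localMatrixHasNanOrInf(*sharedNotOwnedMatrix_);
ownedMatrix_->doExport(sharedNotOwnedMatrix_, exporter_, ADD);
ownedMatrix_->fillComplete();
```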

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 sharedNotOwnedMatrix has a possibly overlapping row Map that certainly does not equal the range Map. This means that its fillComplete call needs to specify the domain and range Maps explicitly.
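
A minimal sketch of that change, reusing the names from the snippet above and assuming ownedRowsMap (mentioned below) is the one-to-one Map serving as both domain and range Map:

```cpp
// The overlapping matrix's row Map is not one-to-one, so tell fillComplete
// which nonoverlapping Maps define the domain and range of the operator.
sharedNotOwnedMatrix_->fillComplete(ownedRowsMap /* domain Map */,
                                    ownedRowsMap /* range Map */);
ownedMatrix_->doExport(sharedNotOwnedMatrix_, exporter_, ADD);
ownedMatrix_->fillComplete();
```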

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

That sounds like a good clue, @mhoemmen.

I am trying to sort out whether this is on Nalu or Trilinos. Might it simply be that our interface to Trilinos is off, and a newer version is now sensitive to that? I am not sure the bisect on Trilinos will provide any good evidence, but we are still running it to see if it sheds some clues.

@alanw0
Contributor

alanw0 commented Jun 5, 2018

I will try specifying the range and domain maps, and let you know if that makes a difference.

@alanw0
Contributor

alanw0 commented Jun 5, 2018

I think in our case, owned-rows-map == range-map == domain-map

@mhoemmen
Contributor

mhoemmen commented Jun 5, 2018

@alanw0 Right, this is just about the sharedNotOwnedMatrix. Its row Map is (in fact, had better be!) not the same as the range Map.

@alanw0
Contributor

alanw0 commented Jun 5, 2018

@mhoemmen yes, that's right: the sharedNotOwnedMatrix has a sharedNotOwnedRowsMap which is not the same as ownedRowsMap. Unfortunately, specifying range and domain Maps for fillComplete doesn't solve the intermittent failure. I'm currently pursuing some valgrind issues, which so far have not proven to be related, but I'll let you know if anything gets resolved or if any new evidence surfaces.

@spdomin
Contributor Author

spdomin commented Jun 5, 2018

Yes, we think that this is actually a memory nuance in Nalu... I am looking into that.

@alanw0
Contributor

alanw0 commented Jun 6, 2018

Robert created a patch which altered the wedge face-grad-op function, and running with it seems to fix this issue, indicating that it is a memory issue in Nalu. Still very puzzling that it was also seemingly solved by reverting the Trilinos version...

@mhoemmen
Contributor

mhoemmen commented Jun 6, 2018

@alanw0 That's weird. Did y'all change some solver settings that maybe tested out a bit of Trilinos that y'all haven't tried before? I don't want to miss this chance to catch a bug :)

@spdomin
Contributor Author

spdomin commented Jun 6, 2018

@mhoemmen, @mbarone81 is running a bisect and will report back on his findings.

@mbarone81

I attempted to bisect Trilinos to see when the issue with Nalu first became apparent. The final bisect step was unsuccessful because Trilinos did not build. Example of the compile errors:

```
Trilinos/packages/kokkos-kernels/src/common/KokkosKernels_Utils.hpp:247:54: error: no matching function for call to 'atomic_fetch_add(long unsigned int*, int)'
```

This was for commit dee20c5.

The last 'good Trilinos' identified by the bisect was 6b95d93. The first 'bad Trilinos' identified by the bisect was 35f8c36. However, subsequent commit 279052e was again 'good', while all sampled later commits were 'bad'.

So the bisect does not identify a single commit where Nalu behavior changed on the test problem. This seems consistent with the evidence gathered so far that points to a Nalu memory bug as the culprit.

@spdomin
Contributor Author

spdomin commented Jun 8, 2018

Despite the initial data that suggested a nuance with the Trilinos update, with Matt's bisect and Robert's revamping of the Wedge::face_grad_op, we seem to have resolved the issue. As such, I am closing this with thanks to everyone involved.

spdomin closed this as completed Jun 8, 2018