Memory check failing for a Nalu application test #594
Stefan, do any of the NaluRTest simulations show similar Valgrind warnings? How big is the production run that you ran (DOFs, processes)? Jonathan |
This is a tiny 750k-element, sixteen-core job that is not part of the regression test suite. However, my guess is that the performance/uqSlidingMesh case would demonstrate this issue as well (just make sure to reduce the termination time step to facilitate faster turnaround). In any case, the valgrind results clearly show the problem code calls. Why not just fix the memory issues already documented? As noted, if a debug build would help, let me know. |
That's what I intend to do, but it's helpful if I can reproduce it locally. |
I am running the debug case now so that you have exact line numbers. Once I have that, I will send it off. If you provide a patch, I can apply and test. I am running an older Trilinos code base with head Nalu and managed to get past this immediate impediment. In the meantime, I will push this exact case to a common platform for you to test (private email). |
My debug case took forever and eventually was killed... We may just need to rely on the opt runs and expertise in looking at the code in question. I will try once more to see if a debug valgrind run works. The previous case hung at the MueLu step. Any news on your end? |
About how long should this simulation or the uqSlidingMesh take to run? |
Nothing to report yet. The simulations are really slow to run. |
You need to reduce the total number of time steps that the UQ performance tests run. The problem case I sent you takes two time steps.
I'm building Nalu (0da09d63) against trilinos develop, 15f990a. I'm seeing many valgrind warnings in what appears to be the assembly process:
Here are some examples (valgrind output elided):
|
Now rebuilding Nalu against 930cce3, the Trilinos SHA1 that was the last merge with Sierra. |
@spdomin Running the small simulation (rotatingcube750k_2blks) you sent me, I'm seeing many valgrind warnings in STK and elsewhere when I use Nalu 0da09d6 built with Trilinos 930cce3. The warnings occur during the assembly phase. I will put the full log on the CEE LAN in ~jhu/forSpdomin. I am concerned that this is not the same version of Nalu and/or Trilinos that you are using. Can you confirm whether the versions are the same? There may be an issue in the Tpetra Distributor, but thus far, I can't get to the errors there because of the prior valgrind warnings. |
After reconfiguring Trilinos 930cce3 with |
My hope was to trigger an exception in the Tpetra Distributor where the valgrind logs indicate a UMR. |
Update: I rebuilt Trilinos development branch (696d3ef) and Nalu master (0da09d6) using the SEMS module environment: gcc 4.9.2, OpenMPI 1.10.1, boost 1.55, zlib 1.2.8, hdf5 1.8.12, netcdf 4.3.2. Additionally, @mhoemmen added bounds checking to Tpetra::MultiVector’s pack and unpack kernels where the valgrind error occurs (thanks @mhoemmen). All of the prior STK valgrind errors are now gone. The simulation has been running on my workstation for about 17 hours, and has just finished the first momentum solve. So it may be a day or so before the simulation hits the problem area. |
I am using the latest github Nalu version. The memory issues that I reported through the Sierra build are from a very old Nalu version under Sierra. However, since it provided line numbers, I thought this would be useful. I looked at your memory check log. Those issues in STK will be reported to the STK team. Thanks |
Thanks for the information on Nalu. What branch of Trilinos are you using - a release, master, or develop? And what's the SHA1 of the branch? |
By the way, I'm not confident at all that the valgrind errors for STK are real. Those showed up when I used TPLs that I rolled on my own. I haven't seen the errors (yet) since I rebuilt with the SEMS TPLs. |
I test Trilinos nightly and have noted the same kills. However, the specific case that I am running now uses the following: commit 6fec5c0 In my case, I did not see the STK issues, so perhaps they are new? I will still report it to the STK team since there are real issues surrounding my usage of custom ghosting. |
What branch are you on: develop, master, or a release? |
Trilinos master. I was instructed to avoid the develop location, although as you have seen from recent messages, I am considering switching my testing protocol to find issues faster. Please call me today since a lot of the above conversation seems to indicate a personal connection is required. |
The rotatingcube750k_2blks simulation on 16 cores finished last night. The error reported in UnpackArrayMultiColumn did not manifest. The only errors reported are related to OpenMPI, which, if memory serves, has never been clean under valgrind. Note that I was using the Trilinos "develop" branch.
This suggests to me that there was a problem in master that has since been fixed in develop. There are now just a few diffs between master and develop:
|
I test against head. If this is fixed in develop then when will head be updated? Also, what is the policy when a bug is found in master? Does an update occur ASAP? Also, did you start testing off of develop straight out? |
@spdomin It looks like master was updated to be synchronized with develop since my last post. @bmpersc Could you verify whether this is correct?
If there is a serious bug in master (which I'd say this one is), we would try to patch master asap.
Yes. It wasn't clear to me which branch you were using. In the past you've used the Trilinos development version, I assumed that's what you were using this time. |
Yes I updated the master branch from develop this morning. |
Okay, I was lax in my usage of HEAD... In that last message, I should have said master... Sorry, I will be more precise next time. |
I have updated my Nalu executable using the following Trilinos SHA-1 from master: commit d96723d. When I run the cube simulation with the following memory diagnostic: realms:
I still see a slight increase in max memory as a function of simulation time step. Memory Overview at steps 1 and 45: nalu memory: total (over all cores) current/high-water mark= 5.7514 G 6.14105 G. I will let this case run out to see if it fails as before. |
This is really depressing. My simulation died at the same code path as before, in the MueLu solve at step 492, using the above Trilinos master SHA-1. @jhux2, can you try running your case without memory checking to see if it also dies at this step? | @spdomin Sure... I am not holding out hope that memory checking causes this, given that my production run died at the same step without this diagnostic active; however, I will report back in the morning. |
With the Trilinos SHA-1 d96723df90660185a9d117eaa7788b085c9afd2d, the Trilinos install as follows: /gitHubWork/scratch_build/install/gcc4.8.5/Trilinos_stable_release/include/MueLu_XpetraOperator_decl.hpp and the Nalu config as follows: nalu_install_dir=/gitHubWork/scratch_build/install/gcc4.8.5 I am still seeing the following memory errors: ==14974== Conditional jump or move depends on uninitialised value(s)
|
@spdomin Would you consider trying a debug build of Trilinos?
This turns on bounds checking for both Kokkos::View and Teuchos::Array*, as well as other debug-mode checks. Tpetra in particular does extra checking specifically on the pack and unpack kernels when you set these options. These checks have a run-time cost, but are cheaper than running valgrind. (They should still be compatible with valgrind, but they make valgrind unnecessary for common errors.) |
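The specific configure options did not survive the export; a sketch of what such a debug configure typically looks like follows. The option names are assumptions based on Trilinos' TriBITS/CMake system, not a quote of the stripped comment:

```shell
# Sketch only: a debug configure of Trilinos enabling the extra run-time
# checks discussed above. Paths are placeholders; option names assumed:
# - Trilinos_ENABLE_DEBUG turns on debug-mode checks package-wide
# - Teuchos_ENABLE_DEBUG adds bounds checking to Teuchos::Array*
# - Kokkos_ENABLE_DEBUG adds bounds checking to Kokkos::View
cmake \
  -D CMAKE_BUILD_TYPE:STRING=DEBUG \
  -D Trilinos_ENABLE_DEBUG:BOOL=ON \
  -D Teuchos_ENABLE_DEBUG:BOOL=ON \
  -D Kokkos_ENABLE_DEBUG:BOOL=ON \
  /path/to/Trilinos
```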
Once we fix this, we'll write a retrospective summary. Per discussion at Trilinos Leaders' Meeting today, we've listed this summary-writing activity as a separate issue, namely #632. |
This may make the Valgrind errors go away, but I'm not convinced that this is the issue. It's worth crossing off our list, though! |
This is related to github issues #594 and #294. Valgrind is issuing spurious warnings if the target multivector of a doImport/doExport call is uninitialized in MueLu::Hierarchy. This commit initializes the vectors and eliminates the warnings.
Build/Test Cases Summary (Enabled Packages: MueLu; enabled all forward packages):
0) MPI_DEBUG => skipped configure, build, test due to no enabled packages => does not affect push readiness (-1.00 min)
1) SERIAL_RELEASE => skipped configure, build, test due to no enabled packages => does not affect push readiness (-1.00 min)
2) MPI_SS_DEBUG => passed: passed=138, notpassed=0 (44.31 min)
3) SERIAL_SS_RELEASE => passed: passed=50, notpassed=0 (1.48 min)
Closing this bug report. Please reopen if necessary. |
Generally, it would seem that we should verify a change fixed an issue before we close, right? I will build, retest, and report back tomorrow. That is, unless you have run my production cube case and know that it now runs? |
I verified that my runs are free of the memory issues. I am testing the production run now. Also, I had a build error in develop: /home/spdomin/gitHubWork/Nalu/src/TpetraLinearSystem.C: In member function ‘virtual void sierra::nalu::TpetraLinearSystem::finalizeLinearSystem()’: which I fixed by using Teuchos::rcp. Is that something new today in develop? |
@spdomin Thanks for the update. rcp has always been part of the Teuchos namespace, so I'm not sure how line 850 ever compiled. |
I made some changes last week to improve build time, by not importing so much stuff into the Tpetra namespace. Many (but weirdly, not all) Teuchos classes were getting imported into the Tpetra namespace. I got rid of those imports. |
Right. Build issues are over. I have the production test running and will report back later. Thanks. |
It is perfectly safe (and often beneficial) to import class names from one namespace to another. However, it is never safe to import non-member function names. See Appendix D.2 "Amendments for ’using’ declarations and directives": Why constantly write |
@bartlettroscoe My eventual goal is to stop including all those Teuchos header files in Tpetra_ConfigDefs.hpp, since more includes means longer build times. In order for Tpetra_ConfigDefs.hpp not to include so many Teuchos header files, I first had to get rid of the "using Teuchos::$THING" declarations in Tpetra_ConfigDefs.hpp. I could put those "using" declarations elsewhere, but it's not clear where they should go in order to avoid duplication. I would really rather just not have them. This only affects Tpetra developers; users shouldn't need to care. |
@spdomin How is the simulation failing? Can you tell if the simulation is running out of memory? |
@ambrad thought this might be related to a Drekar issue we are seeing. Looking back, at about the same time @spdomin started seeing this issue, Drekar started to have solver issues too. It manifested as growth in memory over multiple solves. @pwxy reported the following today in his efforts to track this down:

Here is more information concerning the memory growth issue during the Newton steps when TFQMR is used on the Drekar 3D MHD generator test case (70x10x10 elements run on a single MPI process). Drekar is built with OpenMPI 1.8.8 and gnu 4.9.3.

- epetra/aztec/ML with ifpack ILU smoother, 3-level ML: after 10 Newton steps the memory hardly grows, increasing by only 0.7% over the end of the first Newton step.
- epetra/aztec/ML with ML SGS smoother, 3-level ML: after 10 Newton steps the memory doesn't grow compared to the end of the first Newton step.
- epetra/aztec/ML with ifpack ILU smoother, forced to one level (so a 1-level additive Schwarz domain decomposition preconditioner): after 10 Newton steps the memory hardly grows, increasing by only 0.5%.

So epetra/aztec/ML behaves the way one would expect in terms of memory change during Newton steps with the TFQMR Krylov solver (just a tiny memory growth, less than 1%). However, tpetra/belos/muelu/ifpack2 RILUK is a different story:

- tpetra/belos/muelu/ifpack2 with SGS smoother (not RILUK), forced to 1 level: everything is hunky-dory; after 10 Newton steps of TFQMR the memory does not grow, as one would expect.
- tpetra/belos/muelu/ifpack2 with RILUK, forced to 1 level: the memory increases by 17% over the end of the first Newton step. Not good.
- tpetra/belos/muelu/ifpack2 with SGS (not RILUK), 3-level MueLu: after 10 Newton steps the memory increases by 4.4% over the end of the first Newton step. Compare with epetra/aztec/ML with the ML SGS smoother, where there was no memory growth.
- tpetra/belos/muelu/ifpack2 with RILUK, 3-level MueLu: after 10 Newton steps the memory increases by 25%. Not good. Compare this with epetra/aztec/ML, where the memory grew by only 0.7%.

This limited data points to a potential memory issue with ifpack2 RILUK. However, MueLu with ifpack2 SGS also has larger memory growth than it should. Thanks @trilinos/panzer @eric-c-cyr |
To clarify, #558 is a separate issue, at least in symptom. I'm looking into both simultaneously because I suspect similar analyses of the same examples may yield the solution to both problems, whether or not they are related. |
#558 is a separate issue. Nalu doesn't use the ifpack2 AdditiveSchwarz with overlap. |
I think the memory growth issue with tpetra/belos/muelu/ifpack2 RILUK that I'm seeing with drekar is different from the issue in nalu. The former issue exists in the February 2016 trilinos code base. |
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
I have a full-up production run now failing using the latest Trilinos version. This test reinitializes the linear solver each and every time step and, as usual, solves many systems over the simulation before it fails (repeatably) at step 490. A memory check of the test revealed the following suspect memory reports.