Memory check failing for a Nalu application test #594

Closed
spdomin opened this issue Aug 30, 2016 · 75 comments
Labels
CLOSED_DUE_TO_INACTIVITY: Issue or PR has been closed by the GitHub Actions bot due to inactivity.
MARKED_FOR_CLOSURE: Issue or PR is marked for auto-closure by the GitHub Actions bot.

Comments

@spdomin
Contributor

spdomin commented Aug 30, 2016

I have a full-up production run that now fails with the latest Trilinos version. This test reinitializes the linear solver at every time step and, as usual, solves many systems over the simulation before it fails (repeatably) at step 490. A memory check of the test produced the following suspect reports.

==48197== Conditional jump or move depends on uninitialised value(s)
==48197==    at 0x1EDD3D1: void Kokkos::parallel_for<Tpetra::KokkosRefactor::Details::UnpackArrayMultiColumn<Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void>, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Tpetra::KokkosRefactor::Details::InsertOp> >(unsigned long, Tpetra::KokkosRefactor::Details::UnpackArrayMultiColumn<Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void>, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Tpetra::KokkosRefactor::Details::InsertOp> const&, std::string const&) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1EDD4E3: Tpetra::KokkosRefactor::Details::UnpackArrayMultiColumn<Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void>, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Tpetra::KokkosRefactor::Details::InsertOp>::unpack(Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void> const&, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Tpetra::KokkosRefactor::Details::InsertOp const&, unsigned long) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1F8CAEB: Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::unpackAndCombineNew(Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<unsigned long const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, unsigned long, Tpetra::Distributor&, Tpetra::CombineMode) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x217967E: Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::doTransferNew(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Tpetra::Distributor&, Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::ReverseOption, bool) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x2176C82: Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::doTransfer(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Teuchos::ArrayView<int const> const&, Teuchos::ArrayView<int const> const&, Teuchos::ArrayView<int const> const&, Teuchos::ArrayView<int const> const&, Tpetra::Distributor&, Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::ReverseOption) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x2175AFB: Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::doExport(Tpetra::SrcDistObject const&, Tpetra::Import<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Tpetra::CombineMode) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xF39239: Xpetra::TpetraMultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::doExport(Xpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::Import<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::CombineMode) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1555EF5: MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Iterate(Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&, MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::ConvData, bool, int) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1555BD2: MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Iterate(Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&, MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::ConvData, bool, int) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x165D2F8: MueLu::TpetraOperator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::apply(Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const&, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>&, Teuchos::ETransp, double, double) const (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xFF84F8: Belos::LinearProblem<double, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>, Tpetra::Operator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::apply(Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const&, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>&) const (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xFFD34F: Belos::PseudoBlockGmresIter<double, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>, Tpetra::Operator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::iterate() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x100FEA5: Belos::PseudoBlockGmresSolMgr<double, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>, Tpetra::Operator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::solve() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xEF64E1: sierra::nalu::TpetraLinearSolver::solve(Teuchos::RCP<Tpetra::Vector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> >, int&, double&) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x128DFBF: sierra::nalu::TpetraLinearSystem::solve(stk::mesh::FieldBase*) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x11442D2: sierra::nalu::EquationSystem::assemble_and_solve(stk::mesh::FieldBase*) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x118EC7E: sierra::nalu::LowMachEquationSystem::solve_and_update() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xEDBBDC: sierra::nalu::EquationSystems::solve_and_update() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xEA6D68: sierra::nalu::Realm::advance_time_step() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xE6828F: sierra::nalu::TimeIntegrator::integrate_realm() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xE61ACE: main (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==  Uninitialised value was created by a heap allocation
==48197==    at 0x4A068FE: malloc (vg_replace_malloc.c:270)
==48197==    by 0x265DD35: Kokkos::HostSpace::allocate(unsigned long) const (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x265E7A9: Kokkos::Experimental::Impl::SharedAllocationRecord<Kokkos::HostSpace, void>::SharedAllocationRecord(Kokkos::HostSpace const&, std::string const&, unsigned long, void (*)(Kokkos::Experimental::Impl::SharedAllocationRecord<void, void>*)) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1F7C149: Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>::classic>::dual_view_type (anonymous namespace)::allocDualView<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >(unsigned long, unsigned long, bool) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1F8101E: Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::MultiVector(Teuchos::RCP<Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&, unsigned long, bool) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xF7D20D: Xpetra::MultiVectorFactory<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Build(Teuchos::RCP<Xpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&, unsigned long, bool) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1555EC8: MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Iterate(Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&, MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::ConvData, bool, int) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x1555BD2: MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Iterate(Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&, MueLu::Hierarchy<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::ConvData, bool, int) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x165D2F8: MueLu::TpetraOperator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::apply(Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const&, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>&, Teuchos::ETransp, double, double) const (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xFF84F8: Belos::LinearProblem<double, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>, Tpetra::Operator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::apply(Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const&, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>&) const (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xFFD34F: Belos::PseudoBlockGmresIter<double, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>, Tpetra::Operator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::iterate() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x100FEA5: Belos::PseudoBlockGmresSolMgr<double, Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>, Tpetra::Operator<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >::solve() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xEF64E1: sierra::nalu::TpetraLinearSolver::solve(Teuchos::RCP<Tpetra::Vector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> >, int&, double&) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x128DFBF: sierra::nalu::TpetraLinearSystem::solve(stk::mesh::FieldBase*) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x11442D2: sierra::nalu::EquationSystem::assemble_and_solve(stk::mesh::FieldBase*) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0x118EC7E: sierra::nalu::LowMachEquationSystem::solve_and_update() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xEDBBDC: sierra::nalu::EquationSystems::solve_and_update() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xEA6D68: sierra::nalu::Realm::advance_time_step() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xE6828F: sierra::nalu::TimeIntegrator::integrate_realm() (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==48197==    by 0xE61ACE: main (in /home/spdomin/gitHubWork/Nalu/build/naluX)
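
The report above pairs a read inside `UnpackArrayMultiColumn` with a heap allocation made through the `MultiVector(map, numVecs, bool)` constructor called from `MueLu::Hierarchy::Iterate`. One pattern that could produce such a pair, offered here only as an assumption and not as a diagnosis, is a buffer allocated without zero-filling whose entries are only partly written by an insert-style unpack. The C++ sketch below uses hypothetical names (`unpack_insert`, `untouched_entries`) and plain `std::vector` rather than Tpetra or Kokkos types; it lists the destination entries that an unpack never writes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an insert-style unpack kernel:
// copies imports[k] into dst[idx[k]] for every k.
inline void unpack_insert(std::vector<double>& dst,
                          const std::vector<double>& imports,
                          const std::vector<int>& idx) {
  if (imports.size() != idx.size()) {
    throw std::invalid_argument("imports and idx must have the same length");
  }
  for (std::size_t k = 0; k < idx.size(); ++k) {
    // at() throws std::out_of_range for indices outside dst.
    dst.at(static_cast<std::size_t>(idx[k])) = imports[k];
  }
}

// Returns the positions in a destination of length n that the unpack
// never writes. With a non-zeroed allocation, these entries would still
// hold whatever the allocator returned.
inline std::vector<int> untouched_entries(std::size_t n,
                                          const std::vector<int>& idx) {
  std::vector<bool> written(n, false);
  for (int i : idx) {
    written.at(static_cast<std::size_t>(i)) = true;
  }
  std::vector<int> result;
  for (std::size_t i = 0; i < n; ++i) {
    if (!written[i]) result.push_back(static_cast<int>(i));
  }
  return result;
}
```

For a destination of length 5 and indices {0, 2, 4}, `untouched_entries` returns {1, 3}. Zero-filling at allocation is one way to give those entries a defined value.
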
@jhux2
Member

jhux2 commented Aug 31, 2016

Stefan,

Do any of the NaluRTest simulations show similar Valgrind warnings? How big is the production run that you ran (DOFs, processes)?

Jonathan

@spdomin
Contributor Author

spdomin commented Aug 31, 2016

This is a tiny 750k-element, sixteen-core job that is not part of the regression test suite. However, my guess is that the performance/uqSlidingMesh case would demonstrate this issue as well (just make sure to reduce the termination time step to get faster turnaround).

However, the Valgrind results clearly show the problematic calls. Why not just fix the memory issues already documented? As noted, if a debug run would help, let me know.

@jhux2
Member

jhux2 commented Aug 31, 2016

> This is a tiny 750k element, sixteen core job that is not part of the regression test suite. However, my guess is that the performance/uqSlidingMesh case would demonstrate this issue as well (just make sure to reduce the termination time step to facilitate faster turn around.
>
> However, the valgrind results clearly show the problem code calls. Why not just fix the memory issues already documented? As noted, if a debug would help, let me know.

That's what I intend to do, but it's helpful if I can reproduce it locally.

@spdomin
Contributor Author

spdomin commented Aug 31, 2016

I am running the debug case now so that you have exact line numbers. Once I have that, I will send it off. If you provide a patch, I can apply and test.

I am running an older Trilinos code base with head Nalu and managed to get past this immediate impediment.

In the meantime, I will push this exact case to a common platform for you to test (private email).

@spdomin
Contributor Author

spdomin commented Sep 2, 2016

My debug case took forever and was eventually killed. We may just need to rely on the opt runs and on expertise in reading the code in question. I will try once more to see if a debug Valgrind run works. The previous run hung at the MueLu step.

Any news on your end?

@jhux2
Member

jhux2 commented Sep 2, 2016

> This is a tiny 750k element, sixteen core job that is not part of the regression test suite. However, my guess is that the performance/uqSlidingMesh case would demonstrate this issue as well (just make sure to reduce the termination time step to facilitate faster turn around.

About how long should this simulation or the uqSlidingMesh take to run?

@jhux2
Member

jhux2 commented Sep 3, 2016

Nothing to report yet. The simulations are really slow to run.

@spdomin
Contributor Author

spdomin commented Sep 4, 2016

You need to reduce the total number of time steps that the UQ performance tests run. The problem case I sent you takes two time steps.

@jhux2
Member

jhux2 commented Sep 6, 2016

I'm building Nalu (0da09d63) against Trilinos develop (15f990a). I'm seeing many Valgrind warnings in what appears to be the assembly process:

Transfer Review:         
=========================
Realm::initialize() Begin 
NonConformalInfo::search method not declared; will use BOOST_RTREE
NC Momentum options: dsFactor/robinStyle/upwind: 1 0 0
NC Continuity options: dsFactor/robinStyle: 1 0
NonConformalInfo::search method not declared; will use BOOST_RTREE
the post processing type is surface
the post processing file name: nalu_s1.dat
the post processing physics name: surface_force_and_moment
Target name(s): surface_7
Parameters used are: 0
Parameters used are: 0
Realm::ioBroker_->populate_mesh() Begin
Realm::ioBroker_->populate_mesh() End
Realm::ioBroker_->populate_field_data() Begin
Realm::ioBroker_->populate_field_data() End
Realm::create_output_mesh(): Begin
 Sorry, no field by the name turbulent_viscosity
Realm::create_output_mesh() End
 Volume  2299 min: 5.11857e-07 max: 0.138163
NonConformal alg will ghost a number of entities: 227649
EquationSystems::initialize(): Begin 
NaluMemory::EquationSystems::initialize(): myLowMach
Memory Overview: 
nalu memory: total (over all cores) current/high-water mark=       11.2384 G      11.4547 G
nalu memory:   min (over all cores) current/high-water mark=       692.383 M      729.383 M
nalu memory:   max (over all cores) current/high-water mark=       771.117 M      781.645 M
NaluMemory::EquationSystems::initialize(): MomentumEQS
Memory Overview: 
nalu memory: total (over all cores) current/high-water mark=       11.2409 G      11.4548 G
nalu memory:   min (over all cores) current/high-water mark=       692.426 M      729.383 M
nalu memory:   max (over all cores) current/high-water mark=       771.293 M      781.645 M

Here are some examples:

==5211== Conditional jump or move depends on uninitialised value(s)
==5211==    at 0x74A6706: stk::mesh::unpack_entity_info(stk::CommBuffer&, stk::mesh::BulkData const&, stk::mesh::EntityKey&, int&, std::vector<stk::mesh::Part*, std::allocator<stk::mesh::Part*> >&, std::vector<stk::mesh::Relation, std::allocator<stk::mesh::Relation> >&) (EntityCommDatabase.cpp:149)
==5211==    by 0x7211E04: stk::mesh::BulkData::unpack_not_owned_verify(stk::CommAll&, std::ostream&) (BulkData.cpp:6262)
==5211==    by 0x72142E2: stk::mesh::BulkData::comm_mesh_verify_parallel_consistency(std::ostream&) (BulkData.cpp:6729)
==5211==    by 0x720960F: stk::mesh::BulkData::check_mesh_consistency() (BulkData.cpp:4438)
==5211==    by 0x7458EB2: stk::mesh::impl::MeshModification::internal_modification_end(stk::mesh::impl::MeshModification::modification_optimization) (MeshModification.cpp:100)
==5211==    by 0x7458B20: stk::mesh::impl::MeshModification::modification_end() (MeshModification.cpp:46)
==5211==    by 0x3D21549: stk::mesh::BulkData::modification_end() (BulkData.hpp:262)
==5211==    by 0x402028C: sierra::nalu::NonConformalManager::manage_ghosting() (NonConformalManager.C:135)
==5211==    by 0x4020011: sierra::nalu::NonConformalManager::initialize() (NonConformalManager.C:99)
==5211==    by 0x3D3EEE2: sierra::nalu::Realm::initialize_non_conformal() (Realm.C:2236)
==5211==    by 0x3D34BC9: sierra::nalu::Realm::initialize() (Realm.C:484)
==5211==    by 0x3D31B84: sierra::nalu::Realms::initialize() (Realms.C:75)
==5211== 
==5211== Conditional jump or move depends on uninitialised value(s)
==5211==    at 0x3D9BA79: std::vector<int, std::allocator<int> >::resize(unsigned long) (stl_vector.h:666)
==5211==    by 0x7211ED4: stk::mesh::BulkData::unpack_not_owned_verify(stk::CommAll&, std::ostream&) (BulkData.cpp:6275)
==5211==    by 0x72142E2: stk::mesh::BulkData::comm_mesh_verify_parallel_consistency(std::ostream&) (BulkData.cpp:6729)
==5211==    by 0x720960F: stk::mesh::BulkData::check_mesh_consistency() (BulkData.cpp:4438)
==5211==    by 0x7458EB2: stk::mesh::impl::MeshModification::internal_modification_end(stk::mesh::impl::MeshModification::modification_optimization) (MeshModification.cpp:100)
==5211==    by 0x7458B20: stk::mesh::impl::MeshModification::modification_end() (MeshModification.cpp:46)
==5211==    by 0x3D21549: stk::mesh::BulkData::modification_end() (BulkData.hpp:262)
==5211==    by 0x402028C: sierra::nalu::NonConformalManager::manage_ghosting() (NonConformalManager.C:135)
==5211==    by 0x4020011: sierra::nalu::NonConformalManager::initialize() (NonConformalManager.C:99)
==5211==    by 0x3D3EEE2: sierra::nalu::Realm::initialize_non_conformal() (Realm.C:2236)
==5211==    by 0x3D34BC9: sierra::nalu::Realm::initialize() (Realm.C:484)
==5211==    by 0x3D31B84: sierra::nalu::Realms::initialize() (Realms.C:75)

and

==5211== Conditional jump or move depends on uninitialised value(s)
==5211==    at 0x41400FB: sierra::nalu::Quad3DSCS::parametric_distance(std::vector<double, std::allocator<double> > const&) (MasterElement.C:6623)
==5211==    by 0x413FADF: sierra::nalu::Quad3DSCS::isInElement(double const*, double const*, double*) (MasterElement.C:6564)
==5211==    by 0x4015BE7: sierra::nalu::NonConformalInfo::complete_search() (NonConformalInfo.C:417)
==5211==    by 0x402003D: sierra::nalu::NonConformalManager::initialize() (NonConformalManager.C:103)
==5211==    by 0x3D3EEE2: sierra::nalu::Realm::initialize_non_conformal() (Realm.C:2236)
==5211==    by 0x3D34BC9: sierra::nalu::Realm::initialize() (Realm.C:484)
==5211==    by 0x3D31B84: sierra::nalu::Realms::initialize() (Realms.C:75)
==5211==    by 0x3D04801: sierra::nalu::Simulation::initialize() (Simulation.C:144)
==5211==    by 0x3CF4AA0: main (nalu.C:162)

and

==5211== Use of uninitialised value of size 8
==5211==    at 0xB57E32D: MPIDI_CH3_EagerContigSend (in /usr/local/mpich2/1.4.1p1_gcc_4.4.7/lib/libmpich.so.3.3)
==5211==    by 0xB5E21D4: MPID_Send (in /usr/local/mpich2/1.4.1p1_gcc_4.4.7/lib/libmpich.so.3.3)
==5211==    by 0xB61BC0B: PMPI_Send (in /usr/local/mpich2/1.4.1p1_gcc_4.4.7/lib/libmpich.so.3.3)
==5211==    by 0x6AFEB0B: void Teuchos::(anonymous namespace)::sendImpl<unsigned long>(unsigned long const*, int, int, int, Teuchos::Comm<int> const&) (Teuchos_CommHelpers.cpp:800)
==5211==    by 0x6AEEFCC: void Teuchos::send<int, unsigned long>(unsigned long const*, int, int, int, Teuchos::Comm<int> const&) (Teuchos_CommHelpers.cpp:1590)
==5211==    by 0x5CDAB6D: void Tpetra::Distributor::doPosts<unsigned long>(Teuchos::ArrayRCP<unsigned long const> const&, unsigned long, Teuchos::ArrayRCP<unsigned long> const&) (Tpetra_Distributor.hpp:1374)
==5211==    by 0x5CD6C9D: void Tpetra::Distributor::doPostsAndWaits<unsigned long>(Teuchos::ArrayView<unsigned long const> const&, unsigned long, Teuchos::ArrayView<unsigned long> const&) (Tpetra_Distributor.hpp:1090)
==5211==    by 0x5CDE5E8: void Tpetra::Distributor::computeSends<long>(Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int const> const&, Teuchos::Array<long>&, Teuchos::Array<int>&) (Tpetra_Distributor.hpp:3025)
==5211==    by 0x5CD7D5F: void Tpetra::Distributor::createFromRecvs<long>(Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int const> const&, Teuchos::Array<long>&, Teuchos::Array<int>&) (Tpetra_Distributor.hpp:3081)
==5211==    by 0x5CD2295: Tpetra::Details::DistributedNoncontiguousDirectory<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getEntriesImpl(Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const (Tpetra_DirectoryImpl_def.hpp:991)
==5211==    by 0x5CCE032: Tpetra::Details::Directory<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getEntries(Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const (Tpetra_DirectoryImpl_def.hpp:104)
==5211==    by 0x5C8AFEA: Tpetra::Directory<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getDirectoryEntries(Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&) const (Tpetra_Directory_def.hpp:236)

and

==5212== Conditional jump or move depends on uninitialised value(s)
==5212==    at 0x5C6AB36: Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getLocalElement(long) const (Tpetra_Map_def.hpp:1039)
==5212==    by 0x5CD264D: Tpetra::Details::DistributedNoncontiguousDirectory<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getEntriesImpl(Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const (Tpetra_DirectoryImpl_def.hpp:1062)
==5212==    by 0x5CCE032: Tpetra::Details::Directory<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getEntries(Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const (Tpetra_DirectoryImpl_def.hpp:104)
==5212==    by 0x5C8AFEA: Tpetra::Directory<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getDirectoryEntries(Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&) const (Tpetra_Directory_def.hpp:236)
==5212==    by 0x5C6AF3B: Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getRemoteIndexList(Teuchos::ArrayView<long const> const&, Teuchos::ArrayView<int> const&) const (Tpetra_Map_def.hpp:1793)
==5212==    by 0x5CA1FFE: Tpetra::Export<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::setupSamePermuteExport(Teuchos::Array<long>&) (Tpetra_Export_def.hpp:511)
==5212==    by 0x5C9DC34: Tpetra::Export<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Export(Teuchos::RCP<Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&, Teuchos::RCP<Tpetra::Map<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&) (Tpetra_Export_def.hpp:97)
==5212==    by 0x437C176: sierra::nalu::TpetraLinearSystem::beginLinearSystemConstruction() (TpetraLinearSystem.C:357)
==5212==    by 0x437D436: sierra::nalu::TpetraLinearSystem::buildElemToNodeGraph(std::vector<stk::mesh::Part*, std::allocator<stk::mesh::Part*> > const&) (TpetraLinearSystem.C:479)
==5212==    by 0x423CAA1: sierra::nalu::AssembleMomentumElemSolverAlgorithm::initialize_connectivity() (AssembleMomentumElemSolverAlgorithm.C:103)
==5212==    by 0x41E6D02: sierra::nalu::SolverAlgorithmDriver::initialize_connectivity() (SolverAlgorithmDriver.C:65)
==5212==    by 0x403B9FA: sierra::nalu::MomentumEquationSystem::initialize() (LowMachEquationSystem.C:1755)

@jhux2
Member

jhux2 commented Sep 6, 2016

Now rebuilding Nalu against 930cce3, the Trilinos SHA1 that was the last merge with Sierra.

@jhux2
Member

jhux2 commented Sep 6, 2016

> Now rebuilding Nalu against 930cce3, the Trilinos SHA1 that was the last merge with Sierra.

@spdomin Running the small simulation (rotatingcube750k_2blks) you sent me, I'm seeing many Valgrind warnings in STK and elsewhere when I use Nalu 0da09d6 built with Trilinos 930cce3. The warnings occur during the assembly phase. I will put the full log on the CEE LAN in ~jhu/forSpdomin.

I am concerned that this is not the same version of Nalu and/or Trilinos that you are using. Can you confirm whether the versions are the same?

There may be an issue in the Tpetra Distributor, but thus far, I can't get to the errors there because of the prior valgrind warnings.

@jhux2
Member

jhux2 commented Sep 6, 2016

After reconfiguring Trilinos 930cce3 with Kokkos_ENABLE_Debug_Bounds_Check:BOOL=ON, the Nalu rotatingcube simulation without valgrind runs to completion.

@jhux2
Member

jhux2 commented Sep 6, 2016

> After reconfiguring Trilinos 930cce3 with Kokkos_ENABLE_Debug_Bounds_Check:BOOL=ON, the Nalu rotatingcube simulation without valgrind runs to completion.

My hope was to trigger an exception in the Tpetra Distributor, where the Valgrind logs indicate an uninitialized memory read (UMR).
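
A run that completes with bounds checking enabled does not by itself rule out a UMR: a bounds check rejects only indices outside the allocation and says nothing about whether an in-range entry was ever written. A minimal C++ sketch of that distinction follows, using a hypothetical `CheckedBuffer` class (not a Kokkos or Tpetra type) that throws on out-of-range access and, separately, tracks which entries have been written. The storage is zero-filled here only so that the sketch itself has defined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical buffer that separates two failure modes:
// out-of-range access (what a bounds check catches) and reading an
// entry that was never written (what a memory checker reports as a UMR).
class CheckedBuffer {
public:
  explicit CheckedBuffer(std::size_t n) : data_(n, 0.0), written_(n, false) {}

  void write(std::size_t i, double v) {
    check_bounds(i);
    data_[i] = v;
    written_[i] = true;
  }

  // Bounds-checked read only: succeeds for any in-range index,
  // even one that was never written.
  double read_bounds_checked(std::size_t i) const {
    check_bounds(i);
    return data_[i];
  }

  // Stricter read: also rejects entries that were never written.
  double read_init_checked(std::size_t i) const {
    check_bounds(i);
    if (!written_[i]) throw std::logic_error("read of unwritten entry");
    return data_[i];
  }

private:
  void check_bounds(std::size_t i) const {
    if (i >= data_.size()) throw std::out_of_range("index out of range");
  }

  std::vector<double> data_;
  std::vector<bool> written_;
};
```

In this sketch, reading an unwritten in-range entry passes `read_bounds_checked` and fails only `read_init_checked`, which mirrors why a bounds-checked build can run cleanly while Valgrind still reports uninitialized reads.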

@jhux2
Member

jhux2 commented Sep 8, 2016

Update: I rebuilt Trilinos development branch (696d3ef) and Nalu master (0da09d6) using the SEMS module environment: gcc 4.9.2, OpenMPI 1.10.1, boost 1.55, zlib 1.2.8, hdf5 1.8.12, netcdf 4.3.2. Additionally, @mhoemmen added bounds checking to Tpetra::MultiVector’s pack and unpack kernels where the valgrind error occurs (thanks @mhoemmen).

All of the prior STK valgrind errors are now gone. The simulation has been running on my workstation for about 17 hours, and has just finished the first momentum solve. So it may be a day or so before the simulation hits the problem area.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 8, 2016

I am using the latest GitHub Nalu version. The memory issues that I reported through the Sierra build are from a very old Nalu version under Sierra. However, since it provided line numbers, I thought this would be useful.

I looked at your memory check. Those issues in STK will be reported to the STK team.

Thanks

@jhux2
Copy link
Member

jhux2 commented Sep 8, 2016

Thanks for the information on Nalu. What branch of Trilinos are you using - a release, master, or develop? And what's the SHA1 of the branch?

@jhux2
Copy link
Member

jhux2 commented Sep 8, 2016

I looked at your memory check. Those issues in STK will be reported to the STK team.

By the way, I'm not confident at all that the valgrind errors for STK are real. Those showed up when I used TPLs that I rolled on my own. I haven't seen the errors (yet) since I rebuilt with the SEMS TPLs.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 8, 2016

I test Trilinos nightly and have noted the same kills. However, the specific case that I am running now uses the following:

commit 6fec5c0
Author: Nathan Ellingwood [email protected]
Date: Thu Aug 18 15:49:06 2016 -0600

In my case, I did not see the STK issues, so perhaps they are new? I will still report it to the STK team since there are real issues surrounding my usage of custom ghosting.

@jhux2
Copy link
Member

jhux2 commented Sep 8, 2016

What branch are you on: develop, master, or a release?

@spdomin
Copy link
Contributor Author

spdomin commented Sep 9, 2016

Trilinos master. I was instructed to avoid the develop location, although as you have seen from recent messages, I am considering switching my testing protocol to find issues faster.

Please call me today since a lot of the above conversation seems to indicate a personal connection is required.

@jhux2
Copy link
Member

jhux2 commented Sep 9, 2016

[from @jhux2] Update: I rebuilt Trilinos development branch (696d3ef) and Nalu master (0da09d6) using the SEMS module environment: gcc 4.9.2, OpenMPI 1.10.1, boost 1.55, zlib 1.2.8, hdf5 1.8.12, netcdf 4.3.2. Additionally, @mhoemmen added bounds checking to Tpetra::MultiVector’s pack and unpack kernels where the valgrind error occurs (thanks @mhoemmen).

All of the prior STK valgrind errors are now gone. The simulation has been running on my workstation for about 17 hours, and has just finished the first momentum solve. So it may be a day or so before the simulation hits the problem area.

The rotatingcube750k_2blks simulation on 16 cores finished last night. The error reported in UnpackArrayMultiColumn did not manifest. The only errors reported are related to OpenMPI, which if memory serves, has never been clean under valgrind. Note that I was using the Trilinos "develop" branch.

[from @spdomin] Trilinos master. I was instructed to avoid the develop location, although as you have seen from recent messages, I am considering switching my testing protocol to find issues faster.

This suggests to me that there was a problem in master that has since been fixed in develop. There are now just a few diffs between master and develop:

696d3ef Tpetra::MultiVector (un)pack kernels now do bounds checking
92622a5 ROL: Changed stochastic parameters in Stefan-Boltzmann.
5f866c2 Zoltan and Zoltan2 v12.8 release notes
728c43e zoltan2:  fixes for issue #578.  Now use target number of parts from solution (or comm size, if no solution is available) to comp
1c29cbc zoltan2:  skip work that isn't useful when a rank has zero objects.
20cb1ee zoltan2:  removed requirement that at least one lower or upper bound check be specified.  Also cleaned up lines that wrapped and 
b88b2a2 zoltan2:  removed conditions that returned NULL adapter if the local number of objects was zero.  Having zero local objects is an
a19727c zoltan2:  adding a test for issue #578
fa99ef9 zoltan2:  remove silent failures from test driver. Error messages are now printed by any failing rank (not only rank 0) so that e
7aa3d76 zoltan2:  whitespace changes -- shortened lines that wrapped and wrapped.
0cc483c  shylu/basker:   removing unsed file
8a26a57  shylu/basker   minor formatting ..
4072573 fixed little bug associated with large integers
22dd741 Tpetra - kernels- fixed missing omp flag
ddfc1d3 Tpetra Kernels: Fixed a kernel that fails with Intel 17 with O3 flag
ba6126b MueLu: Fix pull request #554 (yaml-cpp TPL)

@spdomin
Copy link
Contributor Author

spdomin commented Sep 12, 2016

I test against head. If this is fixed in develop then when will head be updated? Also, what is the policy when a bug is found in master? Does an update occur ASAP?

Also, did you start testing off of develop straight out?

@jhux2
Copy link
Member

jhux2 commented Sep 12, 2016

I test against head. If this is fixed in develop then when will head be updated?

@spdomin It looks like master was updated to be synchronized with develop since my last post. @bmpersc Could you verify whether this is correct?

Also, what is the policy when a bug is found in master? Does an update occur ASAP?

If there is a serious bug in master (which I'd say this one is), we would try to patch master asap.

Also, did you start testing off of develop straight out?

Yes. It wasn't clear to me which branch you were using. In the past you've used the Trilinos development version, so I assumed that's what you were using this time.

@bmpersc
Copy link
Contributor

bmpersc commented Sep 12, 2016

Yes I updated the master branch from develop this morning.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 12, 2016

Okay, I was lax in my usage of HEAD... In that last message, I should have said master... Sorry, I will be more precise next time.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 13, 2016

I have updated my Nalu executable using the following Trilinos SHA-1 from master:

commit d96723d
Author: Edward G. Phillips [email protected]
Date: Mon Sep 12 13:14:33 2016 -0600

When I run the cube simulation with the following memory diagnostic:

realms:
  - name: realm_1
    activate_memory_diagnostic: yes

I still see a slight increase in max memory as a function of simulation time step:

Memory Overview at step 1 and 45:

nalu memory: total (over all cores) current/high-water mark= 5.7514 G 6.14105 G
nalu memory: total (over all cores) current/high-water mark= 7.22063 G 8.06703 G
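
To put the two report lines above in relative terms, here is a minimal C++ sketch that computes the fractional growth between step 1 and step 45. The helper name `relative_growth` is hypothetical (it is not part of Nalu), and the inputs are simply the four numbers copied from the report, assumed to be in the same units.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of Nalu): fractional growth from `first` to `last`.
double relative_growth(double first, double last) {
  assert(first > 0.0);
  return (last - first) / first;
}

// Values copied from the two "nalu memory" lines above (step 1, then step 45).
const double current_growth    = relative_growth(5.7514, 7.22063);
const double high_water_growth = relative_growth(6.14105, 8.06703);
```

With these inputs, current usage grows by roughly 25% and the high-water mark by roughly 31% over the 44 intervening steps.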

I will let this case run out to see if it fails as before.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 13, 2016

This is really depressing. My simulation died on the same code path as before: in the MueLu solve at step 492, using the above Trilinos master SHA-1.

@jhux2 can you try running your case out without memory checking to see if it also dies at this step?

@spdomin Sure... I am not holding out hope that memory checking causes this, given that my production run died at the same step without this diagnostic active. However, I will report back in the morning.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 14, 2016

With the Trilinos SHA-1:

d96723df90660185a9d117eaa7788b085c9afd2d

And the Trilinos install as follows:

/gitHubWork/scratch_build/install/gcc4.8.5/Trilinos_stable_release/include/MueLu_XpetraOperator_decl.hpp

With the Nalu config as follows:

nalu_install_dir=/gitHubWork/scratch_build/install/gcc4.8.5
trilinos_install_dir=$nalu_install_dir/Trilinos_stable_release

I am still seeing the following memory errors:

==14974== Conditional jump or move depends on uninitialised value(s)
==14974== at 0x1EF08A1: void Kokkos::parallel_for<Tpetra::KokkosRefactor::Details::UnpackArrayMultiColumn<Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void>, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Tpetra::KokkosRefactor::Details::InsertOp> >(unsigned long, Tpetra::KokkosRefactor::Details::UnpackArrayMultiColumn<Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void>, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Tpetra::KokkosRefactor::Details::InsertOp> const&, std::string const&) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==14974== by 0x1F0C787: void Tpetra::KokkosRefactor::Details::unpack_array_multi_column<Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void>, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Tpetra::KokkosRefactor::Details::InsertOp>(Kokkos::Experimental::View<double**, Kokkos::LayoutLeft, Kokkos::Serial, void> const&, Kokkos::Experimental::View<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::Experimental::View<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Tpetra::KokkosRefactor::Details::InsertOp const&, unsigned long, bool) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==14974== by 0x1FB0A7C: Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::unpackAndCombineNew(Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<double const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<unsigned long const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, unsigned long, Tpetra::Distributor&, Tpetra::CombineMode) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==14974== by 0x219348E: Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::doTransferNew(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Tpetra::Distributor&, Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::ReverseOption, bool) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==14974== by 0x2190A92: Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::doTransfer(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Teuchos::ArrayView const&, Teuchos::ArrayView const&, Teuchos::ArrayView const&, Teuchos::ArrayView const&, Tpetra::Distributor&, Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::ReverseOption) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==14974== by 0x218F6CB: Tpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::doImport(Tpetra::SrcDistObject const&, Tpetra::Import<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Tpetra::CombineMode) (in /home/spdomin/gitHubWork/Nalu/build/naluX)
==14974== by 0xF59059: Xpetra::TpetraMultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::doImport(Xpetra::DistObject<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::Import<int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Xpetra::CombineMode) (in /home/spdomin/gitHubWork/Nalu/build/naluX)

@mhoemmen
Copy link
Contributor

mhoemmen commented Sep 14, 2016

@spdomin Would you consider trying a debug build of Trilinos?

-D CMAKE_BUILD_TYPE:STRING=DEBUG
-D Teuchos_ENABLE_DEBUG:BOOL=ON
-D Kokkos_ENABLE_Debug:BOOL=ON

This turns on bounds checking for both Kokkos::View and Teuchos::Array*, as well as other debug-mode checks. Tpetra in particular does extra checking specifically on the pack and unpack kernels when you set these options. These checks have a run-time cost, but are cheaper than running valgrind. (They should still be compatible with valgrind, but they make valgrind unnecessary for common errors.)
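
To illustrate what bounds checking in an unpack kernel buys, here is a minimal, self-contained C++ sketch. It is not Tpetra's kernel: the function name `unpack_insert_checked`, the plain `std::vector` storage, and the assumed buffer layout (each import index owns `numCols` consecutive values) are all assumptions made for illustration. The point is only that a bad index turns into an exception at the offending access instead of a silent out-of-range write.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative only: NOT Tpetra's kernel. Unpacks a received buffer into a
// column-major (LayoutLeft-style) array with "insert" (overwrite) semantics,
// checking every index before it is used.
// Assumed buffer layout: entry k of importLIDs owns numCols consecutive values.
void unpack_insert_checked(std::vector<double>& dst, std::size_t numRows,
                           std::size_t numCols,
                           const std::vector<double>& imports,
                           const std::vector<int>& importLIDs) {
  if (dst.size() != numRows * numCols)
    throw std::invalid_argument("dst size does not match numRows * numCols");
  if (imports.size() != importLIDs.size() * numCols)
    throw std::invalid_argument("import buffer has the wrong length");
  for (std::size_t k = 0; k < importLIDs.size(); ++k) {
    const int lid = importLIDs[k];
    if (lid < 0 || static_cast<std::size_t>(lid) >= numRows)
      throw std::out_of_range("import LID is outside the target's rows");
    for (std::size_t j = 0; j < numCols; ++j)
      dst[static_cast<std::size_t>(lid) + j * numRows] = imports[k * numCols + j];
  }
}
```

Calling it with a 3x2 target and an import index of 3 throws `std::out_of_range`, which is the kind of early, located failure a debug build aims for.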

@mhoemmen
Copy link
Contributor

Once we fix this, we'll write a retrospective summary. Per discussion at Trilinos Leaders' Meeting today, we've listed this summary-writing activity as a separate issue, namely #632.

@mhoemmen
Copy link
Contributor

@jhux2 I posted an analysis as a comment to #631. Those Valgrind errors are a harmless side effect of the way Kokkos implements atomic_assign.

@jhux2
Copy link
Member

jhux2 commented Sep 20, 2016

@jhux2 I posted an analysis as a comment to #631. Those Valgrind errors are a harmless side effect of the way Kokkos implements atomic_assign.

I will modify MueLu to initialize the target vectors before import.

@mhoemmen
Copy link
Contributor

I will modify MueLu to initialize the target vectors before import.

This may make the Valgrind errors go away, but I'm not convinced that this is the issue. It's worth crossing off our list, though!

jhux2 added a commit that referenced this issue Sep 20, 2016
This is related to github issues #594 and #294.  Valgrind is issuing
spurious warnings if the target multivector of a doImport/doExport call
is uninitialized in MueLu::Hierarchy.  This commit initializes the vectors
and eliminates the warnings.
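
The idea behind this change can be sketched in plain C++. The names below (`make_zeroed_target`, `import_insert`) are hypothetical and this is not the MueLu or Tpetra code; it only shows the general pattern of giving every entry of the target a defined value before an import overwrites some of them, so that a memory checker has nothing uninitialized to report.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative only: hypothetical names, not MueLu or Tpetra code.
// `new double[n]` would leave the entries indeterminate; `new double[n]()`
// value-initializes them to 0.0, so every entry is defined before the import.
std::unique_ptr<double[]> make_zeroed_target(std::size_t n) {
  return std::unique_ptr<double[]>(new double[n]());
}

// Stand-in for an import: overwrites only the entries named in importLIDs.
// The caller must ensure every index is within the target's length.
void import_insert(double* target, const std::vector<double>& imports,
                   const std::vector<std::size_t>& importLIDs) {
  for (std::size_t k = 0; k < importLIDs.size(); ++k)
    target[importLIDs[k]] = imports[k];
}
```

Entries the import does not touch then read back as 0.0 rather than as indeterminate values.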

Build/Test Cases Summary
Enabled Packages: MueLu
Enabled all Forward Packages
0) MPI_DEBUG => Skipped configure, build, test due to no enabled packages! => Does not affect push readiness! (-1.00 min)
1) SERIAL_RELEASE => Skipped configure, build, test due to no enabled packages! => Does not affect push readiness! (-1.00 min)
2) MPI_SS_DEBUG => passed: passed=138,notpassed=0 (44.31 min)
3) SERIAL_SS_RELEASE => passed: passed=50,notpassed=0 (1.48 min)
@jhux2
Copy link
Member

jhux2 commented Sep 20, 2016

@spdomin I have pushed commit 10824a5, which should fix the suspicious memory reports. However, based on @mhoemmen's analysis in #631, it's unclear whether this will resolve the production run failure at step 490.

@jhux2
Copy link
Member

jhux2 commented Sep 20, 2016

Closing this bug report. Please reopen if necessary.

@jhux2 jhux2 closed this as completed Sep 20, 2016
@spdomin
Copy link
Contributor Author

spdomin commented Sep 20, 2016

Generally, it would seem that we verify a change fixed an issue before we close, right?

I will build and retest and report back tomorrow. That is, unless you have run my production cube case and know that it now runs?

@spdomin
Copy link
Contributor Author

spdomin commented Sep 20, 2016

I verified that my runs are free of the memory issues. I am testing the production run now.

Also, I had a build error in develop:

/home/spdomin/gitHubWork/Nalu/src/TpetraLinearSystem.C: In member function ‘virtual void sierra::nalu::TpetraLinearSystem::finalizeLinearSystem()’:
/home/spdomin/gitHubWork/Nalu/src/TpetraLinearSystem.C:850:49: error: ‘rcp’ is not a member of ‘Tpetra’
const Teuchos::RCP<LinSys::Comm> tpetraComm = Tpetra::rcp(new LinSys::Comm(bulkData.parallel()));

which I fixed by using Teuchos::rcp. Is that something new today in develop?

@jhux2
Copy link
Member

jhux2 commented Sep 20, 2016

@spdomin Thanks for the update. rcp has always been part of the Teuchos namespace, so I'm not sure how line 850 ever compiled.

@mhoemmen
Copy link
Contributor

rcp has always been part of the Teuchos namespace, so I'm not sure how line 850 ever compiled.

I made some changes last week to improve build time, by not importing so much stuff into the Tpetra namespace. Many (but weirdly, not all) Teuchos classes were getting imported into the Tpetra namespace. I got rid of those imports.

@spdomin
Copy link
Contributor Author

spdomin commented Sep 21, 2016

Right. Build issues are over. I have the production test running and will report back later. Thanks.

@bartlettroscoe
Copy link
Member

Many (but weirdly, not all) Teuchos classes were getting imported into the Tpetra namespace. I got rid of those imports.

It is perfectly safe (and often beneficial) to import class names from one namespace to another. However, it is never safe to import non-member function names. See Appendix D.2 "Amendments for ’using’ declarations and directives":

Why constantly write Teuchos::RCP<ClassName> in all of your package's code when you can just write RCP<ClassName>? But the same does not apply for the non-member function rcp(). See the warning in Item 10 in the "C++ Coding Standards" book.
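
The distinction can be sketched with two hypothetical namespaces, `lib` (standing in for a utility library such as Teuchos) and `pkg` (standing in for a client package). None of these names come from Trilinos; `Ptr` and `make_ptr` are stand-ins built on `std::shared_ptr`. The class-template name is imported with a using-declaration, while the non-member factory function is always called with its namespace qualifier.

```cpp
#include <cassert>
#include <memory>

namespace lib {  // hypothetical stand-in for a utility library
template <class T>
using Ptr = std::shared_ptr<T>;  // stand-in for a smart-pointer class

template <class T>
Ptr<T> make_ptr(T* raw) {        // stand-in for a non-member factory function
  return Ptr<T>(raw);
}
}  // namespace lib

namespace pkg {  // hypothetical stand-in for a client package
using lib::Ptr;  // class-template name imported: pkg code can write Ptr<T>
// Deliberately no `using lib::make_ptr;`: the function stays qualified.

inline Ptr<int> make_answer() {
  return lib::make_ptr(new int(42));
}
}  // namespace pkg
```

In this sketch, writing `pkg::make_ptr(...)` would fail to compile because `make_ptr` was never imported into `pkg`, which is the same kind of error as the `'rcp' is not a member of 'Tpetra'` message reported earlier in the thread.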

@mhoemmen
Copy link
Contributor

@bartlettroscoe My eventual goal is to stop including all those Teuchos header files in Tpetra_ConfigDefs.hpp, since more includes means longer build times. In order for Tpetra_ConfigDefs.hpp not to include so many Teuchos header files, I first had to get rid of the "using Teuchos::$THING" declarations in Tpetra_ConfigDefs.hpp. I could put those "using" declarations elsewhere, but it's not clear where they should go in order to avoid duplication. I would really rather just not have them. This only affects Tpetra developers; users shouldn't need to care.

@ambrad
Copy link
Contributor

ambrad commented Sep 26, 2016

@jhux2, @mhoemmen: @spdomin writes in #632 that Nalu is still failing at the same step. Therefore, I'm reopening this issue.

@ambrad ambrad reopened this Sep 26, 2016
@jhux2
Copy link
Member

jhux2 commented Sep 26, 2016

@spdomin How is the simulation failing? Can you tell if the simulation is running out of memory?

@rppawlo
Copy link
Contributor

rppawlo commented Sep 28, 2016

@ambrad thought this might be related to a Drekar issue we are seeing. Looking back, about the same time @spdomin started seeing this issue, Drekar started to have solver issues too. It manifested as growth in memory over multiple solves. @pwxy reported the following today in his efforts to track this down:

Here is more information concerning the memory growth issue during the Newton steps when TFQMR is used on the drekar 3D MHD generator test case (70x10x10 elements run on a single MPI process). drekar is built with OpenMPI 1.8.8 and gnu 4.9.3

When I switch from tpetra to epetra/aztec/ML/ifpack ILU, for a 3-level ML after 10 Newton steps, the memory hardly grows, increasing by only 0.7% over the end of the first Newton step.

If I stay with epetra/aztec/ML, but swap ifpack ILU for ML SGS smoother, for a 3-level ML after 10 Newton steps, the memory doesn't grow compared to the end of the first Newton step.

If I stay with epetra/aztec/ML, and keep ifpack ILU smoother, but force ML to have only one level (so a 1-level additive schwarz domain decomposition preconditioner), after 10 Newton steps, the memory hardly grows, increasing by only 0.5% over the end of the first Newton step.

So epetra/aztec/ML behaves the way one would expect in terms of memory change during Newton steps with TFQMR krylov solver (just a tiny memory growth---less than 1%)

However, tpetra/belos/muelu/ifpack2 RILUK is a different story

If we run tpetra/belos/muelu/ifpack2 SGS smoother (not RILUK), and we force muelu to have only 1-level (so a 1-level additive schwarz domain decomposition preconditioner), then everything is hunky-dory --- after 10 newton steps of TFQMR the memory does not grow, as one would expect.

However, for tpetra/belos/muelu/ifpack2 RILUK, again forcing muelu to have only 1-level (so a 1-level additive schwarz domain decomposition preconditioner), the memory increases by 17% over the end of the first Newton step. Not good.

For tpetra/belos/muelu/ifpack2 SGS (not RILUK), for a 3-level MueLu after 10 Newton steps, the memory increases by 4.4% over the end of the first Newton step. Compare with the epetra/aztec/ML with ML SGS smoother where there was no memory growth.

For tpetra/belos/muelu/ifpack2 RILUK, for a 3-level MueLu after 10 Newton steps, the memory increases by 25% over the end of the first Newton step. Not good. Compare this with epetra/aztec/ML where the memory grew by only 0.7%

This limited data points to a potential memory issue with ifpack2 RILUK. However, MueLu with ifpack2 SGS has larger memory growth than it should.

Thanks
-paul

@trilinos/panzer @eric-c-cyr

@rppawlo
Copy link
Contributor

rppawlo commented Sep 28, 2016

I believe the issue above is being tracked under #558

@ambrad is looking into this now. Thanks Andrew!

@ambrad
Copy link
Contributor

ambrad commented Sep 28, 2016

To clarify, #558 is a separate issue, at least in symptom. I'm looking into both simultaneously because I suspect similar analyses of the same examples may yield the solution to both problems, whether or not they are related.

@pwxy
Copy link

pwxy commented Sep 28, 2016

#558 is a separate issue. Nalu doesn't use the ifpack2 AdditiveSchwarz with overlap.

@pwxy
Copy link

pwxy commented Sep 28, 2016

I think the memory growth issue with tpetra/belos/muelu/ifpack2 RILUK that I'm seeing with drekar is different from the issue in nalu. The former issue exists in the February 2016 trilinos code base.

@ambrad
Copy link
Contributor

ambrad commented Sep 29, 2016

I'm almost ready to push the fix to #558 and can confirm that issue #558 is definitely unrelated to the two memory-growth issues reported here.

@github-actions
Copy link

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jan 17, 2021
@github-actions
Copy link

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Feb 17, 2021