Kokkos: TeamPolicy<Cuda> team size is too large #8727

Closed
jewatkins opened this issue Feb 10, 2021 · 22 comments

@jewatkins
Contributor

Bug Report

@trilinos/kokkos
I will update the issue with other packages if we're able to narrow it down.

Description

A couple of performance tests are failing in Albany in our GPU runs on weaver:
https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=12122

Here is the error message:

 Kokkos::TeamPolicy< Cuda > the team size is too large. Team size x vector length must be smaller than 1024.
 Traceback functionality not available

and the relevant issue in Albany is here: sandialabs/Albany#661

I've narrowed down the issue to where Albany master does not work with kokkos-promotion-3.3: e4f6782
but it works with the previous commit: 7b95b7c

Albany doesn't currently use TeamPolicy so my guess is that it is a trilinos package causing the issue. Unfortunately, I wasn't able to narrow down the package or kernel with kokkos tools. @ndellingwood do you have any ideas of what might have caused this or maybe you know of a way I could narrow it down?

Steps to Reproduce

  1. SHA1: e4f6782
  2. Weaver modules: https://github.com/SNLComputation/Albany/blob/master/doc/dashboards/weaver.sandia.gov/weaver_modules_cuda.sh
  3. Trilinos configure script: https://github.com/SNLComputation/Albany/blob/master/doc/dashboards/weaver.sandia.gov/do-cmake-weaver-trilinos
  4. Albany configure script: can provide if needed
  5. Input deck: can provide if needed

jewatkins added the type: bug (The primary issue is a bug in Trilinos code or tests) label Feb 10, 2021

@jewatkins
Contributor Author

Also, this seems to happen only for the larger cases. The nightly tests don't seem to catch the issue and there are some smaller cases which work fine.

@jhux2
Member

jhux2 commented Feb 10, 2021

@trilinos/kokkos @crtrott @DavidPoliakoff

@ccober6
Contributor

ccober6 commented Feb 16, 2021

@crtrott @DavidPoliakoff Do you know the status of this? I had a question about it at today's SART meeting. Thanks, Curt

@DavidPoliakoff

Hey @ccober6, I've not yet taken a look at this, apologies. I'll try to free up some time for it tomorrow.

@ccober6
Contributor

ccober6 commented Feb 16, 2021

Great! @ikalash was asking about the status as Albany performance tests are dependent on it.

@alanw0
Contributor

alanw0 commented Feb 18, 2021

Tagging @sayerhs @Tasmit @overfelt also.

@DavidPoliakoff

Sorry, I missed this yesterday, just getting a Trilinos build going. Could somebody email me the Albany configure details? Unless we have a Trilinos reproducer

@DavidPoliakoff

Also, what's nvcc_wrapper_volta?

@jewatkins
Contributor Author

> Sorry, I missed this yesterday, just getting a Trilinos build going. Could somebody email me the Albany configure details? Unless we have a Trilinos reproducer

I put everything on weaver here: /home/projects/albany/forDavid
Let me know if you're having issues accessing. Edit do-cmake-albany-cuda-sfad4 to point to your trilinos install dir and the submission scripts need to be edited to point to the albany build directory. Run submit_populate.sh first, that should work fine. submit.sh should give the failure.

@jewatkins
Contributor Author

> Also, what's nvcc_wrapper_volta?

We have our own nvcc_wrapper here: https://github.com/SNLComputation/Albany/blob/master/doc/dashboards/weaver.sandia.gov/nvcc_wrapper_volta
which we update when needed. That's sometimes led to issues... Maybe I should check if it's up-to-date.

@DavidPoliakoff

@jewatkins : thanks! Just got the Trilinos build going, will try the rest shortly. Didn't look like nvcc_wrapper_volta caused any grief

@DavidPoliakoff

Hey all, sorry for the delay. I modified Kokkos to show which kernel the error comes from (in a hacky way; we need a better one). Briefly: every time we run a kernel, I save off its name, and Kokkos now prints that name before the too-large-team-size error. The error I'm getting says:

Ran into a team size problem with kernel KokkosKernels::Common::FillReverseMap
Kokkos::TeamPolicy< Cuda > the team size is too large. Team size x vector length must be smaller than 1024.

I'm game to dig into that kernel, but there are much wiser minds than mine on the Kernels team. We should decide whether I dig in or somebody from Kernels does.
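
For readers who want to see the shape of the debugging hack described in this comment, a minimal Kokkos-free sketch follows. Everything in the `sketch` namespace is hypothetical (it is not the patch that was applied to Kokkos), and the limit of 1024 is simply taken from the error message. Note that what gets remembered is the most recently launched kernel, so the printed name is a hint about where the failure sits rather than a guarantee that this kernel is the culprit.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical global: label of the most recently launched kernel.
inline std::string& last_kernel_name() {
  static std::string name = "<none>";
  return name;
}

// Stand-in for a kernel launch: record the label, then run the body.
template <class Functor>
void launch(const std::string& label, Functor&& body) {
  last_kernel_name() = label;
  std::forward<Functor>(body)();
}

// Stand-in for the team-size check. Whether exactly 1024 is accepted
// is an assumption of this sketch.
inline void check_team_size(int team_size, int vector_length) {
  assert(team_size > 0 && vector_length > 0);
  constexpr int max_threads = 1024;
  if (team_size * vector_length > max_threads) {
    // Prepend the saved kernel name to the usual message.
    throw std::runtime_error(
        "Ran into a team size problem with kernel " + last_kernel_name() +
        "\nKokkos::TeamPolicy< Cuda > the team size is too large. "
        "Team size x vector length must be smaller than 1024.");
  }
}

}  // namespace sketch
```

Calling `sketch::launch("KokkosKernels::Common::FillReverseMap", [] {})` and then `sketch::check_team_size(2048, 1)` throws an exception whose message begins with the saved kernel name, mirroring the two-line output quoted in the comment.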

@sayerhs

sayerhs commented Feb 19, 2021

@lucbv @srajama1 @jhux2 Hope you guys are following this discussion. This is affecting ExaWind/Nalu-Wind also.

Relevant parts of stacktrace:

MPT: #12 0x00000000005f4618 in Kokkos::Impl::throw_runtime_exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold.23] ()
MPT: #13 0x0000000004c358bc in Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Cuda>::TeamPolicyInternal(int, int, int) ()
MPT: #14 0x00000000056a0826 in void KokkosKernels::Impl::sort_crs_graph<Kokkos::Cuda, Kokkos::View<int*, Kokkos::CudaUVMSpace>, Kokkos::View<int*, Kokkos::CudaUVMSpace> >(Kokkos::View<int*, Kokkos::CudaUVMSpace> const&, Kokkos::View<int*, Kokkos::CudaUVMSpace> const&) ()
MPT: #15 0x0000000005a8f625 in KokkosSparse::Impl::PointGaussSeidel<KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int const, double const, Kokkos::Cuda, Kokkos::CudaUVMSpace, Kokkos::CudaUVMSpace>, Kokkos::View<unsigned long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double*, Kokkos::CudaUVMSpace> >::initialize_symbolic() ()
MPT: #16 0x0000000005a9bd91 in KokkosSparse::Impl::GAUSS_SEIDEL_SYMBOLIC<KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int const, double const, Kokkos::Cuda, Kokkos::CudaUVMSpace, Kokkos::CudaUVMSpace>, Kokkos::View<unsigned long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<1u> >, false, true>::gauss_seidel_symbolic(KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int const, double const, Kokkos::Cuda, Kokkos::CudaUVMSpace, Kokkos::CudaUVMSpace>*, int, int, Kokkos::View<unsigned long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<int const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<1u> >, bool) ()
MPT: #17 0x00000000029af516 in void KokkosSparse::Experimental::gauss_seidel_symbolic<KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int, double, Kokkos::Cuda, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Kokkos::View<unsigned long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<0u> > >(KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int, double, Kokkos::Cuda, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >*, KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int, double, Kokkos::Cuda, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::const_nnz_lno_t, KokkosKernels::Experimental::KokkosKernelsHandle<unsigned long const, int, double, Kokkos::Cuda, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::const_nnz_lno_t, Kokkos::View<unsigned long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<0u> >, bool) ()
MPT: #18 0x00000000029d511d in Ifpack2::Relaxation<Tpetra::RowMatrix<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >::initialize() ()
MPT: #19 0x000000000090a9e6 in sierra::nalu::TpetraLinearSolver::setupLinearSolver(Teuchos::RCP<Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >, Teuchos::RCP<Tpetra::CrsMatrix<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >, Teuchos::RCP<Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >, Teuchos::RCP<Tpetra::MultiVector<double, int, long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >) ()
MPT:     at /projects/hfm/exawind/nalu-wind-testing/nalu-wind/src/LinearSolver.C:109
MPT: #20 0x0000000000ad4bb2 in sierra::nalu::TpetraLinearSystem::finalizeLinearSystem() ()
MPT:     at /projects/hfm/exawind/nalu-wind-testing/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.4.0/trilinos-develop-bqhbzfd5da5bh72ga6pqw533jterltch/include/Teuchos_RCPNode.hpp:216
MPT: #21 0x0000000000955df3 in sierra::nalu::MomentumEquationSystem::initialize()
MPT:     ()
MPT:     at /projects/hfm/exawind/nalu-wind-testing/nalu-wind/src/LowMachEquationSystem.C:2471

@jewatkins
Contributor Author

@sayerhs that stacktrace confirms my suspicions that it might be MTGS in ifpack2. Do you mind sharing how you were able to get it?

@sayerhs

sayerhs commented Feb 19, 2021

@jewatkins You can look at the full build notes here. I don't think we are doing anything special other than building with RelWithDebInfo. I am cc-ing @jrood-nrel who maintains nalu-wind nightly testing in case I misspoke.

Also, unlike what @jewatkins mentioned, we are hitting this in our nightly tests. Interestingly, this is happening on one of the smaller problems, whereas the larger problems seem to be working fine.

@jewatkins
Contributor Author

> @jewatkins You can look at the full build notes here. I don't think we are doing anything special other than building with RelWithDebInfo. I am cc-ing @jrood-nrel who maintains nalu-wind nightly testing in case I misspoke.

Interesting, it looks like it might have something to do with MPT. I'll try RelWithDebInfo but maybe it's not reproducible on weaver. Thanks.

@jrood-nrel

MPT just puts a prefix on output from ranks. I don't think our case has anything unique to MPT.

@jewatkins
Contributor Author

FYI, we just ran into this issue as well in sparc when trying MTGS

@brian-kelley
Contributor

Sorry everyone, this bug got fixed in KokkosKernels develop but missed the 3.3.1 promotion by a few days. I constructed a fake TeamPolicy (not to execute anything, only to ask the Kokkos heuristics for the correct team size) with too many threads, and the TeamPolicy constructor is where the team size gets checked.

I'm patching in kokkos/kokkos-kernels#872 to Trilinos now.
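
A minimal sketch of the failure mode described in this comment, using a mock rather than the real Kokkos classes. `MockTeamPolicy`, `recommended_team_size`, the value 256, and both `query_team_size_*` helpers are hypothetical names and numbers invented for illustration. The "safe" variant is only one plausible way to avoid the pattern and is not claimed to be what kokkos/kokkos-kernels#872 changed.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>

namespace sketch {

// Limit quoted in the error message; treating exactly 1024 as legal
// is an assumption of this sketch.
constexpr int kMaxThreadsPerTeam = 1024;

// Hypothetical mock: like the policy in the thread, it validates the
// requested size in its constructor, before anything is launched.
class MockTeamPolicy {
 public:
  MockTeamPolicy(int league_size, int team_size, int vector_length = 1)
      : league_size_(league_size),
        team_size_(team_size),
        vector_length_(vector_length) {
    assert(league_size > 0 && team_size > 0 && vector_length > 0);
    if (team_size * vector_length > kMaxThreadsPerTeam) {
      throw std::runtime_error("MockTeamPolicy: the team size is too large");
    }
  }

  int league_size() const { return league_size_; }
  int team_size() const { return team_size_; }

  // Stand-in for a heuristic query; 256 is an arbitrary placeholder.
  int recommended_team_size() const {
    return std::min(256, kMaxThreadsPerTeam / vector_length_);
  }

 private:
  int league_size_;
  int team_size_;
  int vector_length_;
};

// Problematic pattern: the throwaway "probe" policy is built with the
// requested size, so a large request throws before the query happens.
inline int query_team_size_buggy(int requested) {
  MockTeamPolicy probe(1, requested);
  return std::min(requested, probe.recommended_team_size());
}

// Safer pattern: build the probe with the smallest legal team size,
// then clamp the request to what the heuristic recommends.
inline int query_team_size_safe(int requested) {
  MockTeamPolicy probe(1, 1);
  return std::min(requested, probe.recommended_team_size());
}

}  // namespace sketch
```

With these definitions, both helpers agree for small requests such as 8, while a request of 4096 throws in the buggy variant and is clamped to 256 in the safe one. That input dependence would be consistent with the reports that only some problem sizes fail, although that link is an assumption here, not something the thread confirms.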

brian-kelley added a commit to brian-kelley/Trilinos that referenced this issue Feb 19, 2021
(fix trilinos#8727, TeamPolicy team size too large in sort_crs_*)
@DavidPoliakoff

Just confirming: we're seeing the same stacktrace, so fixing one fixes both.

brian-kelley added a commit to brian-kelley/Trilinos that referenced this issue Feb 19, 2021
(fix trilinos#8727, TeamPolicy team size too large in sort_crs_*)
Adds the KokkosKernels unit test that replicated this issue.
@brian-kelley brian-kelley self-assigned this Feb 19, 2021
@brian-kelley
Contributor

@jewatkins @sayerhs OK, the patch was merged into develop. Can you verify that it fixed the issue?

@jewatkins
Contributor Author

I've been testing this manually and it works so I'm okay with closing this. Thanks Brian and all!

csiefer2 pushed a commit that referenced this issue Feb 26, 2021
(fix #8727, TeamPolicy team size too large in sort_crs_*)
Adds the KokkosKernels unit test that replicated this issue.
kddevin added a commit that referenced this issue Apr 27, 2021
…ement (#8821)

* Tpetra: add new user-friendly MV view access

Also add new "owningView_" DualView member that refers to
the actual original DV (not a subview of anything else). This
is the DualView to sync in order to maintain consistency regardless
of how MultiVectors alias each other.

4 new view accessor functions: getLocalView[Host|Device][Non]Const()

- Respect constness
- Manage syncs and modifies for the user
- Prevent taking out a view in one space while any view in the other
space is live.
- Existing getLocalView()/getLocalViewHost()/getLocalViewDevice() just
have the reference count checking added (no sync/modify). This has no
effect for HostSpace or CudaUVMSpace since those host mirrors match the
device views.

* Tpetra - fix MV test 14.

* Tpetra - fix item 17

* Tpetra - fix item 20

* Tpetra - fix item 23

* Tpetra - fix item 28

* Tpetra - fix item 29

* Tpetra - fix item 35

* Tpetra - workaround for item 30

* Tpetra: Modifying Bug7758 test to use the new getLocalViewHostConst (which will make sure things are actually sync'd)

* Tpetra: fix MV [un]pack to respect host/device refcounts

* fix nonconst in Bug7745

* Tpetra: stashing

* Tpetra - issue 354 fix

* Tpetra: refactor sameObject so it doesn't simultaneously ask for host and device views

* Tpetra: remove static_assert, fix getLocalView() ret type

Remove bad static_assert that tripped for Cuda/CudaUVMSpace build.
Correct MultiVector::getLocalView() return type to be exactly consistent with
DualView::view().

* tpetra:  fixed error in MultiVector pack that caused failures with UVM=ON

* tpetra:  Fix for FEMultivector -- rather than take the subview of a
DualView and create a new vector with it, use the MultiVector
constructor that gets "offset" views of a vector (in which
@brian-kelley has the owningView_ working correctly).
While I was at it, I added a swap of the owningView_ to the MultiVector
swap() function.

* Tpetra: Fixing ImportExport/Issue3968:  The tests uses sync_to* without changing the modify flags, which mucks up our internal tracking

* tpetra:  fix to work without UVM

* tpetra:  changed getLocalViewHost/Device to new Const/NonConst versions
as appropriate. #8591
Did not change getLocalView as the Const/NonConst versions of
getLocalView do not exist yet
Did not change MV_reduce_strided to avoid creating conflicts for
@brian-kelley

* tpetra: change getLocalViewHost to appropriate Const/NonConst version #8591

* Tpetra: Modifying MultiVector to remove all references to old getLocalViewX functions

* Tpetra: More getLocalView mods

* Tpetra: Lots and lots of fixes to tests to use the new getLocalView<thing>Const/NonConst functions

* Tpetra: Fixing scaleBlockDiagonal signature as per Brian

* Tpetra: Fixes to the BlockView test to work correctly with UVM=OFF

* Tpetra: Fixing MultiVector print outs for help with non-unified memory debugging

* Tpetra - missing getlocal view "device"

* Tpetra: public Access:: ReadOnly/ReadWrite/WriteOnly

Make WithLocalAccess use these tags instead of internal Details:: ones.
These will also be used for the new MultiVector view access interface.

* moving from getLocalView... to getLocalView...(Tpetra::Accesspattern)

* Tpetra - get1dview logic change

* Tpetra, WIP: using new tagged view access

* Tpetra: use new interface for all MV getLocalView

* tpetra:  removed unneeded include file

* Tpetra: Tags!

* Tpetra: Tags!

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing tests

* Tpetra: Fixing tests

* Tpetra: Fixing tests

* tpetra: copied implementation of getLocalViewHost and getLocalViewDevice
from templated getLocalView, as the getLocalView version does not work.
This commit may be temporary, but it allows us to make progress on other
bugs while someone figures out the template-fu.
Sorry for the debugging statements; we'll get rid of those eventually.

* adding localview tests

* tpetra:  getLocalView<template> now works.
cleaned up my obnoxious print statements
kept Host and Device implementations that do NOT use getLocalView.

* tpetra:  added Tpetra::Access to many getLocalView<> instances
Tests still pass with UVM=ON.

* Tpetra: Removing the dreaded parentheses from the Access tags

* Manually intercept UVM allocations, throw exception

Effectively makes it impossible for any UVM allocations to
exist (except for Stokhos, which calls cudaMallocManaged directly)

* Tpetra: Deprecate old getLocalView functions

* Allow UVM allocations when Kokkos_ENABLE_CUDA_UVM=ON

* tpetra:  changed getLocalView to use access tags and getLocalViewDevice

* tpetra:  added access tags to getLocalView(); fixed scope of some pointers

* xpetra:  fixes to allow compilation

* WIP: deprecate getLocalBlock and start adding tagged overloads

* Tpetra: rewrite allReduceView to work with non-UVM

allReduceView had one bug and one sub-optimal thing:
- Tried to make a view copy with both layout and device different -
  Kokkos can't do that in a single deep_copy
- If a LayoutStride -> contiguous copy needed to be made, it always used
  LayoutLeft. If one of the input/output views was LayoutStride and the
  other was LayoutRight, they would both be copied to LayoutLeft. Now, use
  LayoutRight in this case.

Some utilities to help manage layouts and MPI + Kokkos views in general
are in the new file temporaryViewUtils.hpp: layout unification,
making a contiguous view, and making an MPI-safe view.
In the future these can be used to clean up idot and
iallreduce without losing efficiency.

* Tpetra:  Block MultiVector correctly uses getLocalView; removed stored pointer

* fix host device type for const_little_host_vec_type

* tpetra:  clean up of BlockMultiVector fixes

* Tpetra:  deprecated held pointer mvData_

* tpetra:  removed modifies without syncs; fixed MueLu tests

* Tpetra - removing sync in ScaleAndAssign test

* Tpetra - unit test is okay without modify and sync flags

* Tpetra - test passes without modify and sync operations

* Tpetra - remove unnecessary sync modify clear state flags

* Tpetra - remove multi vector sync/modify/ things

* Tpetra - remove sync modify things in other places

* Tpetra: remove withLocalAccess, for_each, transform

The new MV::getLocalView interface is a simpler substitute for these.

* Issue 8391. Switched to C++17 standard for GCC 8.3 build.

* FROSch: Convert enum NullSpaceType to scoped enum

By converting the enum to an enum class NullSpaceType, one is forced to
use the enum class and cannot replace it with integers anymore. This
guarantees, that the expressive enum class is used in implementations
rather than the implicitly encoded integers.

* Patch in KokkosKernels #872

(fix #8727, TeamPolicy team size too large in sort_crs_*)
Adds the KokkosKernels unit test that replicated this issue.

* MueLu: Adding Aggregate size percentiles to AggregateQuality

* Moved Tpetra CRS GS into Ifpack2 Relaxation

* Moved BlockCrs GS functionality into Relaxation

* Enabled new local GS code for CRS

* Reduce redundant code in CRS (GS/SGS use same fn)

* Using refactored block CRS local apply, unify GS/SGS

* More refactoring to get rid of redundant functions

* Added required syncs/modifies for vectors

* Removed unneeded !constantStride paths

* Use cached MV to replace getColumnMapMV from CrsMatrix

* Ifpack2: remove unneeded includes

* Ifpack2: undo some find-and-replace in comments

Undoing some "Node" -> "node_type"

* MueLu: undo CMake change, should be its own PR

* MueLu: in configure, print out missing ETI setting

During configure, MueLu prints out the type combinations to ETI.
Add <complex, int, long long> to this, since it was missing.

* tpetra:  treat WriteOnly of subviews as ReadOnly.

* Ifpack2: in RBILUK, use tagged BMV::getLocalBlock

* Tpetra: add comment with caveat

on BMV::getLocalBlock(i, j, WriteOnly)

* tpetra: separated BugTests.cpp into separate test files so that we can
disable them separately (since they exercise different classes).

* Ifpack2: update BMV getLocalBlock calls

to use tagged access, and not use manual sync/modify (which has been
removed). With UVM, all Tpetra,Belos,Ifpack2,MueLu tests pass.

* more test changes

* mv localview tests

* wrapped up 6 tests for new behaviors

* tpetra:  scoping fix for Bug7234.cpp;
more output from getLocalView* when error occurs, as in parallel runs,
throw messages weren't always printed (e.g., from doExport when only
3/4 processors failed)

* Tpetra: add MV::aliases(const MV& other)

This allows a user to see if two MVs overlap, without actually getting
the local views and possibly hitting the reference count checker.

* Ifpack2: const correctness, use new getLocalView

- Throughout Ifpack2, remove manual sync/modify and calls to deprecated
  getLocalView. Use tagged getLocalView instead.
- In BlockRelaxation and the Containers, change interfaces to use const
  on views and multivectors that aren't actually modified

* Tpetra: fix one MV LocalView test, comment out another

We will make sure fix is OK, then uncomment and fix the other

* tpetra:  enable some Tpetra tests without UVM

* tpetra:  fix test for non-Cuda builds

* Ifpack2: fix more constness of apply vectors

* Kokkos: allow CudaUVMSpace::allocate again

Roll back change that made CudaUVMSpace::allocate throw
when UVM was not the default memory space for Cuda.

* tpetra:  changes needed to build with DEPRECATED_CODE=OFF #8821

* fix remaining test

* Tpetra - fix for nox failure

* Thyra: added missing fences to euclidean apply operations used
in MvTimesMatAddMv; the fences resolve test failures with
CUDA_LAUNCH_BLOCKING=0 and cleaner sync/modify in tpetra @rppawlo

Tpetra: the fences above provide a more surgical fix to the test
errors seen in #8821; this commit removes fences from
getLocalView*(ReadOnly).  @kyungjoo-kim

Belos: preventive fence added with @hkthorn's blessing
to mimic those in Thyra.

* tpetra: added fence between device kernels and retrieving blocks on host #8821

* Ifpack2: Minor fix

* DualView: make fencing behavior in sync consistent

sync<Device>() does extra exec space fences if the dev/host memory
spaces are the same. This was missing in sync_host/sync_device, so
this adds it there. Makes all Ifpack2 tests for UVM without launch
blocking.

* tpetra:  exercise the Teuchos-based interfaces, too

* changed access control from WriteOnly to OverwriteAll because semantics mean things

* WIP: fixing idot for MV dualview refactor

And some updates to ifpack2 and amesos2 about that.
Working around Kokkos issue #3850 where the templated getLocalView was
used.

* WIP: idot/iallreduce cleanup

* Tpetra: finish idot/iallreduce refactor

* Fixed iallreduce test for non-uvm device

* Belos: use new Tpetra MV view interface

* Cleanup

* Remove extra dualview sync fences

* Ifpack2 passes without launch blocking

except RBILUK.

* Ifpack2: add temporary fence in RBILUK for BlockCrs

Later it should be possible to replace this fence with a refactored
DualView interface to BlockCrs.

* Tpetra: add a global reduce to a test so it will fail when only one proc is failing

* Tpetra: fix some typos in a Map unit test

* Tpetra: remove deprecated sync/modify calls from a unit test

* Ifpack2: fix impl_scalar/scalar mismatch

* Tpetra: remove/update remaining mentions of Gauss-Seidel

* Tpetra: fix iallreduce for builds without MPI

* Ifpack2: revert commenting out try/catch

Was causing unused var warning

* Ifpack2: Fixing vector mode mistake

* tpetra, ifpack2:  fixing several access mode errors

* Tpetra: use new MV view interface in Bug8794 test

* Amesos2: revert using tagged Tpetra MV getLocalView

for some reason, using ReadOnly tag to access MV view in
TpetraMultivecAdapter caused solve solution to not get copied back to
the Tpetra multivector. This is surprising because the views were just
used as the source for a Kokkos deep copy, and this caused
BlockRelaxation in Ifpack2 to fail for serial node (in which DualViews
are trivial, and all kernels are synchronous)

* Ifpack2: add back tag clobbered by merge

* kokkos:  patch from kokkos/kokkos#3857

* comment out all the instances of TPETRA_DEPRECATED (#9023)

* MueLu: add fence for recent intrepid2 changes

Fixes MueLu-Intrepid2 unit tests, uvm, no launch blocking.

* Tpetra: restore MV_reduce_strided test.

Key: use the MV (map, dualview, orig_dualview) constructor instead of the
(map, dualview) constructor. If $dualview is noncontiguous, the first one
lets you pass orig_dualview as the contiguous super-view containing
dualview, and orig_dualview can be sync'd without problems.

Also modify TempView::toLayout() to test span_is_contiguous, rather than
assuming that (Layout != LayoutStride) implies contiguous.

* tpetra:  Removed deprecated sync_device calls

* Tpetra: Remove some MultiVector that were checking modification state (#9032)

* Tpetra: Deprecate need_sync* in MultiVector

* Tpetra: for now, we won't deprecate need_sync_host/device

* tpetra:  removed instantiations of removed tests

* Tpetra: don't use CudaSpace in nonblocking collectives

OpenMPI does not support Cuda device buffers for nonblocking collectives
like MPI_Iallreduce, even with a Cuda-aware installation.

* Fix old typo in Ifpack2_UnitTestBlockRelaxation

* Fix access tag: OverwriteAll -> ReadWrite

Tpetra::COPY takes src then dst (opposite order to Kokkos deep_copy) so Y_cur is being read at first and written later.

* Undo bad DualView merge

Co-authored-by: Brian Kelley
Co-authored-by: Kyungjoo Kim
Co-authored-by: Chris Siefert
Co-authored-by: Geoff Danielson
Co-authored-by: Timothy A. Smith
Co-authored-by: James M. Willenbring
Co-authored-by: Matthias Mayr
Co-authored-by: Timothy Smith
kddevin added a commit that referenced this issue May 4, 2021
* Tempus: Remove ParameterList from IntegratorBasic

Remove all the internal uses of ParameterList from IntegratorBasic.
This means moving the variables in the IntegratorBasic ParameterList
to member data. Integrator will no longer inherit from
Teuchos::ParameterListAcceptor. However IntegratorBasic can still
be built from a ParameterList, and will still provide a valid
ParameterList.

 * To break up these changes, created a copy of IntegratorBasic
   (i.e., IntegratorBasicOld) for the sensitivity analysis integrators
    - IntegratorAdjointSensitivity
    - IntegratorForwardSensitivity
    - IntegratorPseudoTransientAdjointSensitivity
    - IntegratorPseudoTransientForwardSensitivity
   so these can be upgraded in another PR.
 * IntegratorBasic is no longer inherited from ParameterListAcceptor.
    - Removed setParameterList.
    - IntegratorBasic constructors using ParameterLists have moved to
      nonmember constructors, e.g.,
      . integratorBasic(pl, model) --> createIntegratorBasic(pl, model)
      . IntegratorBasic(pl, model) --> createIntegratorBasic(pl, model)
    - Member data ParameterLists are removed.
    - Kept getValidParameters(), which now returns a ParameterList
      with the current values. Still matches ParameterListAcceptor
      signature.
 * Ensured that ParameterList names were correctly set, so
   getValidParameters() could be used to create nested ParameterLists,
   e.g., IntegratorBasic->Stepper->Solver.
 * Made getValidParametersBasic() a member function of Stepper class.
 * Simplified setting the Stepper to just setStepper(stepper).
 * Added method to set model on stepper in IntegratorBasic,
   i.e., setModel(model).
 * The integrator observer is no longer a composite observer.
   It is simply a base class observer.
 * All internal IntegratorBasic references to member ParameterLists
   are changed to member data.
 * Added member data for the integrator name and type.  Name is a
   label that is used for identification, e.g., 'My Integrator Basic'.
   Type defines the derived class being used, e.g., 'Integrator Basic'.
 * Added a shallow copy for the SolutionHistory.

* Tempus: Remove ParameterList from Internals of IntegratorBasic.

 * Changed Piro and ROL to use IntegratorBasicOld.  Will move to
   IntegratorBasic in future PR.
 * Added documentation on StepperName and StepperType to help
   distinguish between them.
 * setStepperType() is now a protected function of the Stepper
   class, which should help distinguish it against StepperName.
 * IMEX_RK and IMEX_RK_Partition now requires the stepperType
   in the default constructor to completely build it.  These
   should be changed to have a base class IMEX_RK and derived
   classes for each stepperType (similarly for IMEX_RK_Partition).
 * Fixed several misuses of stepperName and stepperType in source
   code and in unit tests.
 * Fixed some usage of Stepper aliases.

* Tpetra: add backup of scripts for perf testing

eclipse, vortex and stria env/build scripts.
These are used on SRN Jenkins and Watchr.

* cherry pick Kokkos-kernels PR #921
 Two-stage GS: add damping factors #921

* expose new options for two-stage GS from Ifpack2

* describe the two-stage parameters more in comments

* MueLu: Enable reuse of Ifpack2 smoothers

* Add openmpi 4.0.5 toolchain for VAN1

User Support Ticket(s) or Story Referenced: SPAR-969

* Add ctest drivers for new toolchains

* Correct an ordering issue and add tests

* ATDM/van1-tx2: Disable build stats

* re-basing muelu gold files with new two-stage gs parameters

* Replace VerifyExecutionCanAccessMemorySpace usage

Teuchos, Tpetra, Sacado, Stokhos: replace usage of deprecated
VerifyExecutionCanAccessMemorySpace with SpaceAccessibility for
compatibility with Kokkos.
See kokkos/kokkos#3813 for relevant changes.

* blake atdm environment: update cmake to 3.19.3

* Ifpack2 Hypre: Fix link errors when multiple node types are used

* Fixes mismatched new/delete and memory leaks in reused solver objects

Integrating the templated Basker solver directly into Xyce to perform
custom linear solves for harmonic balance (HB) analysis, some memory
issues were noticed.  First there is a mismatch in the new and delete
used to create and destroy the pinv object, respectively.  If the same
solver object is reused, the L and U factors are leaked.  Also some
internal workspace is not properly cleaned up.  For clarity, the
internal workspace objects have been refactored and the L and U
factors are deleted if the solver object is called to perform a
numeric factorization when one already exists.

This addresses the memory issues that have been observed through
valgrind.

* casting inner-damping factor to complex only if scalar-type is complex.

* MueLu CreateOperator test: Set verbosity so that default values are ignored

* MueLu: Update gold files

* Revert "re-basing muelu gold files with new two-stage gs parameters"

This reverts commit feea4d8.

* trilinos_couplings: Replace VerifyExecutionCanAccessMemorySpace

replace usage of deprecated VerifyExecutionCanAccessMemorySpace
with SpaceAccessibility for compatibility with Kokkos.
See kokkos/kokkos#3813 for relevant changes.

* SEACAS: Fix warnings and memory leaks

Automatic snapshot commit from seacas at 29acd7f151

Origin repo remote tracking branch: 'origin/master'
Origin repo remote repo URL: 'origin = https:/gsjaardema/seacas'

At commit:

commit 29acd7f1510bf729084274fe0cead3ef5e815dd8
Author:  Greg Sjaardema <[email protected]>
Date:    Wed Apr 14 10:29:28 2021 -0600
Summary: sync back sierra-build changes [ci skip]

    EXODIFF: Eliminate edge/face block memory leaks
    PLT: Support for flang/f18
    APREPRO: Better support for array memory leak management
    Fix compilation warnings on nvidia
    IOSS: Eliminate long compile when sanitizer enabled
    IOSS: Eliminate compiler warning
    APREPRO: Eliminate array memory leaks

* Ctest: Fixing emailer for ascicgpu031

* Automatic snapshot commit from tribits at 18aed92

Origin repo remote tracking branch: 'github/master'
Origin repo remote repo URL: 'github = [email protected]:TriBITSPub/TriBITS.git'

At commit:

commit 18aed92550ebc8e8e0f7da2a5d38cd4eaa192e1f
Author:  Roscoe A. Bartlett <[email protected]>
Date:    Thu Apr 15 09:45:52 2021 -0600
Summary: Allow TRIBITS_ADD_ADVANCED_TEST_MAX_NUM_TEST_BLOCKS to be changed (#136)

* Tempus: Replace Logical Operators

The ROL team has received requests from Windows users to remove the
alternative logical operators 'and' and 'or' in favor of && and ||.
The C++ standard includes 'and' and 'or', of course. However, the
MS compiler only supports them in the -permissive mode, which
apparently isn't allowed in many companies, due to the quality
control niceties that the non-permissive (standard) mode provides.

Trilinos does not support Windows and does not even have platforms
to test on. This change should be straightforward, but there is
no mechanism to prevent regression. Microsoft needs to support
the C++ standard!

Not sure all instances were found and there is no compiler flag
to throw on this.
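
To illustrate the replacement described above, here is a small sketch. The function names are hypothetical and this is not Tempus code. Both functions compute the same result; only the spelling of the logical operators differs.

```cpp
// Before: alternative tokens 'and' / 'or' (standard C++, but see the note above
// about the MS compiler).
constexpr bool inWindowAlt(double t, double lo, double hi, bool force) {
  return (t >= lo and t <= hi) or force;
}

// After: the symbolic operators, which the commit switches to.
constexpr bool inWindowSym(double t, double lo, double hi, bool force) {
  return (t >= lo && t <= hi) || force;
}
```

Because the two spellings are the same operators, the change is purely textual and should not alter behavior.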

* Galeri Xpetra: Add anisotropic diffusion problem

* MueLu: fix agg export for multiple dofs per node

* muelu:  changes needed to handle Kokkos::complex in Cuda builds

* Turn off 3 Zoltan tests that fail  due to a bug in spectrummpi

See #8798 for the discussion

* Geminga CUDA nightly: Enable complex

* Add short reason for the disablement and note the issue for more details

* Galeri: Fix issue in boundary conditions

* MueLu RefMaxwell: Pass corrected nullspace to coarse (1,1) hierarchy

* Xpetra: Add EpetraInverseOperator

Its 'apply' calls 'ApplyInverse' instead of 'Apply'

* TrilinosCouplings: Fix scaling of CurlCurl in Maxwell example

* tpetra:  adding test of branch Tpetra_UVM_Removal for SAKE

* STK: Snapshot 04-21-21 11:17 (#9039)

* add a few more timers

* make default for GmresSingleReduce as single-reduce MGS and no Newton basis

* make "delayed normalization" default for single-reduce MGS

* Intrepid2: fix 8801 (second attempt) (#9044)

* Intrepid2: resolve uninitialized variable warnings.

* Intrepid2: move lambda implementation into a functor to work around apparent CUDA compiler bug.  PR #9044, fix for #8801 (second attempt).

* When compiling with IBM Clang 11 + Cuda, Zoltan2 MJ captures 'this' inside lambdas

This patch
* Marks 1 function static, which was class const (but used no class variables).
* addresses this-> being used inside a nested team lambda (which clang doesn't like)
  To remove this, I moved sEpsilon as parameter to the function being called and
  was able to mark the function static as well (removing its use of this->sEpsilon)
  This entails creating a function-local copy of sEpsilon so that it may be
  captured by the default [=] capture.

Both changes entail adding a `using` statement within the function so that the
now static class functions can be called (e.g., AlgMJ< ... >::
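
The pattern in this patch can be sketched in plain host C++, without Kokkos or CUDA. `ToyMJ`, `nearlyEqual`, and `matches` are hypothetical names and this is not the Zoltan2 code. The tolerance is passed to a static function as a parameter, and the lambda captures a function-local copy rather than reaching through `this`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical class; the names only mirror the shape of the change described above.
class ToyMJ {
 public:
  explicit ToyMJ(double eps) : sEpsilon(eps) {}

  // Static: reads no member data, so the tolerance arrives as an argument.
  static bool nearlyEqual(double a, double b, double eps) {
    return std::fabs(a - b) < eps;
  }

  bool matches(double a, double b) const {
    const double eps = sEpsilon;  // function-local copy of the member
    // [=] copies eps by value; the lambda body never touches 'this'.
    auto check = [=](double x, double y) { return ToyMJ::nearlyEqual(x, y, eps); };
    return check(a, b);
  }

 private:
  double sEpsilon;
};
```

For example, `ToyMJ(0.1).matches(1.05, 1.0)` compares the two values using the copied tolerance. This sketch qualifies the static call with the class name rather than adding a `using` statement.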

* Snapshot of kokkos.git from commit 04b8196e0e3bfc4cee4047dbbbb13fc227730fe8

From repository at [email protected]:kokkos/kokkos.git

At commit:
commit 04b8196e0e3bfc4cee4047dbbbb13fc227730fe8
Merge: 1fb0c284 ffc35a82
Author: Nathan Ellingwood <[email protected]>
Date:   Mon Apr 26 00:14:56 2021 -0600

    Merge branch 'release-candidate-3.4.0' for 3.4.00

    Part of Kokkos C++ Performance Portability Programming EcoSystem 3.4

* Snapshot of kokkos-kernels.git from commit 3eb6a9298b58f224b876b6e29cda4491cddc53c5

From repository at [email protected]:kokkos/kokkos-kernels.git

At commit:
commit 3eb6a9298b58f224b876b6e29cda4491cddc53c5
Merge: fe439b21 dd0d4ef8
Author: Nathan Ellingwood <[email protected]>
Date:   Mon Apr 26 00:16:08 2021 -0600

    Merge branch 'release-candidate-3.4.0' for 3.4.00

    Part of Kokkos C++ Performance Portability Programming EcoSystem 3.4

* Turn on Intrepid2 per #8310

* MueLu: Add test for aggregation export

* Ifpack2 Relaxation&Chebyshev: Use offsets in more cases

* Framework: update messaging on issue autocloser bot

* Framework: Update autocloser throttle limit to 70

* MueLu: removed unused lines from agg export test

* Tpetra MultiVector and BlockMultiVector refactor to remove UVM requirement (#8821)

* Tpetra: add new user-friendly MV view access

Also add new "owningView_" DualView member that refers to
the actual original DV (not a subview of anything else). This
is the DualView to sync in order to maintain consistency regardless
of how MultiVectors alias each other.

4 new view accessor functions: getLocalView[Host|Device][Non]Const()

- Respect constness
- Manage syncs and modifies for the user
- Prevent taking out a view in one space while any view in the other
space is live.
- Existing getLocalView()/getLocalViewHost()/getLocalViewDevice() just
have the reference count checking added (no sync/modify). This has no
effect for HostSpace or CudaUVMSpace since those host mirrors match the
device views.

* Tpetra - fix MV test 14.

* Tpetra - fix item 17

* Tpetra - fix item 20

* Tpetra - fix item 23

* Tpetra - fix item 28

* Tpetra - fix item 29

* Tpetra - fix item 35

* Tpetra - workaround for item 30

* Tpetra: Modifying Bug7758 test to use the new getLocalViewHostConst (which will make sure things are actually sync'd)

* Tpetra: fix MV [un]pack to respect host/device refcounts

* fix nonconst in Bug7745

* Tpetra: stashing

* Tpetra - issue 354 fix

* Tpetra: refactor sameObject so it doesn't simultaneously ask for host and device views

* Tpetra: remove static_assert, fix getLocalView() ret type

Remove bad static_assert that tripped for Cuda/CudaUVMSpace build.
Correct MultiVector::getLocalView() return type to be exactly consistent with
DualView::view().

* tpetra:  fixed error in MultiVector pack that caused failures with UVM=ON

* tpetra:  Fix for FEMultivector -- rather than take the subview of a
DualView and create a new vector with it, use the MultiVector
constructor that gets "offset" views of a vector (in which
@brian-kelley has the owningView_ working correctly).
While I was at it, I added a swap of the owningView_ to the MultiVector
swap() function.

* Tpetra: Fixing ImportExport/Issue3968:  The test uses sync_to* without changing the modify flags, which mucks up our internal tracking

* tpetra:  fix to work without UVM

* tpetra:  changed getLocalViewHost/Device to new Const/NonConst versions
as appropriate. #8591
Did not change getLocalView as the Const/NonConst versions of
getLocalView do not exist yet
Did not change MV_reduce_strided to avoid creating conflicts for
@brian-kelley

* tpetra: change getLocalViewHost to appropriate Const/NonConst version #8591

* Tpetra: Modifying MultiVector to remove all references to old getLocalViewX functions

* Tpetra: More getLocalView mods

* Tpetra: Lots and lots of fixes to tests to use the new getLocalView<thing>Const/NonConst functions

* Tpetra: Fixing scaleBlockDiagonal signature as per Brian

* Tpetra: Fixes to the BlockView test to work correctly with UVM=OFF

* Tpetra: Fixing MultiVector print outs for help with non-unified memory debugging

* Tpetra - missing getlocal view "device"

* Tpetra: public Access:: ReadOnly/ReadWrite/WriteOnly

Make WithLocalAccess use these tags instead of internal Details:: ones.
These will also be used for the new MultiVector view access interface.

* moving from getLocalView... to getLocalView...(Tpetra::Accesspattern)

* Tpetra - get1dview logic change

* Tpetra, WIP: using new tagged view access

* Tpetra: use new interface for all MV getLocalView

* tpetra:  removed unneeded include file

* Tpetra: Tags!

* Tpetra: Tags!

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing tests

* Tpetra: Fixing tests

* Tpetra: Fixing tests

* tpetra: copied implementation of getLocalViewHost and getLocalViewDevice
from templated getLocalView, as the getLocalView version does not work.
This commit may be temporary, but it allows us to make progress on other
bugs while someone figures out the template-fu.
Sorry for the debugging statements; we'll get rid of those eventually.

* adding localview tests

* tpetra:  getLocalView<template> now works.
cleaned up my obnoxious print statements
kept Host and Device implementations that do NOT use getLocalView.

* tpetra:  added Tpetra::Access to many getLocalView<> instances
Tests still pass with UVM=ON.

* Tpetra: Removing the dreaded parentheses from the Access tags

* Manually intercept UVM allocations, throw exception

Effectively makes it impossible for any UVM allocations to
exist (except for Stokhos, which calls cudaMallocManaged directly)

* Tpetra: Deprecate old getLocalView functions

* Allow UVM allocations when Kokkos_ENABLE_CUDA_UVM=ON

* tpetra:  changed getLocalView to use access tags and getLocalViewDevice

* tpetra:  added access tags to getLocalView(); fixed scope of some pointers

* xpetra:  fixes to allow compilation

* WIP: deprecate getLocalBlock and start adding tagged overloads

* Tpetra: rewrite allReduceView to work with non-UVM

allReduceView had one bug and one sub-optimal thing:
- Tried to make a view copy with both layout and device different -
  Kokkos can't do that in a single deep_copy
- If a LayoutStride -> contiguous copy needed to be made, it always used
  LayoutLeft. If one of the input/output views was LayoutStride and the
  other was LayoutRight, they would both be copied to LayoutLeft. Now, use
  LayoutRight in this case.

Some utilities to help manage layouts and MPI + Kokkos views in general
are in the new file temporaryViewUtils.hpp: layout unification,
making a contiguous view, and making an MPI-safe view.
In the future these can be used to clean up idot and
iallreduce without losing efficiency.

* Tpetra:  Block MultiVector correctly uses getLocalView; removed stored pointer

* fix host device type for const_little_host_vec_type

* tpetra:  clean up of BlockMultiVector fixes

* Tpetra:  deprecated held pointer mvData_

* tpetra:  removed modifies without syncs; fixed MueLu tests

* Tpetra - removing sync in ScaleAndAssign test

* Tpetra - unit test is okay without modify and sync flags

* Tpetra - test passes without modify and sync operations

* Tpetra - remove unnecessary sync modify clear state flags

* Tpetra - remove multi vector sync/modify/ things

* Tpetra - remove sync modify things in other places

* Tpetra: remove withLocalAccess, for_each, transform

The new MV::getLocalView interface is a simpler substitute for these.

* Issue 8391. Switched to C++17 standard for GCC 8.3 build.

* FROSch: Convert enum NullSpaceType to scoped enum

By converting the enum to an enum class NullSpaceType, one is forced to
use the enum class and cannot replace it with integers anymore. This
guarantees, that the expressive enum class is used in implementations
rather than the implicitly encoded integers.
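
A short sketch of the difference, using hypothetical enumerator names (FROSch's actual values may differ). An unscoped enum converts implicitly to `int`, so code can compare it against bare integers. A scoped `enum class` does not, so callers have to name the enumerator.

```cpp
#include <type_traits>

// Hypothetical enumerators, for illustration only.
enum LegacyNullSpace { LegacyLaplace, LegacyElasticity };  // unscoped: decays to int
enum class NullSpaceType { Laplace, Elasticity };          // scoped: no implicit conversion

// Callers must spell out the enumerator; 'isLaplace(0)' does not compile.
constexpr bool isLaplace(NullSpaceType t) {
  return t == NullSpaceType::Laplace;
}
```

With the scoped version, an expression such as `t == 0` is a compile error, which is the guarantee the commit message refers to.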

* Patch in KokkosKernels #872

(fix #8727, TeamPolicy team size too large in sort_crs_*)
Adds the KokkosKernels unit test that replicated this issue.
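
For context, the constraint from the error message in this issue (team size times vector length bounded by 1024) can be sketched in plain C++. `clampTeamSize` is a hypothetical helper, not the KokkosKernels fix itself. It treats the limit as inclusive and assumes the vector length is positive and does not exceed the limit. Both are assumptions.

```cpp
// Hypothetical helper: shrink the requested team size until
// teamSize * vectorLength fits within the per-block thread limit.
// Assumes 0 < vectorLength <= maxThreadsPerBlock.
constexpr int clampTeamSize(int requestedTeamSize, int vectorLength,
                            int maxThreadsPerBlock = 1024) {
  return (requestedTeamSize * vectorLength <= maxThreadsPerBlock)
             ? requestedTeamSize
             : maxThreadsPerBlock / vectorLength;
}
```

In real Kokkos code the usual approach is to let the policy choose, for example with `Kokkos::AUTO` or by querying `team_size_max` for the specific kernel, rather than hard-coding a limit as this sketch does.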

* MueLu: Adding Aggregate size percentiles to AggregateQuality

* Moved Tpetra CRS GS into Ifpack2 Relaxation

* Moved BlockCrs GS functionality into Relaxation

* Enabled new local GS code for CRS

* Reduce redundant code in CRS (GS/SGS use same fn)

* Using refactored block CRS local apply, unify GS/SGS

* More refactoring to get rid of redundant functions

* Added required syncs/modifies for vectors

* Removed unneeded !constantStride paths

* Use cached MV to replace getColumnMapMV from CrsMatrix

* Ifpack2: remove unneeded includes

* Ifpack2: undo some find-and-replace in comments

Undoing some "Node" -> "node_type"

* MueLu: undo CMake change, should be its own PR

* MueLu: in configure, print out missing ETI setting

During configure, MueLu prints out the type combinations to ETI.
Add <complex, int, long long> to this, since it was missing.

* tpetra:  treat WriteOnly of subviews as ReadOnly.

* Ifpack2: in RBILUK, use tagged BMV::getLocalBlock

* Tpetra: add comment with caveat

on BMV::getLocalBlock(i, j, WriteOnly)

* tpetra: separated BugTests.cpp into separate test files so that we can
disable them separately (since they exercise different classes).

* Ifpack2: update BMV getLocalBlock calls

to use tagged access, and not use manual sync/modify (which has been
removed). With UVM, all Tpetra,Belos,Ifpack2,MueLu tests pass.

* more test changes

* mv localview tests

* wrapped up 6 tests for new behaviors

* tpetra:  scoping fix for Bug7234.cpp;
more output from getLocalView* when error occurs, as in parallel runs,
throw messages weren't always printed (e.g., from doExport when only
3/4 processors failed)

* Tpetra: add MV::aliases(const MV& other)

This allows a user to see if two MVs overlap, without actually getting
the local views and possibly hitting the reference count checker.

* Ifpack2: const correctness, use new getLocalView

- Throughout Ifpack2, remove manual sync/modify and calls to deprecated
  getLocalView. Use tagged getLocalView instead.
- In BlockRelaxation and the Containers, change interfaces to use const
  on views and multivectors that aren't actually modified

* Tpetra: fix one MV LocalView test, comment out another

We will make sure fix is OK, then uncomment and fix the other

* tpetra:  enable some Tpetra tests without UVM

* tpetra:  fix test for non-Cuda builds

* Ifpack2: fix more constness of apply vectors

* Kokkos: allow CudaUVMSpace::allocate again

Roll back change that made CudaUVMSpace::allocate throw
when UVM was not the default memory space for Cuda.

* tpetra:  changes needed to build with DEPRECATED_CODE=OFF #8821

* fix remaining test

* Tpetra - fix for nox failure

* Thyra: added missing fences to euclidean apply operations used
in MvTimesMatAddMv; the fences resolve test failures with
CUDA_LAUNCH_BLOCKING=0 and cleaner sync/modify in tpetra @rppawlo

Tpetra: the fences above provide a more surgical fix to the test
errors seen in #8821; this commit removes fences from
getLocalView*(ReadOnly).  @kyungjoo-kim

Belos: preventive fence added with @hkthorn's blessing
to mimic those in Thyra.

* tpetra: added fence between device kernels and retrieving blocks on host #8821

* Ifpack2: Minor fix

* DualView: make fencing behavior in sync consistent

sync<Device>() does extra exec space fences if the dev/host memory
spaces are the same. This was missing in sync_host/sync_device, so
this adds it there. Makes all Ifpack2 tests pass for UVM without launch
blocking.

* tpetra:  exercise the Teuchos-based interfaces, too

* changed access control from WriteOnly to OverwriteAll because semantics mean things

* WIP: fixing idot for MV dualview refactor

And some updates to ifpack2 and amesos2 about that.
Working around Kokkos issue #3850 where the templated getLocalView was
used.

* WIP: idot/iallreduce cleanup

* Tpetra: finish idot/iallreduce refactor

* Fixed iallreduce test for non-uvm device

* Belos: use new Tpetra MV view interface

* Cleanup

* Remove extra dualview sync fences

* Ifpack2 passes without launch blocking

except RBILUK.

* Ifpack2: add temporary fence in RBILUK for BlockCrs

Later it should be possible to replace this fence with a refactored
DualView interface to BlockCrs.

* Tpetra: add a global reduce to a test so it will fail when only one proc is failing

* Tpetra: fix some typos in a Map unit test

* Tpetra: remove deprecated sync/modify calls from a unit test

* Ifpack2: fix impl_scalar/scalar mismatch

* Tpetra: remove/update remaining mentions of Gauss-Seidel

* Tpetra: fix iallreduce for builds without MPI

* Ifpack2: revert commenting out try/catch

Was causing unused var warning

* Ifpack2: Fixing vector mode mistake

* tpetra, ifpack2:  fixing several access mode errors

* Tpetra: use new MV view interface in Bug8794 test

* Amesos2: revert using tagged Tpetra MV getLocalView

for some reason, using ReadOnly tag to access MV view in
TpetraMultivecAdapter caused solve solution to not get copied back to
the Tpetra multivector. This is surprising because the views were just
used as the source for a Kokkos deep copy, and this caused
BlockRelaxation in Ifpack2 to fail for serial node (in which DualViews
are trivial, and all kernels are synchronous)

* Ifpack2: add back tag clobbered by merge

* kokkos:  patch from kokkos/kokkos#3857

* comment out all the instances of TPETRA_DEPRECATED (#9023)

* MueLu: add fence for recent intrepid2 changes

Fixes MueLu-Intrepid2 unit tests, uvm, no launch blocking.

* Tpetra: restore MV_reduce_strided test.

Key: use the MV (map, dualview, orig_dualview) constructor instead of the
(map, dualview) constructor. If $dualview is noncontiguous, the first one
lets you pass orig_dualview as the contiguous super-view containing
dualview, and orig_dualview can be sync'd without problems.

Also modify TempView::toLayout() to test span_is_contiguous, rather than
assuming that (Layout != LayoutStride) implies contiguous.

* tpetra:  Removed deprecated sync_device calls

* Tpetra: Remove some MultiVector that were checking modification state (#9032)

* Tpetra: Deprecate need_sync* in MultiVector

* Tpetra: for now, we won't deprecate need_sync_host/device

* tpetra:  removed instantiations of removed tests

* Tpetra: don't use CudaSpace in nonblocking collectives

OpenMPI does not support Cuda device buffers for nonblocking collectives
like MPI_Iallreduce, even with a Cuda-aware installation.

* Fix old typo in Ifpack2_UnitTestBlockRelaxation

* Fix access tag: OverwriteAll -> ReadWrite

Tpetra::COPY takes src then dst (opposite order to Kokkos deep_copy) so Y_cur is being read at first and written later.

* Undo bad DualView merge

Co-authored-by: Brian Kelley <[email protected]>
Co-authored-by: Kyungjoo Kim <[email protected]>
Co-authored-by: Chris Siefert <[email protected]>
Co-authored-by: Geoff Danielson <[email protected]>
Co-authored-by: Timothy A. Smith <[email protected]>
Co-authored-by: James M. Willenbring <[email protected]>
Co-authored-by: Matthias Mayr <[email protected]>
Co-authored-by: Timothy Smith <[email protected]>

* ascicgpu031: Testing updates

* MueLu: agg export does not play well with kokkos_aggregates

* Ctest: Adding Belos to email script

* Ctest: Adding Belos to email script

* Ctest: Adding Belos to email script

* Setting Tpetra Deprecated Code = ON #9067

* Ifpack & Ifpack2: Fix tiny bug in L1 method

* KokkosKernels: Fix bug in Serial specialization of spmv

Will only hit spmvs with a beta of exactly -1 using the Serial backend.

* Ifpack2: Add single kernel for diagonal extraction, L1 and small entry fix

* Disable tests in the UVM Off build

* MueLu: correct for variation due to roundoff in Convex Hulls

* Ifpack2 Relaxation: Add missing typedefs

Co-authored-by: Curtis C. Ober <[email protected]>
Co-authored-by: Brian Kelley <[email protected]>
Co-authored-by: iyamaza <[email protected]>
Co-authored-by: Christian Glusa <[email protected]>
Co-authored-by: Samuel Browne <[email protected]>
Co-authored-by: Evan Harvey <[email protected]>
Co-authored-by: Evan Harvey <[email protected]>
Co-authored-by: Nathan Ellingwood <[email protected]>
Co-authored-by: trilinos-autotester <[email protected]>
Co-authored-by: Heidi K. Thornquist <[email protected]>
Co-authored-by: Christian Glusa <[email protected]>
Co-authored-by: Jonathan Hu <[email protected]>
Co-authored-by: gsjaardema <[email protected]>
Co-authored-by: Chris Siefert <[email protected]>
Co-authored-by: Roscoe A. Bartlett <[email protected]>
Co-authored-by: Peter Ohm <[email protected]>
Co-authored-by: Paul Wolfenbarger <[email protected]>
Co-authored-by: Alan Williams <[email protected]>
Co-authored-by: iyamazaki <[email protected]>
Co-authored-by: Nate Roberts <[email protected]>
Co-authored-by: James J. Elliott <[email protected]>
Co-authored-by: Jennifer Loe <[email protected]>
Co-authored-by: Henry Swantner <[email protected]>
Co-authored-by: Christian Trott <[email protected]>
Co-authored-by: William McLendon <[email protected]>
Co-authored-by: Kyungjoo Kim <[email protected]>
Co-authored-by: Geoff Danielson <[email protected]>
Co-authored-by: Timothy A. Smith <[email protected]>
Co-authored-by: James M. Willenbring <[email protected]>
Co-authored-by: Matthias Mayr <[email protected]>
Co-authored-by: Timothy Smith <[email protected]>
Co-authored-by: James Elliott <[email protected]>
jrobcary pushed a commit to Tech-XCorp/Trilinos that referenced this issue May 4, 2021
…ement (trilinos#8821)

* Tpetra: add new user-friendly MV view access

Also add new "owningView_" DualView member that refers to
the actual original DV (not a subview of anything else). This
is the DualView to sync in order to maintain consistency regardless
of how MultiVectors alias each other.

4 new view accessor functions: getLocalView[Host|Device][Non]Const()

- Respect constness
- Manage syncs and modifies for the user
- Prevent taking out a view in one space while any view in the other
space is live.
- Existing getLocalView()/getLocalViewHost()/getLocalViewDevice() just
have the reference count checking added (no sync/modify). This has no
effect for HostSpace or CudaUVMSpace since those host mirrors match the
device views.

* Tpetra - fix MV test 14.

* Tpetra - fix item 17

* Tpetra - fix item 20

* Tpetra - fix item 23

* Tpetra - fix item 28

* Tpetra - fix item 29

* Tpetra - fix item 35

* Tpetra - workaround for item 30

* Tpetra: Modifying Bug7758 test to use the new getLocalViewHostConst (which will make sure things are actually sync'd)

* Tpetra: fix MV [un]pack to respect host/device refcounts

* fix nonconst in Bug7745

* Tpetra: stashing

* Tpetra - issue 354 fix

* Tpetra: refactor sameObject so it doesn't simultaneously ask for host and device views

* Tpetra: remove static_assert, fix getLocalView() ret type

Remove bad static_assert that tripped for Cuda/CudaUVMSpace build.
Correct MultiVector::getLocalView() return type to be exactly consistent with
DualView::view().

* tpetra:  fixed error in MultiVector pack that caused failures with UVM=ON

* tpetra:  Fix for FEMultivector -- rather than take the subview of a
DualView and create a new vector with it, use the MultiVector
constructor that gets "offset" views of a vector (in which
@brian-kelley has the owningView_ working correctly).
While I was at it, I added a swap of the owningView_ to the MultiVector
swap() function.

* Tpetra: Fixing ImportExport/Issue3968:  The test uses sync_to* without changing the modify flags, which mucks up our internal tracking

* tpetra:  fix to work without UVM

* tpetra:  changed getLocalViewHost/Device to new Const/NonConst versions
as appropriate. trilinos#8591
Did not change getLocalView as the Const/NonConst versions of
getLocalView do not exist yet
Did not change MV_reduce_strided to avoid creating conflicts for
@brian-kelley

* tpetra: change getLocalViewHost to appropriate Const/NonConst version trilinos#8591

* Tpetra: Modifying MultiVector to remove all references to old getLocalViewX functions

* Tpetra: More getLocalView mods

* Tpetra: Lots and lots of fixes to tests to use the new getLocalView<thing>Const/NonConst functions

* Tpetra: Fixing scaleBlockDiagonal signature as per Brian

* Tpetra: Fixes to the BlockView test to work correctly with UVM=OFF

* Tpetra: Fixing MultiVector print outs for help with non-unified memory debugging

* Tpetra - missing getlocal view "device"

* Tpetra: public Access:: ReadOnly/ReadWrite/WriteOnly

Make WithLocalAccess use these tags instead of internal Details:: ones.
These will also be used for the new MultiVector view access interface.

* moving from getLocalView... to getLocalView...(Tpetra::Accesspattern)

* Tpetra - get1dview logic change

* Tpetra, WIP: using new tagged view access

* Tpetra: use new interface for all MV getLocalView

* tpetra:  removed unneeded include file

* Tpetra: Tags!

* Tpetra: Tags!

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing more tests

* Tpetra: Fixing tests

* Tpetra: Fixing tests

* Tpetra: Fixing tests

* tpetra: copied implementation of getLocalViewHost and getLocalViewDevice
from templated getLocalView, as the getLocalView version does not work.
This commit may be temporary, but it allows us to make progress on other
bugs while someone figures out the template-fu.
Sorry for the debugging statements; we'll get rid of those eventually.

* adding localview tests

* tpetra:  getLocalView<template> now works.
cleaned up my obnoxious print statements
kept Host and Device implementations that do NOT use getLocalView.

* tpetra:  added Tpetra::Access to many getLocalView<> instances
Tests still pass with UVM=ON.

* Tpetra: Removing the dreaded parentheses from the Access tags

* Manually intercept UVM allocations, throw exception

Effectively makes it impossible for any UVM allocations to
exist (except for Stokhos, which calls cudaMallocManaged directly)

* Tpetra: Deprecate old getLocalView functions

* Allow UVM allocations when Kokkos_ENABLE_CUDA_UVM=ON

* tpetra:  changed getLocalView to use access tags and getLocalViewDevice

* tpetra:  added access tags to getLocalView(); fixed scope of some pointers

* xpetra:  fixes to allow compilation

* WIP: deprecate getLocalBlock and start adding tagged overloads

* Tpetra: rewrite allReduceView to work with non-UVM

allReduceView had one bug and one sub-optimal thing:
- Tried to make a view copy with both layout and device different -
  Kokkos can't do that in a single deep_copy
- If a LayoutStride -> contiguous copy needed to be made, it always used
  LayoutLeft. If one of the input/output views was LayoutStride and the
  other was LayoutRight, they would both be copied to LayoutLeft. Now, use
  LayoutRight in this case.

Some utilities to help manage layouts and MPI + Kokkos views in general
are in the new file temporaryViewUtils.hpp: layout unification,
making a contiguous view, and making an MPI-safe view.
In the future these can be used to clean up idot and
iallreduce without losing efficiency.

* Tpetra:  Block MultiVector correctly uses getLocalView; removed stored pointer

* fix host device type for const_little_host_vec_type

* tpetra:  clean up of BlockMultiVector fixes

* Tpetra:  deprecated held pointer mvData_

* tpetra:  removed modifies without syncs; fixed MueLu tests

* Tpetra - removing sync in ScaleAndAssign test

* Tpetra - unit test is okay without modify and sync flags

* Tpetra - test passes without modify and sync operations

* Tpetra - remove unnecessary sync modify clear state flags

* Tpetra - remove multi vector sync/modify/ things

* Tpetra - remove sync modify things in other places

* Tpetra: remove withLocalAccess, for_each, transform

The new MV::getLocalView interface is a simpler substitute for these.

* Issue 8391. Switched to C++17 standard for GCC 8.3 build.

* FROSch: Convert enum NullSpaceType to scoped enum

By converting the enum to an enum class NullSpaceType, one is forced to
use the enum class and cannot replace it with integers anymore. This
guarantees that the expressive enum class is used in implementations
rather than the implicitly encoded integers.

* Patch in KokkosKernels trilinos#872

(fix trilinos#8727, TeamPolicy team size too large in sort_crs_*)
Adds the KokkosKernels unit test that replicated this issue.

* MueLu: Adding Aggregate size percentiles to AggregateQuality

* Moved Tpetra CRS GS into Ifpack2 Relaxation

* Moved BlockCrs GS functionality into Relaxation

* Enabled new local GS code for CRS

* Reduce redundant code in CRS (GS/SGS use same fn)

* Using refactored block CRS local apply, unify GS/SGS

* More refactoring to get rid of redundant functions

* Added required syncs/modifies for vectors

* Removed unneeded !constantStride paths

* Use cached MV to replace getColumnMapMV from CrsMatrix

* Ifpack2: remove unneeded includes

* Ifpack2: undo some find-and-replace in comments

Undoing some "Node" -> "node_type"

* MueLu: undo CMake change, should be its own PR

* MueLu: in configure, print out missing ETI setting

During configure, MueLu prints out the type combinations to ETI.
Add <complex, int, long long> to this, since it was missing.

* tpetra:  treat WriteOnly of subviews as ReadOnly.

* Ifpack2: in RBILUK, use tagged BMV::getLocalBlock

* Tpetra: add comment with caveat

on BMV::getLocalBlock(i, j, WriteOnly)

* tpetra: separated BugTests.cpp into separate test files so that we can
disable them separately (since they exercise different classes).

* Ifpack2: update BMV getLocalBlock calls

to use tagged access, and not use manual sync/modify (which has been
removed). With UVM, all Tpetra,Belos,Ifpack2,MueLu tests pass.

* more test changes

* mv localview tests

* wrapped up 6 tests for new behaviors

* tpetra:  scoping fix for Bug7234.cpp;
more output from getLocalView* when error occurs, as in parallel runs,
throw messages weren't always printed (e.g., from doExport when only
3/4 processors failed)

* Tpetra: add MV::aliases(const MV& other)

This allows a user to see if two MVs overlap, without actually getting
the local views and possibly hitting the reference count checker.

* Ifpack2: const correctness, use new getLocalView

- Throughout Ifpack2, remove manual sync/modify and calls to deprecated
  getLocalView. Use tagged getLocalView instead.
- In BlockRelaxation and the Containers, change interfaces to use const
  on views and multivectors that aren't actually modified

* Tpetra: fix one MV LocalView test, comment out another

We will make sure fix is OK, then uncomment and fix the other

* tpetra:  enable some Tpetra tests without UVM

* tpetra:  fix test for non-Cuda builds

* Ifpack2: fix more constness of apply vectors

* Kokkos: allow CudaUVMSpace::allocate again

Roll back change that made CudaUVMSpace::allocate throw
when UVM was not the default memory space for Cuda.

* tpetra:  changes needed to build with DEPRECATED_CODE=OFF trilinos#8821

* fix remaining test

* Tpetra - fix for nox failure

* Thyra: added missing fences to euclidean apply operations used
in MvTimesMatAddMv; the fences resolve test failures with
CUDA_LAUNCH_BLOCKING=0 and cleaner sync/modify in tpetra @rppawlo

Tpetra: the fences above provide a more surgical fix to the test
errors seen in trilinos#8821; this commit removes fences from
getLocalView*(ReadOnly).  @kyungjoo-kim

Belos: preventive fence added with @hkthorn's blessing
to mimic those in Thyra.

* tpetra: added fence between device kernels and retrieving blocks on host trilinos#8821

* Ifpack2: Minor fix

* DualView: make fencing behavior in sync consistent

sync<Device>() does extra exec space fences if the dev/host memory
spaces are the same. This was missing in sync_host/sync_device, so
this adds it there. Makes all Ifpack2 tests pass for UVM without launch
blocking.

* tpetra:  exercise the Teuchos-based interfaces, too

* changed access control from WriteOnly to OverwriteAll because semantics mean things

* WIP: fixing idot for MV dualview refactor

And some updates to ifpack2 and amesos2 about that.
Working around Kokkos issue trilinos#3850 where the templated getLocalView was
used.

* WIP: idot/iallreduce cleanup

* Tpetra: finish idot/iallreduce refactor

* Fixed iallreduce test for non-uvm device

* Belos: use new Tpetra MV view interface

* Cleanup

* Remove extra dualview sync fences

* Ifpack2 passes without launch blocking

except RBILUK.

* Ifpack2: add temporary fence in RBILUK for BlockCrs

Later it should be possible to replace this fence with a refactored
DualView interface to BlockCrs.

* Tpetra: add a global reduce to a test so it will fail when only one proc is failing

* Tpetra: fix some typos in a Map unit test

* Tpetra: remove deprecated sync/modify calls from a unit test

* Ifpack2: fix impl_scalar/scalar mismatch

* Tpetra: remove/update remaining mentions of Gauss-Seidel

* Tpetra: fix iallreduce for builds without MPI

* Ifpack2: revert commenting out try/catch

Was causing unused var warning

* Ifpack2: Fixing vector mode mistake

* tpetra, ifpack2:  fixing several access mode errors

* Tpetra: use new MV view interface in Bug8794 test

* Amesos2: revert using tagged Tpetra MV getLocalView

For some reason, using the ReadOnly tag to access the MV view in
TpetraMultivecAdapter caused the solve solution to not get copied back to
the Tpetra multivector. This is surprising because the views were just
used as the source for a Kokkos deep copy, and this caused
BlockRelaxation in Ifpack2 to fail for serial node (in which DualViews
are trivial, and all kernels are synchronous)

* Ifpack2: add back tag clobbered by merge

* kokkos:  patch from kokkos/kokkos#3857

* comment out all the instances of TPETRA_DEPRECATED (trilinos#9023)

* MueLu: add fence for recent intrepid2 changes

Fixes MueLu-Intrepid2 unit tests, uvm, no launch blocking.

* Tpetra: restore MV_reduce_strided test.

Key: use the MV (map, dualview, orig_dualview) constructor instead of the
(map, dualview) constructor. If $dualview is noncontiguous, the first one
lets you pass orig_dualview as the contiguous super-view containing
dualview, and orig_dualview can be sync'd without problems.

Also modify TempView::toLayout() to test span_is_contiguous, rather than
assuming that (Layout != LayoutStride) implies contiguous.
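
A minimal sketch of why the layout tag alone is not enough, assuming a simplified column-major view model. `ColMajorView2D` and `row_range` are hypothetical stand-ins written for illustration; they are not Kokkos or Tpetra types, and only the idea of comparing span against size is taken from the note above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a rank-2, column-major (LayoutLeft-like) view.
// It is NOT a Kokkos type; it only records extents and the column stride.
struct ColMajorView2D {
  std::size_t rows;        // number of rows (extent 0)
  std::size_t cols;        // number of columns (extent 1)
  std::size_t col_stride;  // elements between the starts of adjacent columns

  // Number of logical entries in the view.
  std::size_t size() const { return rows * cols; }

  // Number of memory slots from the first entry to one past the last entry.
  std::size_t span() const {
    if (rows == 0 || cols == 0) return 0;
    return (cols - 1) * col_stride + rows;
  }

  // Contiguous means the entries fill their span with no gaps.
  bool span_is_contiguous() const { return span() == size(); }
};

// Select rows [row_begin, row_end) of every column. The result is still
// column-major, but it inherits the parent's column stride, so gaps appear
// between columns whenever fewer rows are kept than the parent holds.
ColMajorView2D row_range(const ColMajorView2D& parent,
                         std::size_t row_begin, std::size_t row_end) {
  assert(row_begin <= row_end && row_end <= parent.rows);
  return ColMajorView2D{row_end - row_begin, parent.cols, parent.col_stride};
}
```

In this model, an 8x3 parent with column stride 8 is contiguous, while its rows 2 through 5 form a view that is still column-major but leaves gaps between columns, so a check based on the layout tag would wrongly treat it as contiguous.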

* tpetra:  Removed deprecated sync_device calls

* Tpetra: Remove some MultiVector that were checking modification state (trilinos#9032)

* Tpetra: Deprecate need_sync* in MultiVector

* Tpetra: for now, we won't deprecate need_sync_host/device

* tpetra:  removed instantiations of removed tests

* Tpetra: don't use CudaSpace in nonblocking collectives

OpenMPI does not support Cuda device buffers for nonblocking collectives
like MPI_Iallreduce, even with a Cuda-aware installation.

* Fix old typo in Ifpack2_UnitTestBlockRelaxation

* Fix access tag: OverwriteAll -> ReadWrite

Tpetra::COPY takes src then dst (the opposite order to Kokkos deep_copy), so Y_cur is read first and written later.

* Undo bad DualView merge

Co-authored-by: Brian Kelley <[email protected]>
Co-authored-by: Kyungjoo Kim <[email protected]>
Co-authored-by: Chris Siefert <[email protected]>
Co-authored-by: Geoff Danielson <[email protected]>
Co-authored-by: Timothy A. Smith <[email protected]>
Co-authored-by: James M. Willenbring <[email protected]>
Co-authored-by: Matthias Mayr <[email protected]>
Co-authored-by: Timothy Smith <[email protected]>