Code gets stuck within Tpetra's doImport #1752

Closed
searhein opened this issue Sep 19, 2017 · 71 comments

Comments

@searhein
Contributor

searhein commented Sep 19, 2017

I am implementing a new subpackage for ShyLU, which I have not yet pushed to the repository. Since last week, after pulling from the develop branch of the repository, my code gets stuck within the doImport() method of a Tpetra::CrsMatrix, which I call through Xpetra.
The code does not crash, it just gets completely stuck within the import.

I was able to narrow down the commit that introduced this issue to the range between 8011004 and ce22a7e, since exactly the same code runs perfectly on commit 8011004 and gets stuck on commit ce22a7e.

@aprokop
Contributor

aprokop commented Sep 19, 2017

@trilinos/tpetra @trilinos/shylu @trilinos/xpetra

@mhoemmen
Contributor

@searhein Could you please specify your build options? Commit 8011004 mainly relates to CUDA builds, though it does have effects for non-CUDA builds. It's possible that in a CUDA build, the changes make users' incorrect use of UVM without intervening fences more likely to cause issues.

@mhoemmen
Contributor

@searhein Also, I would be happy to test out your code if you wouldn't mind sharing it.

@searhein
Contributor Author

@mhoemmen My configuration for Trilinos is

cmake \
    -D CMAKE_BUILD_TYPE:STRING=RELEASE \
    -D CMAKE_INSTALL_PREFIX:STRING="$INSTALL_DIR" \
    -D CMAKE_C_FLAGS:STRING="-DH5_HAVE_PARALLEL" \
    -D CMAKE_CXX_FLAGS:STRING="-DH5_HAVE_PARALLEL" \
    -D PYTHON_EXECUTABLE:STRING="/usr/bin/python" \
    -D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
    -D Trilinos_ASSERT_MISSING_PACKAGES:BOOL=ON \
    -D Trilinos_ENABLE_Fortran:BOOL=ON \
    -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
    -D Trilinos_ENABLE_Anasazi:BOOL=OFF \
    -D Trilinos_ENABLE_Amesos:BOOL=ON \
    -D Trilinos_ENABLE_Amesos2:BOOL=ON \
    -D Trilinos_ENABLE_AztecOO:BOOL=OFF \
    -D Trilinos_ENABLE_Belos:BOOL=ON \
    -D Trilinos_ENABLE_Epetra:BOOL=ON \
    -D Trilinos_ENABLE_EpetraExt:BOOL=ON \
    -D Trilinos_ENABLE_Galeri:BOOL=ON \
    -D Trilinos_ENABLE_Ifpack:BOOL=ON \
    -D Trilinos_ENABLE_Ifpack2:BOOL=OFF \
    -D Trilinos_ENABLE_Isorropia:BOOL=OFF \
    -D Trilinos_ENABLE_ML:BOOL=ON \
    -D Trilinos_ENABLE_MueLu:BOOL=ON \
    -D Trilinos_ENABLE_Pamgen:BOOL=OFF \
    -D Trilinos_ENABLE_Sacado:BOOL=OFF \
    -D Trilinos_ENABLE_Shards:BOOL=OFF \
    -D Trilinos_ENABLE_ShyLU:BOOL=ON \
    -D Trilinos_ENABLE_ShyLU_DDOS:BOOL=ON \
    -D Trilinos_ENABLE_ShyLU_DDOSXpetra:BOOL=ON \
    -D Trilinos_ENABLE_ShyLU_DDFROSch:BOOL=ON \
    -D ShyLU_DD_ENABLE_TESTS:BOOL=ON \
    -D Trilinos_ENABLE_Stratimikos:BOOL=OFF \
    -D Trilinos_ENABLE_Teuchos:BOOL=ON \
    -D Trilinos_ENABLE_Teko:BOOL=OFF \
    -D Trilinos_ENABLE_Thyra:BOOL=OFF \
    -D Trilinos_ENABLE_Tpetra:BOOL=ON \
    -D Trilinos_ENABLE_Xpetra:BOOL=ON \
    -D Trilinos_ENABLE_Zoltan:BOOL=ON \
    -D Trilinos_ENABLE_Zoltan2:BOOL=ON \
    -D TPL_ENABLE_MPI:BOOL=ON \
    -D ML_ENABLE_METIS:BOOL=ON \
    -D ML_ENABLE_ParMETIS:BOOL=ON \
    -D ML_ENABLE_TESTS:BOOL=ON \
    -D TPL_ENABLE_METIS:BOOL=ON \
    -D METIS_LIBRARY_DIRS:PATH=$PARMETIS/lib \
    -D METIS_INCLUDE_DIRS:PATH=$PARMETIS/include \
    -D TPL_ENABLE_ParMETIS:BOOL=ON \
    -D ParMETIS_LIBRARY_DIRS:PATH=$PARMETIS/lib  \
    -D ParMETIS_INCLUDE_DIRS:PATH=$PARMETIS/include \
    -D TPL_ENABLE_BLAS:BOOL=ON \
    -D TPL_BLAS_LIBRARIES:STRING='-framework Accelerate' \
    -D TPL_ENABLE_LAPACK:BOOL=ON \
    -D TPL_LAPACK_LIBRARIES:STRING='-framework Accelerate' \
    -D TPL_ENABLE_UMFPACK:BOOL=ON \
    -D TPL_UMFPACK_LIBRARIES:STRING="$UMFPACK/lib/libumfpack.a;$UMFPACK/lib/libamd.a;" \
    -D UMFPACK_INCLUDE_DIRS:PATH=$UMFPACK/include/ \
    -D TPL_ENABLE_Boost:BOOL=ON \
    -D TPL_Boost_INCLUDE_DIRS:PATH=$BOOST/include \
    -D EpetraExt_USING_HDF5:BOOL=ON \
    -D TPL_ENABLE_HDF5:BOOL=ON \
    -D HDF5_LIBRARY_DIRS:PATH=$HDF5/lib \
    -D HDF5_INCLUDE_DIRS:PATH=$HDF5/include \
    -D TPL_ENABLE_MUMPS:BOOL=ON \
    -D Amesos_ENABLE_MUMPS:BOOL=ON \
    -D Amesos2_ENABLE_MUMPS:BOOL=OFF \
    -D HAVE_AMESOS_MPI_C2F:BOOL=ON \
    -D MUMPS_INCLUDE_DIRS:FILEPATH="$MUMPS/include" \
    -D MUMPS_LIBRARY_DIRS:FILEPATH="$MUMPS/lib" \
    -D MUMPS_LIBRARY_NAMES:STRING="dmumps;pord" \
    -D Amesos_ENABLE_SCALAPACK:BOOL=ON \
    -D SCALAPACK_INCLUDE_DIRS:FILEPATH="$SCALAPACK/SRC" \
    -D SCALAPACK_LIBRARY_DIRS:FILEPATH="$SCALAPACK" \
    -D SCALAPACK_LIBRARY_NAMES:STRING="scalapack" \
    -D Amesos_ENABLE_BLACS:BOOL=ON \
    -D BLACS_INCLUDE_DIRS:FILEPATH="$BLACS/SRC/MPI" \
    -D BLACS_LIBRARY_DIRS:FILEPATH="$BLACS/LIB" \
    -D BLACS_LIBRARY_NAMES:STRING="blacsCinit_MPI-MACOS-0.a;blacsF77init_MPI-MACOS-0.a;blacs_MPI-MACOS-0.a" \
    -D Trilinos_EXTRA_LINK_FLAGS:STRING="/opt/local/lib/mpich-gcc48/libmpifort.a" \
    $BASE_DIR

I am compiling on my Macbook.

On commit 8011004, my code still runs through perfectly. For commit ce22a7e, it doesn't work anymore.

Hopefully, I can push my code to the repository today, and you will find it in the sub package ShyLU/FROSch. However, I will disable the Tpetra test in which the issue occurs.

@mhoemmen
Contributor

@searhein I can't speak for the ShyLU management, but I would be much happier if you were to submit your code in the form of a pull request, rather than just pushing it. I would be happy to try it out, and would be even happier if we could fix the issue (mine, yours, or both) before the code actually gets added to the repo.

Just curious -- have you tried this on other platforms, and do you encounter the same issue there?

@srajama1
Contributor

srajama1 commented Sep 19, 2017

This is a new experimental package that is being added. It works fine with Epetra, which will be enabled by default. The Tpetra portion is hanging, so it needs to be disabled by default until the problem is fixed in either ShyLU or Tpetra. A PR is good, but not absolutely needed.

@srajama1
Contributor

On second thought, a PR will be better, as you are changing not only ShyLU but also the Ifpack2 and Amesos2 dependencies.

@mhoemmen
Contributor

@searhein It's totally up to you and the ShyLU team how the code shows up, but I would be happy to review it :-) .

@ndellingwood
Contributor

If the code is submitted as a PR, @mhoemmen can pull the PR and try it out before it is committed (he mentioned interest in testing in an earlier message). Here are the git commands, for reference, for anyone who hasn't used them before:

For example, say @searhein submits the commit as PR #xyz

git fetch origin pull/xyz/head:pr-xyz
git checkout pr-xyz

This assumes origin is a remote for the Trilinos repo that the PR was submitted to, not the user's fork, and a local branch is created from the PR called pr-xyz.

@searhein
Contributor Author

Ok, I will do a PR then.

@searhein
Contributor Author

I have just submitted my pull request #1759. Unfortunately, it is rather large.

@mhoemmen
Contributor

@searhein It would be helpful if you could summarize for me what files relate to Tpetra. Thanks!

@searhein
Contributor Author

@mhoemmen In fact, Tpetra is used through Xpetra in almost the whole code. However, when running the test in
/packages/shylu/shylu_dd/frosch/test/TestLaplacian_Tpetra
e.g., with
mpirun -n 8 ./ShyLU_DDFROSch_frosch_laplacian_tpetra.exe
the code gets stuck at line 90 of the file
packages/shylu/shylu_dd/frosch/src/Tools/FROSch_Tools_def.hpp
i.e., while performing
matrix->doImport(*tmpMatrix,*gather,Xpetra::ADD);

Thanks for working on fixing that issue.

@srajama1
Contributor

FYI: This change is going into ShyLU this week with the Tpetra options disabled. That said, it would help a lot to resolve this issue soon, as we are planning application integration within the next few weeks.

@kddevin
Contributor

kddevin commented Sep 25, 2017

@mhoemmen @tjfulle do either of you have time this week to look into this issue with @searhein?
@searhein would it be possible for you to create a stripped down test that does your import (e.g., with the same maps, matrices, etc.) without all the rest of the ShyLU details?

@mhoemmen
Contributor

@kddevin If @searhein has time to work on a stripped-down test, I'll be happy to help. It would also be nice to see the results of running in a debug build with the CMake options Teuchos_ENABLE_DEBUG:BOOL=ON and Kokkos_ENABLE_DEBUG:BOOL=ON both set.

@tjfulle
Contributor

tjfulle commented Sep 25, 2017

@kddevin, I can help too.

@searhein
Contributor Author

@kddevin @mhoemmen @tjfulle I just pushed another test
packages/shylu/shylu_dd/frosch/test/TestImport_Tpetra
to my PR which also gets stuck with this issue. Does this help?

@mhoemmen
Contributor

@searhein Yes, thank you. Please be a bit patient as we are a bit overwhelmed these days.

@mhoemmen
Contributor

mhoemmen commented Sep 28, 2017

@searhein I'm looking at the test you added:

https://github.com/searhein/Trilinos/blob/25728b63a8e1ef8237ac6df37e19da45984f61a0/packages/shylu/shylu_dd/frosch/test/TestImport_Tpetra/main.cpp

and noticed the following issues:

  1. The test mixes Epetra and Tpetra objects, and uses them through Xpetra. I'm not sure if we could characterize this solely as a "Tpetra issue." Does Xpetra test simultaneous use of Epetra and Tpetra? Furthermore, Galeri has an Xpetra interface, so you should not need to copy from Epetra to Xpetra objects. The MueLu developers have examples that use Galeri to create Xpetra objects directly.

  2. I'm looking at line 127:

 K = Xpetra::MatrixFactory<SC,LO,GO,NO>::Build(overlappingMap,2*tmpMatrix->getGlobalMaxNumRowEntries());

The overlapping Map comes from the following function call on line 124:

FROSch::ExtendOverlapByOneLayer<SC,LO,GO,NO>(K,overlappingMap);

From the name of this function, it looks like this computes the Map for the algebraic equivalent of a single level of mesh overlap. In that case, where does the constant 2*tmpMatrix->getGlobalMaxNumRowEntries() come from? That argument of Build is supposed to be an upper bound on the number of entries stored in any row on the calling process. See the Xpetra::MatrixFactory header file here:

https://github.com/trilinos/Trilinos/blob/f0350316239aaf41e8e6b82378612aaedf9cc72c/packages/xpetra/sup/Matrix/Xpetra_MatrixFactory.hpp

It's a sparse matrix, right? Suppose that the matrix is N x N, and you have P processes. That constant is 2N. You've just asked Xpetra to create a matrix that has 2N^2/P space on each process. The reason this might not cause trouble until the Export operation is that Tpetra::CrsMatrix currently allocates lazily: it doesn't actually allocate until you insert entries. Insertion in this case happens at the Export.

This might not be the cause of the issue you're observing, because the matrix doesn't look very big. However, could you please try this again, but use zero (0) as the second argument of MatrixFactory::Build?
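
To put numbers on the 2N^2/P estimate above, here is a minimal standalone C++ sketch. It does not call Xpetra or Tpetra. The function name reservedEntriesPerProcess and the assumption that rows are split evenly across processes are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Number of entries reserved on one process when every locally owned row is
// given room for `maxEntriesPerRow` entries. Assumption of this sketch: rows
// are split evenly, so a process owns at most ceil(numGlobalRows / numProcs).
std::uint64_t reservedEntriesPerProcess(std::uint64_t numGlobalRows,
                                        std::uint64_t numProcs,
                                        std::uint64_t maxEntriesPerRow) {
  assert(numProcs > 0);
  // Ceiling division: the largest number of rows any process can own.
  const std::uint64_t localRows = (numGlobalRows + numProcs - 1) / numProcs;
  return localRows * maxEntriesPerRow;
}
```

For N = 1,000,000 rows on P = 8 processes, a per-row bound of 2N reserves 250,000,000,000 entries per process, while a per-row bound of 14 reserves 1,750,000. Which of the two applies depends on what the second argument of Build actually evaluates to.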

@mhoemmen
Contributor

@searhein Another interesting part of the test you wrote is that you construct an Export and use it in reverse mode, in doImport. Xpetra folks, that should be OK, right? @trilinos/xpetra

@mhoemmen
Contributor

@searhein Could you please point me to the implementation of FROSch::ExtendOverlapByOneLayer? I'm hoping that this does what, say, Ifpack2 or Ifpack do when they implement overlap for additive Schwarz.

@srajama1
Contributor

If the root cause is hard to track down, is rolling back the commit that caused the issue an option? @searhein has identified a commit that works and another that doesn't.

@mhoemmen
Contributor

It would be nice to see the implementation of ExtendOverlapByOneLayer. The Tpetra commits in question pass all tests with the check-in test script, all the way downstream. This includes the Ifpack2 tests, which should exercise overlapping additive Schwarz.

@searhein
Contributor Author

@mhoemmen Let me first comment on the two issues you mentioned:

  1. Since I am more familiar with Epetra and since I wanted to obtain exactly the same results as in the Epetra test, I used the Epetra maps and matrices from Galeri and converted them to Tpetra in the test. However, afterwards I solely use Tpetra objects.
  2. You are right, FROSch::ExtendOverlapByOneLayer computes the Map for the algebraic equivalent of a single level of mesh overlap. Therefore, it should do more or less the same as what is done in Ifpack or Ifpack2. I am sorry, but I don't understand your point with respect to the second argument of MatrixFactory::Build. Let me first say that using 0 as the second argument did not resolve the issue. Also, 2*tmpMatrix->getGlobalMaxNumRowEntries() should clearly be an upper bound for the number of nonzeros per row in the matrix. The matrix is sparse, and it is basically the discretisation of a finite difference scheme. In the documentation https://trilinos.org/docs/dev/packages/xpetra/doc/html/classXpetra_1_1CrsMatrix.html, I found that getGlobalMaxNumRowEntries()

Returns the maximum number of entries across all rows/columns on all nodes.

The function call should therefore return something like 7. For the special case in this example even tmpMatrix->getGlobalMaxNumRowEntries() should be a sufficient maxNumEntriesPerRow.
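
As an illustration of why a value like 7 is plausible, the sketch below counts row entries for a 7-point finite-difference Laplacian on an n x n x n grid. The 3D grid and the helper names stencilRowEntries and maxRowEntries are assumptions made for this example. The test in the PR may use a different stencil, and nothing here calls Xpetra.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of nonzeros in the matrix row of grid point (i, j, k): the diagonal
// plus one entry for each of the six axis neighbors that lies inside the grid.
std::size_t stencilRowEntries(std::size_t n, std::size_t i, std::size_t j,
                              std::size_t k) {
  assert(i < n && j < n && k < n);
  std::size_t count = 1;  // diagonal entry
  if (i > 0) ++count;
  if (i + 1 < n) ++count;
  if (j > 0) ++count;
  if (j + 1 < n) ++count;
  if (k > 0) ++count;
  if (k + 1 < n) ++count;
  return count;
}

// Maximum row length over all rows of the assembled matrix. This is the kind
// of quantity that the quoted documentation describes.
std::size_t maxRowEntries(std::size_t n) {
  std::size_t best = 0;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        best = std::max(best, stencilRowEntries(n, i, j, k));
  return best;
}
```

For any n of at least 3, interior points have six neighbors, so maxRowEntries returns 7 regardless of the grid size. The bound is independent of N.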

The implementation of ExtendOverlapByOneLayer can be found in the file
packages/shylu/shylu_dd/frosch/src/Tools/FROSch_Tools_def.hpp.

mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Sep 29, 2017
@trilinos/tpetra I added a test for CrsMatrix doExport that exercises
use cases likely to be found in the following Albany issue:

https://github.com/gahansen/Albany/issues/182

This test may also be relevant to trilinos#1752.

The test does a doExport from a fill-complete CrsMatrix with
overlapping row Map, to the following cases of CrsMatrix:

  - DynamicProfile, locally indexed
  - StaticProfile, locally indexed
  - Constant fill-complete CrsGraph, locally indexed

I specifically wrote the test so that the target matrix gets some
entries in each row from one process, other entries in the same row
from another process, and other entries from both processes (thus
exercising the Tpetra::ADD CombineMode fully).

The test passes with both Kokkos::Serial and Kokkos::Cuda back-ends.
The CUDA build uses CUDA 8.0.44 and GCC 5.3.0, with a Kepler GPU and
Intel host CPU.  I'll need to try it with a Pascal GPU and IBM POWER8
host CPU to exercise the Albany issue mentioned above.  @ikalash
kindly sent me the configuration for testing on that platform.

See the test documentation (comments in the test .cpp file itself) to
see why the test does not yet exercise the case of a globally indexed
target.
@mhoemmen
Contributor

mhoemmen commented Oct 2, 2017

Hi @searhein ! I took a look at the implementation of ExtendOverlapByOneLayer, in the file you mentioned. I noticed that line 271 appears to have a bug:

https://github.com/searhein/Trilinos/blob/25728b63a8e1ef8237ac6df37e19da45984f61a0/packages/shylu/shylu_dd/frosch/src/Tools/FROSch_Tools_def.hpp#L271

This line calls fillComplete with no arguments, on the overlapping matrix. This is incorrect. In both Epetra and Tpetra, if the row Map is not the same as the domain and range Maps, then you must pass the domain and range Maps as arguments to FillComplete resp. fillComplete. Otherwise, the matrix will use the row Map as the domain and range Maps. This is wrong, because the domain and range Maps must always be one to one.
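
The one-to-one requirement can be checked on a toy model. In the sketch below, a map is modeled as the list of global indices held by each process. The names ProcessIndexLists, isOneToOne, and requireOneToOne are hypothetical and are not part of the Epetra or Tpetra API.

```cpp
#include <cassert>
#include <set>
#include <vector>

using GlobalIndex = long long;
// Toy model of a map: element r lists the global indices held by process r.
using ProcessIndexLists = std::vector<std::vector<GlobalIndex>>;

// A map is one to one if no global index appears more than once, that is,
// every index is held by exactly one process.
bool isOneToOne(const ProcessIndexLists& map) {
  std::set<GlobalIndex> seen;
  for (const auto& localIndices : map) {
    for (GlobalIndex gid : localIndices) {
      if (!seen.insert(gid).second) return false;  // index already held
    }
  }
  return true;
}

// Stand-in for the rule stated above: a map used as a domain or range map
// must be one to one.
void requireOneToOne(const ProcessIndexLists& map) {
  assert(isOneToOne(map));
}
```

An overlapping row map such as {{0,1,2,3},{2,3,4,5},{4,5}} fails this check because indices 2 through 5 are each held by two processes. That is why such a map cannot double as the domain or range map.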

Besides this correctness issue, the function for computing the overlapping matrix could be a lot simpler. You do not actually need to read the graph or matrix entries, as long as the graph or matrix has a column Map and Import object. For a Tpetra example using a CrsGraph, please refer to Trilinos/packages/ifpack2/src/Ifpack2_CreateOverlapGraph.hpp. There is a somewhat less elegant efficient CrsMatrix example in the constructor of Ifpack2::OverlappingRowMatrix, in Trilinos/packages/ifpack2/src/Ifpack2_OverlappingRowMatrix_def.hpp (search for "The big import loop"). Epetra has comparable examples in the Ifpack package. For an Epetra_CrsGraph example, see Trilinos/packages/ifpack/src/Ifpack_OverlapGraph.cpp.

There are other inefficiencies in this file as well. For example, BuildUniqueMap should just use CreateOneToOne with Epetra, and createOneToOne with Tpetra. If Xpetra lacks an interface to these functions, then it would make sense to write one, rather than trying to work around it. Your approach in this function will have memory efficiency problems if the input map is sparse, that is, if it has many fewer indices than the linearMap. Also, linearMap needs to have the same index base as map.
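
Conceptually, turning an overlapping map into a one-to-one map means picking a single owner for every shared index. The sketch below is a standalone model of that idea, not the implementation of CreateOneToOne or createOneToOne. The name makeOneToOne and the rule that the lowest-ranked process wins are assumptions of this example, and the real functions may break ties differently. Its storage grows with the number of indices actually present rather than with the size of a dense linear map.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using GlobalIndex = long long;
// Toy model of a map: element r lists the global indices held by process r.
using ProcessIndexLists = std::vector<std::vector<GlobalIndex>>;

// Returns a one-to-one version of `overlapping`: each global index is kept
// only on the lowest-ranked process that holds it (assumed tie-breaking rule).
ProcessIndexLists makeOneToOne(const ProcessIndexLists& overlapping) {
  std::set<GlobalIndex> claimed;  // indices that already have an owner
  ProcessIndexLists unique(overlapping.size());
  for (std::size_t rank = 0; rank < overlapping.size(); ++rank) {
    for (GlobalIndex gid : overlapping[rank]) {
      // insert() succeeds only the first time an index is seen.
      if (claimed.insert(gid).second) unique[rank].push_back(gid);
    }
  }
  // Sanity check: every distinct index was kept exactly once.
  std::size_t kept = 0;
  for (const auto& indices : unique) kept += indices.size();
  assert(kept == claimed.size());
  return unique;
}
```

With the input {{0,1,2},{2,3,4},{4,5}}, the result is {{0,1,2},{3,4},{5}}.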

I wish I had more free time to help you out here. We really do appreciate contributions like this. I don't mean to put you down; I really just want to make sure that both your code and my code are doing the right thing :-) .

@searhein
Contributor Author

@kddevin @tjfulle @mhoemmen
I was now able to work around the issue in my code by introducing additional exporters and importers in the parts of the code where the reverse mode hung; see #1834.
The code runs through now and gives me the correct result. However, in order to reduce communication cost, it would be nice if I could remove those additional exporters and importers as soon as the issue is fixed. Thanks!

@tjfulle
Contributor

tjfulle commented Oct 10, 2017

@searhein, I am running the checkin script right now and should have a PR open very soon with a fix.

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Oct 10, 2017
This commit fixes a bug found while investigating trilinos#1752.  It also adds a new
test that should prevent a regression in the future.
@mhoemmen
Contributor

@searhein I'm curious why you need reverse mode; would you mind explaining?

@mhoemmen
Contributor

@tjfulle I'm reviewing your PR right now. Thanks!

@searhein
Contributor Author

@mhoemmen I need to communicate in both ways with the same maps. However, I would like to create the Import/Export object only once.
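
The idea of reusing one communication plan in both directions can be sketched without MPI. The model below flattens everything into a single address space. PlanEntry, importForward, and importReverseAdd are hypothetical names for this illustration, not Tpetra or Xpetra calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry of a communication plan: element `ownedIndex` of the one-to-one
// (owned) vector corresponds to element `overlapIndex` of the overlapping one.
struct PlanEntry {
  std::size_t ownedIndex;
  std::size_t overlapIndex;
};
using Plan = std::vector<PlanEntry>;

// Forward mode: copy owned values into the overlapping layout.
void importForward(const Plan& plan, const std::vector<double>& owned,
                   std::vector<double>& overlap) {
  for (const PlanEntry& e : plan) {
    assert(e.ownedIndex < owned.size() && e.overlapIndex < overlap.size());
    overlap[e.overlapIndex] = owned[e.ownedIndex];
  }
}

// Reverse mode: use the same plan in the opposite direction, summing the
// overlapping values into their owners (ADD semantics).
void importReverseAdd(const Plan& plan, const std::vector<double>& overlap,
                      std::vector<double>& owned) {
  for (const PlanEntry& e : plan) {
    assert(e.ownedIndex < owned.size() && e.overlapIndex < overlap.size());
    owned[e.ownedIndex] += overlap[e.overlapIndex];
  }
}
```

The plan is built once and passed to both functions, which is the saving described above. In a distributed setting, each direction additionally involves its own sends and receives, and that is where a hang like the one in this issue can occur.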

@tjfulle
Contributor

tjfulle commented Oct 10, 2017

@mhoemmen, the only PR I have open is for a different issue. It just adds a test for pack/unpack. I'm still running the checkin script for this issue. A PR is imminent.

@mhoemmen
Contributor

@tjfulle ah sorry, that was the PR coming from the check-in test script, in your fork :) .

@mhoemmen
Contributor

@searhein Thanks for the explanation!

@tjfulle
Contributor

tjfulle commented Oct 10, 2017

@mhoemmen, I pushed to that branch as an intermediate, I am now running the full checkin script on my blade. But, time spent looking at the commit is not lost, since I'll open a PR based on it soon.

There is only a one-line code change, but I updated a lot of the debug statements to be more informative, so it appears there are more changes than there really are.

But, PR #1769 is still open :)

@mhoemmen
Contributor

@tjfulle I'm working on #1769 now :) .

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Oct 10, 2017
This commit fixes a bug found while investigating trilinos#1752.  It also adds a new
test that should prevent a regression in the future.

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1536,notpassed=0 (110.84 min)
@tjfulle
Contributor

tjfulle commented Oct 10, 2017

@mhoemmen, PR #1838 fixes the hang for the test case I added. It also fixes the hang in @kddevin's test she posted here.

@searhein, could you check your code against PR #1838 to see if it fixes the hang you are experiencing?

searhein added a commit that referenced this issue Oct 10, 2017
Workaround for Tpetra issue #1752. Enabling Tpetra test
@mhoemmen
Contributor

@tjfulle I hereby approve #1838. Please feel free to push if it builds and passes tests for you. Don't worry about the CUDA build for now; this should be independent of that. Thanks! :-D

tjfulle added a commit that referenced this issue Oct 10, 2017
@tjfulle
Contributor

tjfulle commented Oct 10, 2017

@searhein, this should now be fixed. When you get a chance, could you check your code and close this issue if it is fixed?

@searhein
Contributor Author

@tjfulle I can confirm that #1838 fixed this issue for my code.
@kddevin @mhoemmen @tjfulle Thank you a lot for fixing this :-)

searhein pushed a commit to searhein/Trilinos that referenced this issue Oct 10, 2017
searhein pushed a commit to searhein/Trilinos that referenced this issue Oct 10, 2017
This reverts commit 8cba386, reversing
changes made to 77d0f5b.
searhein pushed a commit to searhein/Trilinos that referenced this issue Oct 10, 2017
tjfulle referenced this issue Oct 11, 2017
…evelop

* 'develop' of https://github.com/trilinos/Trilinos:
  Set up dual submits to also submit to testing-vm.sandia.gov/cdash/ (#1746)
  Tacho - added some profiling capability.
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Oct 11, 2017
Some github wizardry and black-magic removed changes committed as part of
2c4f08f.  Probably due to the fact that
@searhein and I were working concurrently on fixes to trilinos#1752.  This commit just
reapplies the fix to trilinos#1752 in 2c4f08f.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Oct 13, 2017