Nalu-Wind segfaulting with newer Trilinos #802

Closed
sayerhs opened this issue Feb 15, 2021 · 6 comments

@sayerhs
Contributor

sayerhs commented Feb 15, 2021

With the latest Trilinos (trilinos/Trilinos@e990f2e8c81), I am seeing segfaults when running Nalu-Wind (full stack trace below). With an older version from before the STK updates (trilinos/Trilinos@b0dec83156b), I don't get this error.

MPT: (gdb) #0  0x00002ab3a470114c in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002ab3a4a3ec96 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7ffcc0db2290 "MPT ERROR: Rank 1(g:1) received signal SIGSEGV(11).\n\tProcess ID: 260225, Host: r1i7n35, Program: /home/sanantha/exawind/source/nalu-wind/build_test/naluX\n\tMPT Version: HPE MPT 2.22  03/31/20 16:12:29\n") at sig.c:340
MPT: #3  0x00002ab3a4a3ee8f in first_arriver_handler (signo=signo@entry=11,
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2ab3affc0080) at sig.c:489
MPT: #4  0x00002ab3a4a3f123 in slave_sig_handler (signo=11,
MPT:     siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5  <signal handler called>
MPT: #6  0x00002ab3a61c8d2e in stk::mesh::Selector stk::mesh::selectUnion<std::vector<stk::mesh::Part*, std::allocator<stk::mesh::Part*> > >(std::vector<stk::mesh::Part*, std::allocator<stk::mesh::Part*> > const&) ()
MPT:    from /nopt/nrel/ecom/exawind/exawind-2020-09-21/install/gcc/trilinos-2021-02-15/lib/libstk_mesh_base.so.13
MPT: #7  0x00002ab3a37669cf in sierra::nalu::AuxFunctionAlgorithm::execute() ()
MPT:     at ../src/AuxFunctionAlgorithm.C:62
MPT: #8  0x00002ab3a38c9f4e in sierra::nalu::Realm::populate_initial_condition (
MPT:     this=0x3073ac0)
MPT:     at /nopt/nrel/ecom/hpacf/compilers/2020-07/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.4.0/gcc-8.4.0-2a3vha6hlw4xc5ja3jyhr7huzaxuw2kt/include/c++/8.4.0/bits/stl_vector.h:930
MPT: #9  0x00002ab3a3922866 in sierra::nalu::TimeIntegrator::prepare_for_time_integration (this=this@entry=0x33ded80) at ../src/TimeIntegrator.C:161
MPT: #10 0x00002ab3a3923394 in sierra::nalu::TimeIntegrator::integrate_realm (
MPT:     this=0x33ded80) at ../src/TimeIntegrator.C:312
MPT: #11 0x00002ab3a38f8953 in sierra::nalu::Simulation::run (
MPT:     this=this@entry=0x7ffcc0db39c0) at ../src/Simulation.C:173
MPT: #12 0x0000000000411ccc in main () at ../nalu.C:178
MPT: #13 0x00002ab3ada96505 in __libc_start_main () from /lib64/libc.so.6
MPT: #14 0x00000000004129be in _start ()
MPT:     at /nopt/nrel/ecom/hpacf/compilers/2020-07/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.4.0/gcc-8.4.0-2a3vha6hlw4xc5ja3jyhr7huzaxuw2kt/include/c++/8.4.0/bits/ios_base.h:170

Cc: @ldh4

@alanw0
Contributor

alanw0 commented Feb 15, 2021

OK, we'll look into this. There's a good chance that this is a case of passing a null part pointer to a Selector.
Since this case in particular appears to pass a part-vector to the selectUnion function, we can change selectUnion on the STK side to ignore null pointers in the part-vector.
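
For illustration, the idea of ignoring null pointers in a part-vector can be sketched as below. This is only a minimal sketch under stated assumptions: `Part` is a hypothetical stand-in type rather than the real `stk::mesh::Part`, and `drop_null_parts` is an illustrative helper name, not an STK function. It is not the change that was actually made in STK.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for stk::mesh::Part; only the pointer value matters here.
struct Part {
  std::string name;
};

using PartVector = std::vector<Part*>;

// Illustrative helper (not an STK API): return a copy of `parts`
// with every null entry dropped, preserving the original order.
PartVector drop_null_parts(const PartVector& parts)
{
  PartVector valid;
  valid.reserve(parts.size());
  std::copy_if(parts.begin(), parts.end(), std::back_inserter(valid),
               [](const Part* p) { return p != nullptr; });
  return valid;
}
```

A caller could apply such a filter to the part-vector before building a union selector from it, so that a single missing part no longer leads to a null dereference.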

@sayerhs
Contributor Author

sayerhs commented Feb 15, 2021

Thanks @alanw0 ... there was a null part. I've opened #803 to fix this on the nalu-wind side.

@sayerhs
Contributor Author

sayerhs commented Feb 16, 2021

@rcknaus With recent updates to Trilinos/STK, conduction_p4 and taylorGreenVortex_p3 have started segfaulting. @alanw0 indicated that there is a new behavior where Selector does not accept nullptr for Part. We had a few other tests where this was an issue, and fixing those nullptr parts eliminated the segfaults. Could the parts promotion be producing nullptr Part* entries in these higher-order tests?

@rcknaus
Contributor

rcknaus commented Feb 16, 2021

I can take a look at it. I check for nullptr parts before adding them to the selector, but it's possible I missed it somewhere.
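
One way to find a missed check on the caller side is to record which part names failed to resolve rather than silently skipping them. The sketch below is hypothetical throughout: `Part`, `PartRegistry`, `find_part`, and `resolve_parts` are stand-ins invented for illustration, with the registry playing the role of the mesh metadata lookup. It does not reproduce Nalu-Wind or STK code.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for stk::mesh::Part.
struct Part {
  std::string name;
};

// Hypothetical name-to-part registry standing in for the mesh metadata.
using PartRegistry = std::map<std::string, Part>;

// Look up a part by name; returns nullptr when the name is unknown.
const Part* find_part(const PartRegistry& registry, const std::string& name)
{
  const auto it = registry.find(name);
  return it == registry.end() ? nullptr : &it->second;
}

struct ResolvedParts {
  std::vector<const Part*> parts;    // non-null parts, in request order
  std::vector<std::string> missing;  // names whose lookup returned nullptr
};

// Resolve each requested name, keeping valid parts and recording the rest.
ResolvedParts resolve_parts(const PartRegistry& registry,
                            const std::vector<std::string>& names)
{
  ResolvedParts result;
  for (const auto& name : names) {
    if (const Part* part = find_part(registry, name)) {
      result.parts.push_back(part);
    } else {
      result.missing.push_back(name);
    }
  }
  return result;
}
```

Logging the `missing` list before building a selector would show directly whether a promoted part name was never registered.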

@sayerhs
Contributor Author

sayerhs commented Feb 21, 2021

References: trilinos/Trilinos#8727 and trilinos/Trilinos#8784

@sayerhs
Contributor Author

sayerhs commented Feb 23, 2021

Fixed as of the Feb 23rd run (CDash link).

sayerhs closed this as completed Feb 23, 2021