ctest -j and kokkos and hwloc #1104

Closed
bathmatt opened this issue Mar 3, 2017 · 14 comments
Labels
impacting: configure or build, impacting: tests, type: question

Comments

@bathmatt
Contributor

bathmatt commented Mar 3, 2017

@nmhamster @rppawlo
Is there any procedure for how to test in parallel on the various systems?
kokkos/kokkos#630
points out an issue with OpenMP and thread binding.

I'm looking for a recipe for what to configure with and how to test on the different platforms, particularly OpenMPI/RHEL, ellis, ride, and shiller (CPU and GPU).

What are people using? Are you just reverting to -j1?

@bathmatt added the impacting: configure or build, impacting: tests, and type: question labels Mar 3, 2017
@jjellio
Contributor

jjellio commented Mar 5, 2017

@bathmatt
I've been building/testing on Mutrino (Cray) with OpenMP. I never enable HWLOC. Here is a testing process I've found that works reasonably well:

export CORES_PER_TEST=4
export HT_PER_CORE=2

let OMP_NUM_THREADS=CORES_PER_TEST*HT_PER_CORE
export OMP_NUM_THREADS

# These -D options go in the cmake configure invocation.
# There are two other CMake variables you can use (pre/post numprocs flags);
# I found it simpler to have this all on one line.
  -D MPI_EXEC:PATH="aprun" \
  -D MPI_EXEC_NUMPROCS_FLAG:STRING="-e;OMP_PLACES=cores;-e;OMP_DISPLAY_ENV=verbose;-d;${OMP_NUM_THREADS};-j;${HT_PER_CORE};-cc;depth;-n" \

# Assumes you are in a batch environment: compile on the compute node.
aprun -n1 make -j

# First, run lots of tests in parallel. On HSW, -j8 is OK;
# on KNL you can compute a larger j.

ctest -j8  |& tee parallel_shotgun_test.log

# Some tests fail if they are run in parallel with other tests.
# This is probably due to poor binding, but I don't know
# how to manage concurrent apruns, so rerun the failed tests
# with -j1.

ctest --rerun-failed -j1 |& tee potentially_okay_tests.log

# Tests that fail with -j1 are clear failures, so I rerun those with -VV.
# Usually these failures are things like Zoltan1, which has unit
# tests that execute a shell script that spawns mpirun
# processes... they will never work without providing an mpirun
# wrapper on Cray. But at least this is automated.
ctest --rerun-failed -VV   |& tee guaranteed_failures.log

I don't have a better answer, but this works OK for me. I queue these as batch jobs that configure + build. On Cray, if you don't want to use batch jobs, you need to configure inside an interactive job, because the path to aprun is different in the batch environment versus on the login node. I've requested they make this path consistent, because it breaks the ability to configure/compile on the login node, then create an interactive job and run ctest. I suspect I am the only person experiencing this headache.
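
For illustration, a batch job wrapping this workflow might look roughly like the following; the scheduler directives, paths, and the do-configure helper script are placeholders, not my exact setup:

#!/bin/bash
# Rough sketch of a batch job that configures, builds, and tests on the
# compute node. Adjust directives/paths for your scheduler and site.
#SBATCH -N 1
#SBATCH -t 04:00:00

cd $HOME/trilinos-build                  # placeholder build directory
./do-configure                           # script holding the cmake -D options above
aprun -n1 make -j
ctest -j8                |& tee parallel_shotgun_test.log
ctest --rerun-failed -j1 |& tee potentially_okay_tests.log
ctest --rerun-failed -VV |& tee guaranteed_failures.log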

@bathmatt
Contributor Author

bathmatt commented Mar 7, 2017

Thanks for the suggestion. I haven't gotten to Mutrino yet, still working on other platforms :)

@olivier-snl

olivier-snl commented Mar 10, 2017

I had a conversation with @crtrott about this issue. In a nutshell, the parallel tests are getting bound to the same core, thus the oversubscription and subsequent performance degradation. His strategy is to do binding to the socket and then allow the OS to manage the placement / migration within the socket. This can look different across the spectrum of OpenMP, Kokkos, etc. and different implementations of MPI, so we could talk about the specific configurations you are trying.
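
For concreteness, a minimal sketch of that socket-binding strategy, assuming OpenMPI and an OpenMP 4.0+ runtime (the executable name and counts are placeholders; flags and variables will differ for other MPI implementations and compilers):

# Bind each MPI rank to a socket; let the OS manage thread placement within it.
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=false   # no OpenMP-level pinning; threads may migrate within the socket

mpirun -np 2 --map-by socket --bind-to socket ./my_test.exe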

@bathmatt
Contributor Author

@olivier-snl I don't really have a standard configuration; I can look at removing HWLOC if you think that would help. Ideally I'd want some procedure that works on the major test beds (mutrino/ellis/shiller/rhel6) and allows me to parallelize my tests.

If you have time next week, maybe we can sit down and chat over code? We can do this virtually.

@olivier-snl

@bathmatt Sure. Let's follow up off-thread to arrange.

@nmhamster
Contributor

@bathmatt @olivier-snl One of the issues here is that Kokkos utilizes HWLOC ahead of OpenMP. There is a plan to back off on this by checking the environment for OMP_ variables and, if these are found, allowing the OpenMP runtime to do the binding instead of HWLOC. I understand this doesn't fully address @bathmatt's original question, but you might want to keep this in mind.
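
For reference, the kind of OMP_ variables such a check would presumably look for are shown below; this is just an illustration of the idea, not the actual Kokkos logic, and the values and executable name are placeholders:

# If variables like these are present, the proposal is for Kokkos to skip its
# HWLOC-based binding and let the OpenMP runtime place threads.
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
./my_kokkos_test.exe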

@olivier-snl

@nmhamster Yes, I was aware of something along those lines. I think it addresses part of the problem, which is Kokkos/OpenMP. The other parts of the problem seem to be ctest itself, and MPI when used.

@nmhamster
Contributor

@olivier-snl - @bartlettroscoe mentioned that we might be able to use the Kitware contract to ask them to look into this a little bit. What we are really asking is for CTest to be scheduler-aware. One way this could work is for it to check the environment for SLURM_ variables and only run one instance of ctest/cmake on SLURM's "0" process (see the sketch below). We would need a similar level of support for LSF and PBS/Torque as well, but it shouldn't be horrendously difficult.
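
Sketched in shell, the idea looks something like this (not an existing CTest feature; SLURM_PROCID is the real SLURM rank variable, the rest is illustrative):

# Only the SLURM "0" process drives ctest; other ranks do nothing, so a single
# scheduler allocation doesn't spawn competing ctest/cmake instances.
if [ "${SLURM_PROCID:-0}" -eq 0 ]; then
  ctest -j8
fi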

@olivier-snl

@nmhamster That would be extremely helpful. In my discussions with @crtrott he indicates that he is seeing ctest launch multiple simultaneous test executions onto the same core(s). My reading of some of the ctest docs is that some test users actually want this oversubscription, presumably because they are testing for correctness rather than performance. Oversubscription is not a good fit for us, of course.

@nmhamster
Contributor

@olivier-snl - I think what is happening is that ctest is launching -j <N> variants of the test, but it isn't applying any binding itself. When each test launches mpirun, MPI performs a binding to cores/sockets based on how we request it in the Trilinos configure. So the overlapping of cores is really because we have <N> independent MPI runs going, all of which think they own the entire CPU set (because we have told them to do that).
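
One way to see this overlap directly, assuming OpenMPI (the binary name is a placeholder), is to launch two copies concurrently and compare what --report-bindings prints:

# Each mpirun binds its ranks without knowing about the other launch, so both
# typically report the same cores -- the oversubscription described above.
mpirun -np 4 --report-bindings ./my_test.exe &
mpirun -np 4 --report-bindings ./my_test.exe &
wait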

@olivier-snl

@nmhamster Yes, either the same Trilinos-configured MPI binding, or a default binding chosen by the MPI implementation, is being replicated across the tests, and the cores end up oversubscribed, it seems.

@bartlettroscoe
Member

@bartlettroscoe mentioned that we might be able to use the KitWare contract to ask them to look into this a little bit.

Yes, this would fall under the current Kitware contract, which supports an SNL project that we are not allowed to name here but which cares a lot about this stuff.

Can we set up a short meeting to discuss this so that I understand what is really needed from CTest and how our CMake projects (e.g. using TriBITS) will be able to hook into that (hopefully seamlessly)? Who needs to attend this meeting? Once I understand what is needed from CTest, I can bring this up at a future Kitware meeting and get something put on the backlog for them to work on.

But this will require upgrading the version of CMake/CTest being used on all platforms where HWLOC is used (and conditional logic will need to be added to TriBITS for whether the CTest feature is there or not). Is everyone ready for that? @bathmatt, is your team ready to upgrade CMake/CTest to take advantage of this? In the past, you expressed some trepidation about upgrading CMake/CTest on various machines (for example, to take better advantage of Ninja).

@olivier-snl

@bartlettroscoe On the SNL side, I'd suggest inviting @crtrott, @nmhamster, @bathmatt, and @olivier-snl, but all may not be necessary.

@bartlettroscoe
Member

We had a meeting with Kitware staff and they will add support to CTest to better handle pinning tests to cores, so that concurrent tests do not overlap on the same cores. This will be tracked in:
