use explicit staging buffer for constant mem optimization and make synchronization less pessimistic #3234
This pull request does two related things:
1. It introduces an explicit staging buffer in pinned host memory for kernel launch data destined for device constant memory, allowing a clear separation between the part of the transfer that is synchronous (i.e. the copy from the caller's stack to the staging buffer) and the part that is asynchronous (the copy from the staging buffer to actual device constant memory).

   Note that the staging buffer's property of being per-device (as opposed to per-stream or truly global) is a bit awkward for the current Kokkos implementation. I'm tracking the constant buffer using static members of the `CudaInternal` class so that they are properly shared by all `Kokkos::Cuda` instances (which are, right now, all on the same device), but changes to Kokkos to better support multiple devices will need to move these to whatever object tracks per-device information.
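The staging pattern can be sketched roughly as follows (identifiers here are illustrative, not the actual Kokkos ones): the copy into the pinned staging buffer completes before the function returns, so the caller's stack can be reused immediately, while the copy into constant memory is issued asynchronously on the stream.

```cpp
#include <cuda_runtime.h>
#include <cstring>

__constant__ unsigned char kernel_arg_buffer[32768];  // 32KB of constant memory

static void* staging = nullptr;  // per-device, pinned host memory

void launch_with_constant_args(const void* args, std::size_t bytes,
                               cudaStream_t stream) {
  if (staging == nullptr)
    cudaMallocHost(&staging, 32768);   // pinned allocation, done once per device
  std::memcpy(staging, args, bytes);   // synchronous: caller's stack -> staging
  cudaMemcpyToSymbolAsync(kernel_arg_buffer, staging, bytes, 0,
                          cudaMemcpyHostToDevice,
                          stream);     // asynchronous: staging -> constant memory
  // ... launch the kernel on `stream`, then record an event so a later
  // launch can wait before overwriting the staging buffer.
}
```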
2. It cleans up the host/device synchronization to be both correct (i.e. it actually guarantees that the copy from the caller's stack is complete before returning) and more precise (i.e. it waits only for the most recent grid launch to use the constant memory buffer, rather than for all previously issued CUDA work, whether or not it uses the constant memory buffer). A side benefit from a Legion+Kokkos interop perspective is that `cudaEventSynchronize` plays more nicely with Legion task parallelism than `cudaDeviceSynchronize` does.
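A minimal sketch of this event-based scheme (again with illustrative names, not the actual Kokkos identifiers): each launch that uses the constant buffer records an event, and the next launch synchronizes only on that event rather than draining the whole device with `cudaDeviceSynchronize`.

```cpp
#include <cuda_runtime.h>

static cudaEvent_t constant_mem_reusable;
static bool event_created = false;

// Wait only for the last grid that used the constant buffer, not for
// all outstanding CUDA work.
void before_reusing_constant_mem() {
  if (event_created) cudaEventSynchronize(constant_mem_reusable);
}

// Record an event after issuing a launch that reads the constant buffer.
void after_launch(cudaStream_t stream) {
  if (!event_created) {
    cudaEventCreateWithFlags(&constant_mem_reusable, cudaEventDisableTiming);
    event_created = true;
  }
  cudaEventRecord(constant_mem_reusable, stream);
}
```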
Although it is not attempted in this PR, the use of events for host/device synchronization permits a further performance optimization in which multiple grid launches that collectively fit within the 32KB constant memory buffer could be permitted to run concurrently. A new launch would then wait on exactly the events of the grids whose constant memory buffer locations are about to be reused. The current implementation represents the degenerate case in which every grid is assumed to need the whole 32KB.