use explicit staging buffer for constant mem optimization and make synchronization less pessimistic #3234
This pull request does two related things:
1. It introduces an explicit staging buffer in pinned host memory for kernel launch data destined for device constant memory, allowing a clear separation between the part of the transfer that is synchronous (i.e. the copy from the caller's stack to the staging buffer) and the part that is asynchronous (the copy from the staging buffer to actual device constant memory).

   Note that the staging buffer's property of being per-device (as opposed to per-stream or truly global) is a bit awkward for the current Kokkos implementation. I'm tracking the constant buffer using static members of the `CudaInternal` class so that they are properly shared by all `Kokkos::Cuda` instances (which are, right now, all on the same device), but changes to Kokkos to better support multiple devices will need to move these to whatever object tracks per-device information.
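The staging pattern can be sketched roughly as follows (identifiers here are illustrative, not the actual Kokkos ones): the copy into the pinned staging buffer completes before the function returns, so the caller's stack can be reused immediately, while the copy into constant memory is issued asynchronously on the stream.

```cpp
#include <cuda_runtime.h>
#include <cstring>

__constant__ unsigned char kernel_arg_buffer[32768];  // 32KB of constant memory

static void* staging = nullptr;  // per-device, pinned host memory

void launch_with_constant_args(const void* args, std::size_t bytes,
                               cudaStream_t stream) {
  if (staging == nullptr)
    cudaMallocHost(&staging, 32768);   // pinned allocation, done once per device
  std::memcpy(staging, args, bytes);   // synchronous: caller's stack -> staging
  cudaMemcpyToSymbolAsync(kernel_arg_buffer, staging, bytes, 0,
                          cudaMemcpyHostToDevice,
                          stream);     // asynchronous: staging -> constant memory
  // ... launch the kernel on `stream`, then record an event so a later
  // launch can wait before overwriting the staging buffer.
}
```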
2. It cleans up the host/device synchronization to be both correct (i.e. it actually guarantees that the copy from the caller's stack is complete before returning) and more precise (i.e. it waits only for the most recent grid launch to use the constant memory buffer, rather than for all previously issued CUDA work, whether or not it uses the constant memory buffer). A side benefit from a Legion+Kokkos interop perspective is that `cudaEventSynchronize` plays more nicely with Legion task parallelism than `cudaDeviceSynchronize` does.
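A minimal sketch of this event-based scheme (again with illustrative names, not the actual Kokkos identifiers): each launch that uses the constant buffer records an event, and the next launch synchronizes only on that event rather than draining the whole device with `cudaDeviceSynchronize`.

```cpp
#include <cuda_runtime.h>

static cudaEvent_t constant_mem_reusable;
static bool event_created = false;

// Wait only for the last grid that used the constant buffer, not for
// all outstanding CUDA work.
void before_reusing_constant_mem() {
  if (event_created) cudaEventSynchronize(constant_mem_reusable);
}

// Record an event after issuing a launch that reads the constant buffer.
void after_launch(cudaStream_t stream) {
  if (!event_created) {
    cudaEventCreateWithFlags(&constant_mem_reusable, cudaEventDisableTiming);
    event_created = true;
  }
  cudaEventRecord(constant_mem_reusable, stream);
}
```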
Although it is not attempted in this PR, the use of events for host/device synchronization permits a further performance optimization in which multiple grid launches that collectively fit within the 32KB constant memory buffer could be permitted to run concurrently. A new launch would then wait on exactly the events of the grids whose constant memory buffer locations are about to be reused. The current implementation represents the degenerate case in which every grid is assumed to need the whole 32KB.