Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use explicit staging buffer for constant mem optimization and make synchronization less pessimistic #3234

Merged
merged 1 commit into from
Jul 29, 2020

Conversation

streichler
Copy link

This pull request does two related things:

  1. It introduces an explicit staging buffer in pinned host memory for kernel launch data that's destined for
    device constant memory, allowing a clear separation between the part of the transfer that's synchronous
    (i.e. the transfer from the caller's stack to the staging buffer) and the part that's asynchronous (the transfer
    from the staging buffer to actual device constant memory).
    Note that the staging buffer's property of being per-device (as opposed to per-stream or truly global) is
    a bit awkward for the current Kokkos implementation. I'm tracking the constant buffer using static members
    of the CudaInternal class so that they're properly shared by all Kokkos::Cuda instances (that are right now
    all on the same device), but changes to Kokkos to better support multiple devices will need to move these to
    whatever object tracks per-device information.

  2. Cleans up the host/device synchronization to be both correct (i.e. actually guarantee that the copy from
    the caller's stack is complete before returning) and more precise (i.e. waiting only for the most recent grid
    launch to use the constant mem buffer rather than waiting for all previously issued CUDA work, whether or
    not its using the constant mem buffer). A side benefit from a Legion+Kokkos interop perspective is that the
    use of cudaEventSynchronize plays more nicely with Legion task parallelism than cudaDeviceSynchroinze
    does.
    Although it is not attempted in this PR, the use of events for the host/device synchronization permits a further
    performance optimization in which multiple grid launches that collectively fit within the 32KB constant mem
    buffer could be permitted to run concurrently. A new launch would then just wait on exactly the events for the
    grids using constant mem buffer locations that are about to be reused. The current implementation represents
    the degenerate case in which every grid is assumed to need the whole 32KB.

@dalg24-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

@dalg24
Copy link
Member

dalg24 commented Jul 28, 2020

OK to test

@streichler
Copy link
Author

Changed three occurrences of 0 to nullptr to address build failures.

Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@crtrott crtrott merged commit f6ae396 into kokkos:develop Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants