Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Team- and thread-level sort, sort_by_key #5317

Merged
merged 13 commits into from
Aug 31, 2022
Merged

Conversation

brian-kelley
Copy link
Contributor

@brian-kelley brian-kelley commented Aug 8, 2022

(Addresses #645)

Add sort functions that can be called from device,
and exploit team and thread level parallelism. The new functions
use bitonic sort, which is good for this because it's in-place but
highly parallel (when sorting N items, N/2 pairs are compared at once).
It's also comparison-based, so there are versions that can accept an
arbitrary comparison functor (bool operator()(a, b) returns true if key a
goes before key b, otherwise false). Any key type that can be copied in device code will work, as long as there is operator< for it (or you provide a comparator). So things like Kokkos::pair (see #4487) are fine.

sort_by_key is the same, but in addition to the keys it takes a values
view of the same length. The pairs keys(i) and values(i) are all sorted
according to the key. This is useful for sorting CRS matrices for
example.

The interfaces to sort_[team/thread] and sort_by_key[team/thread] are designed to be similar to Thrust, except they take Kokkos::Views and not iterators.

The new function signatures (8 of them, but all implemented in terms of 2 Impl:: functions):

template <class TeamMember, class ViewType>
KOKKOS_INLINE_FUNCTION void sort_team(const TeamMember& t, const ViewType& view);

template <class TeamMember, class ViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_team(const TeamMember& t, const ViewType& view, const Comparator& comp);

template <class TeamMember, class KeyViewType, class ValueViewType>
KOKKOS_INLINE_FUNCTION void sort_by_key_team(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView);

template <class TeamMember, class KeyViewType, class ValueViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_by_key_team(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView, const Comparator& comp);

template <class TeamMember, class ViewType>
KOKKOS_INLINE_FUNCTION void sort_thread(const TeamMember& t, const ViewType& view);

template <class TeamMember, class ViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_thread(const TeamMember& t, const ViewType& view, const Comparator& comp);

template <class TeamMember, class KeyViewType, class ValueViewType>
KOKKOS_INLINE_FUNCTION void sort_by_key_thread(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView);

template <class TeamMember, class KeyViewType, class ValueViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_by_key_thread(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView, const Comparator& comp);

BTW, there are 2 full sort implementations now since TeamVectorRange() and ThreadVectorRange() are functions, not types that could be templated on. But once the generic ranges and team/thread handles get added, all the functions could be in terms of 1 implementation.

I tested this myself on architectures PASCAL61, VEGA908, INTEL_XEHP, and made sure that all 8 functions are covered by the new tests.

@brian-kelley brian-kelley added the Enhancement Improve existing capability; will potentially require voting label Aug 8, 2022
@brian-kelley brian-kelley self-assigned this Aug 8, 2022
@mhoemmen
Copy link
Contributor

mhoemmen commented Aug 9, 2022

@brian-kelley Have you considered using CUB for the CUDA back-end? CUB should always come with your CUDA installation.

@brian-kelley
Copy link
Contributor Author

brian-kelley commented Aug 9, 2022

@mhoemmen Yes, the reasons I didn't use CUB is that its two block-level sorts (Radix and Merge) both require __shared__ temporary storage, and both require the input to be partitioned across threads with a fixed number of items per thread (so the lengths have some small-ish upper bound based on hardware limitations). This bitonic sort doesn't need extra space and any team/thread size can sort any length array (and the data can be any View, any layout, shared or global). So if the Kokkos user wants to use every byte of shared available for other stuff, they can still use this.

There is still a slight performance hit - when we tried to publish a KokkosKernels paper, I ran some experiments against CUB BlockRadixSort. On V100, this implementation sorted a bunch of 256-element int arrays (each one started in global, was loaded to shared or registers, and was written back to global after sorting) about 9% slower than BlockRadixSort.

Add sort functions that can be called from device,
and exploit team and thread level parallelism. The new functions
use bitonic sort, which is good for this because it's in-place but
highly parallel (when sorting N items, N/2 pairs are compared at once).
It's also comparison-based, so there are versions that can accept an
arbitrary comparison functor (operator()(a, b) returns true if key a
goes before key b).

sort_by_key is the same, but in addition to the keys it takes a values
view of the same length. The pairs keys(i) and values(i) are all sorted
according to the key. This is useful for sorting CRS matrices for
example.
@brian-kelley
Copy link
Contributor Author

brian-kelley commented Aug 10, 2022

Looks like testing had a random failure in cuda.debug_pin_um_to_host

algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
- Use existing swap function and binary less-than predicate
- Add FIXMEs about adding ceiling power-of-2 utility (used several places)
(generic across team-level and thread-level using templates)
@brian-kelley
Copy link
Contributor Author

@masterleinad Thanks for the review - I just pushed all the suggestions.

algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
algorithms/src/Kokkos_Sort.hpp Outdated Show resolved Hide resolved
Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty good. But I'd like us to put it into Experimental, and then hopefully by Kokkos 4.1 we have the more capable execution resource handles, so instead of having:

sort_team(TeamHandle, ...)
sort_thread(TeamHandle, ...)

we simply have:

sort(TeamHandle, ...)
sort(ThreadHandle, ...)
sort(InlineHandle, ...)

@dalg24
Copy link
Member

dalg24 commented Aug 11, 2022

I think this is pretty good. But I'd like us to put it into Experimental, and then hopefully by Kokkos 4.1 we have the more capable execution resource handles, so instead of having:

sort_team(TeamHandle, ...)
sort_thread(TeamHandle, ...)

we simply have:

sort(TeamHandle, ...)
sort(ThreadHandle, ...)
sort(InlineHandle, ...)

Agree about Experimental::. Maybe put it in a separate header.

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still avoid some duplicated code. Otherwise, this looks good to me.

algorithms/src/Kokkos_NestedSort.hpp Outdated Show resolved Hide resolved
- in Sort, put kokkos includes together
- in NestedSort, include Kokkos_Core.hpp so that it can be used
standalone
- in NestedSort, move #includes inside include guard
@brian-kelley
Copy link
Contributor Author

All the suggestions so far are pushed now

@masterleinad
Copy link
Contributor

27: [ RUN      ] openmptarget.NestedSort
27: OpenMPTarget backend requires a minimum of 32 threads per team.

@brian-kelley
Copy link
Contributor Author

Looks like the test machine fetnat03 has a full disk.

@brian-kelley
Copy link
Contributor Author

brian-kelley commented Aug 15, 2022

This last round of testing exposed a little issue with TeamPolicy<OpenMPTarget> - vector_length_max() returned 32, but with that high of a value Kokkos::AUTO can't also satisfy the requirement that team size is at least 32.

Does this deserve its own issue? Somewhat related to #4685 , which was resolved by the suggestion to always use AUTO for team size of dummy policies.

- remove code duplication (only doing compare+swap in one place)
- fix team size < 32 on OMPTarget
algorithms/src/Kokkos_NestedSort.hpp Outdated Show resolved Hide resolved
algorithms/unit_tests/TestSort.hpp Outdated Show resolved Hide resolved
algorithms/unit_tests/TestSort.hpp Outdated Show resolved Hide resolved
algorithms/unit_tests/TestSort.hpp Outdated Show resolved Hide resolved
algorithms/unit_tests/TestSort.hpp Outdated Show resolved Hide resolved
algorithms/unit_tests/TestSortCommon.hpp Outdated Show resolved Hide resolved
- Don't duplicate logic to randomly generate offsets/keys
- Use standard library to generate randoms on host for offsets
when EXPECT_EQ, EXPECT_TRUE tests fail.
@brian-kelley
Copy link
Contributor Author

@dalg24 I just pushed those suggestions (except I didn't change anything about the prefix sum)

@brian-kelley
Copy link
Contributor Author

@crtrott Could you give this another review? I moved the functions to Kokkos::Experimental.

@@ -35,6 +35,7 @@ foreach(Tag Threads;Serial;OpenMP;Cuda;HPX;HIP;SYCL;OpenMPTarget)
"#include <Test${Tag}_Category.hpp>\n"
"#include <TestRandomCommon.hpp>\n"
"#include <TestSortCommon.hpp>\n"
"#include <TestNestedSort.hpp>\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not a separate cpp source file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer generating multiple source file but this do not need to be handled here if you open an issue for it

Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just open that additional issue, we also should try what happens if you would use this at the top level (i.e. with an execution space instance ...)

@@ -35,6 +35,7 @@ foreach(Tag Threads;Serial;OpenMP;Cuda;HPX;HIP;SYCL;OpenMPTarget)
"#include <Test${Tag}_Category.hpp>\n"
"#include <TestRandomCommon.hpp>\n"
"#include <TestSortCommon.hpp>\n"
"#include <TestNestedSort.hpp>\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open an issue so we can split this and not end up with a single huge object file.

@crtrott crtrott merged commit e50a7be into kokkos:develop Aug 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Improve existing capability; will potentially require voting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants