
[THRUST] Faster multi dimensional argsort by segmented sort #7195

Merged · 4 commits merged into apache:main on Jan 13, 2021

Conversation

@masahi (Member) commented on Jan 4, 2021

The current implementation of thrust argsort is very inefficient when given a multi-dimensional input to sort along the innermost axis: it makes n_iter separate calls to thrust sort. See

```cpp
int n_iter = 1;
for (int i = 0; i < input->ndim - 1; ++i) {
  n_iter *= input->shape[i];
}
thrust::copy(data_ptr, data_ptr + n_iter * n_values, values_ptr);
// n_iter separate sort calls, one per outer "row" of the input
for (int i = 0; i < n_iter; ++i) {
  n_values = get_sort_len(i);
  thrust::sequence(indices_ptr, indices_ptr + n_values);
  if (is_ascend) {
    thrust::sort_by_key(values_ptr, values_ptr + n_values, indices_ptr);
  } else {
    thrust::sort_by_key(values_ptr, values_ptr + n_values, indices_ptr,
                        thrust::greater<DataType>());
  }
  // (rest of the loop body omitted in this excerpt)
}
```
When the outer dimension is large, the performance of thrust argsort is far from optimal. In particular, the thrust numbers shown in the TIR mergesort PR #7099 do not reflect the true performance thrust can achieve.

This PR replaces the n_iter calls to thrust argsort with a single segmented sort by key. Since thrust doesn't provide an API for segmented sort, I used a neat back-to-back stable-sort-by-key trick explained in https://groups.google.com/forum/#!topic/thrust-users/BoLsxO6b4FY. My implementation is a bit more complicated because we need a segmented sort by key, not just a segmented sort.
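For reference, here is a minimal sketch of that back-to-back stable-sort trick applied to a segmented argsort (names like `segmented_argsort` and `segment_ids` are illustrative, not the actual code in this PR):

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// Sketch: argsort `values` independently within each segment.
// `segment_ids[i]` tells which segment element i belongs to, and `indices`
// is pre-filled with 0..len-1 restarting at each segment boundary.
void segmented_argsort(thrust::device_vector<float>& values,
                       thrust::device_vector<int>& indices,
                       thrust::device_vector<int>& segment_ids) {
  // Pass 1: stable sort all elements by value, dragging (index, segment_id)
  // along as the payload.
  auto payload = thrust::make_zip_iterator(
      thrust::make_tuple(indices.begin(), segment_ids.begin()));
  thrust::stable_sort_by_key(values.begin(), values.end(), payload);

  // Pass 2: stable sort by segment id, dragging (index, value) along.
  // Because the sort is stable, elements within each segment keep the value
  // order established in pass 1, so `indices` now holds the per-segment
  // argsort result.
  auto rest = thrust::make_zip_iterator(
      thrust::make_tuple(indices.begin(), values.begin()));
  thrust::stable_sort_by_key(segment_ids.begin(), segment_ids.end(), rest);
}
```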

Here are the numbers I get using the same benchmark script as in #7099, measured on a GTX 1070 Ti. When the outer dimension is small (e.g. the (2, 2, 2000) case), my change is slightly slower due to the overhead of the two stable_sort_by_key calls, but otherwise it is much faster than the current implementation.

Also, I removed tvm.contrib.thrust.sort_nms and argsort_nms_thrust, since they are not used anymore.

please review @kazum @Laurawly
(cc @mbrookhart when you are back, this should be exciting for you!)

| Shape | Current thrust | After this PR | TIR mergesort (#7099) |
| --- | --- | --- | --- |
| (2000, 2, 2) | 0.17 | 0.29 | 1.63 |
| (2, 2000, 2) | 0.17 | 0.31 | 1.62 |
| (2, 2, 2000) | 0.16 | 0.30 | 1.62 |
| (4000, 2, 2) | 0.18 | 0.30 | 3.80 |
| (2, 4000, 2) | 0.18 | 0.30 | 3.24 |
| (2, 2, 4000) | 0.17 | 0.30 | 3.92 |
| (2, 12000, 2) | 0.52 | 0.59 | 12.96 |
| (2, 2, 12000) | 0.57 | 0.58 | 11.40 |
| (12000, 2, 2) | 0.63 | 0.59 | 11.77 |
| (2000, 8, 8) | 2.55 | 0.90 | 4.08 |
| (8, 2000, 8) | 2.43 | 0.81 | 3.56 |
| (8, 8, 2000) | 2.47 | 0.88 | 3.02 |
| (4000, 8, 8) | 2.55 | 1.43 | 9.37 |
| (8, 4000, 8) | 2.54 | 1.42 | 9.68 |
| (8, 8, 4000) | 2.53 | 1.41 | 6.14 |
| (12000, 8, 8) | 14.72 | 3.32 | 39.37 |
| (8, 12000, 8) | 13.79 | 3.24 | 40.27 |
| (8, 8, 12000) | 13.29 | 3.19 | 25.87 |

@mbrookhart (Contributor) commented:
This looks great. My only concern is that some object detection models (I'm thinking of Gluon SSD) sort a very large number of boxes before NMS. Could you add the shapes (1, 1e5) and (1, 1e6) to your test? I expect my mergesort to fail badly, but I wonder what the difference between your implementation and the current thrust implementation will be.

@mbrookhart (Contributor) commented:
Also, I think you and I are using different versions of CUDA on the same GPU; that might explain the difference between the numbers I posted in #7099 and the ones you posted here.

@trevor-m (Contributor) commented on Jan 4, 2021

Nice! Have you also looked at CUB's DeviceSegmentedRadixSort::SortPairsDescending? It sounds like exactly what you need, with no tricks required. It's used by some fast NMS implementations, such as TensorRT's.
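For context, the CUB call mentioned here looks roughly like the sketch below; it takes per-segment begin/end offsets and uses the usual two-phase temp-storage pattern. Buffer names (`d_keys_in`, `d_offsets`, etc.) are made up for illustration, and this is not code from this PR:

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hedged sketch of cub::DeviceSegmentedRadixSort::SortPairsDescending usage.
// `d_offsets` has num_segments + 1 entries delimiting the segments.
void cub_segmented_sort_desc(const float* d_keys_in, float* d_keys_out,
                             const int* d_values_in, int* d_values_out,
                             int num_items, int num_segments,
                             const int* d_offsets) {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // First call with a null temp buffer only computes temp_storage_bytes.
  cub::DeviceSegmentedRadixSort::SortPairsDescending(
      d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
      d_values_in, d_values_out, num_items, num_segments,
      d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Second call performs the actual segmented descending sort.
  cub::DeviceSegmentedRadixSort::SortPairsDescending(
      d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
      d_values_in, d_values_out, num_items, num_segments,
      d_offsets, d_offsets + 1);
  cudaFree(d_temp_storage);
}
```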

@masahi (Member, Author) commented on Jan 4, 2021

@mbrookhart I have a fast path for the one-segment case, so the perf is the same between the current and new implementations for those shapes. I'll update the condition to work for dimensions other than two.

```cpp
// Current fast-path check: a 1-D input, or a 2-D input with a single row,
// is sorted with a single (non-segmented) thrust call.
if (input->ndim == 1 || (input->ndim == 2 && input->shape[0] == 1)) {
```
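A hedged sketch of what the generalized condition could look like (take the fast path whenever there is only one segment, i.e. every dimension except the innermost has extent 1; this is illustrative, not necessarily the exact code that landed):

```cpp
// Illustrative only: single-segment check covering any number of dimensions.
bool single_segment = true;
for (int i = 0; i < input->ndim - 1; ++i) {
  if (input->shape[i] != 1) {
    single_segment = false;
    break;
  }
}
if (single_segment) {
  // a plain (non-segmented) thrust sort suffices here
}
```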

@trevor-m Yes, I briefly looked at cub's segmented sort. My impression is that it launches one thread block per segment, which sounds great when there are many segments to sort and each segment is not too big. I'm not sure that is a good fit for our use case: we are more likely to sort a few large segments, and most likely we only have one segment. I'm actually surprised to hear that TRT uses cub's segmented sort.

@masahi (Member, Author) commented on Jan 12, 2021

@Laurawly @kazum This is ready to go, please have a look

masahi merged commit 1d07f1a into apache:main on Jan 13, 2021
@masahi (Member, Author) commented on Jan 13, 2021

Thanks @mbrookhart @trevor-m

masahi added a commit to masahi/tvm that referenced this pull request on Jan 14, 2021:

* remove sort nms
* add segmented sort by key impl
* bug fix, test pass
* updated fast path condition to work for all dims
masahi added a commit to masahi/tvm that referenced this pull request on Jan 18, 2021
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request on Jan 20, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request on Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request on Feb 18, 2021