[THRUST] Faster multi dimensional argsort by segmented sort #7195

masahi · 2021-01-04T03:15:54Z

Current implementation of thrust argsort, when given multi dimensional inputs to sort along the inner most axis, is very inefficient: it does n_iter calls to thrust sort. See

tvm/src/runtime/contrib/thrust/thrust.cu

Lines 50 to 65 in bad149e

 int n_iter = 1; 

 for (int i = 0; i < input->ndim - 1; ++i) { 

 n_iter *= input->shape[i]; 

 } 

 thrust::copy(data_ptr, data_ptr + n_iter * n_values, values_ptr); 

 for (int i = 0 ; i < n_iter; ++i) { 

 n_values = get_sort_len(i); 

 thrust::sequence(indices_ptr, indices_ptr + n_values); 

 if (is_ascend) { 

 thrust::sort_by_key(values_ptr, values_ptr + n_values, indices_ptr); 

 } else { 

 thrust::sort_by_key(values_ptr, values_ptr + n_values, indices_ptr, 

 thrust::greater<DataType>()); 

 }

When the outer dimension is large, the performance of thrust argsort is far from optimal. In particular, the thrust numbers shown in the TIR mergesort PR #7099 do not reflect the true performance thrust can achieve.

This PR replaces n_iter calls to thrust argsort with one segmented sort by key. Since thrust doesn't provide API to do segmented sort, I used a neat back-to-back stable-sort-by-key trick explained in https://groups.google.com/forum/#!topic/thrust-users/BoLsxO6b4FY. My implementation is a bit more complicated because we need to do segmented sort by key, not just segmented sort.

Here are the numbers I get using the same benchmark script used in #7099, measured on GTX 1070 ti. When the outer dimension is small (like 2, 2, 2000 case), my change makes it slower due to the overhead from two calls to stable_sort_by_key. But other than that, it is much faster than one we have now.

Also, I removed tvm.contrib.thrust.sort_nms and argsort_nms_thrust, since they are not used anymore.

please review @kazum @Laurawly
(cc @mbrookhart when you are back, this should be exciting for you!)

Shape	current thrust	after this PR	TIR mergesort
(2000, 2, 2)	0.17	0.29	1.63
(2, 2000, 2)	0.17	0.31	1.62
(2, 2, 2000)	0.16	0.30	1.62
(4000, 2, 2)	0.18	0.30	3.80
(2, 4000, 2)	0.18	0.30	3.24
(2, 2, 4000)	0.17	0.30	3.92
(2, 12000, 2)	0.52	0.59	12.96
(2, 2, 12000)	0.57	0.58	11.40
(12000, 2, 2)	0.63	0.59	11.77
(2000, 8, 8)	2.55	0.90	4.08
(8, 2000, 8)	2.43	0.81	3.56
(8, 8, 2000)	2.47	0.88	3.02
(4000, 8, 8)	2.55	1.43	9.37
(8, 4000, 8)	2.54	1.42	9.68
(8, 8, 4000)	2.53	1.41	6.14
(12000, 8, 8)	14.72	3.32	39.37
(8, 12000, 8)	13.79	3.24	40.27
(8, 8, 12000)	13.29	3.19	25.87

mbrookhart · 2021-01-04T17:03:39Z

This looks great. My only concern would possibly be that some object detection models (I'm thinking gluon SSD) have a very large number of boxes they sort before NMS. Could you add shapes (1, 1e5) and (1, 1e6) to your test? I expect my mergesort will fail badly, but I wonder what the difference between your implementation and the current thrust implementation will be.

mbrookhart · 2021-01-04T17:06:41Z

Also, I think you and I are using different versions of CUDA for the same GPU, that might explain the difference in the numbers I posted in #7099 and you posted here.

trevor-m · 2021-01-04T18:54:22Z

Nice! Have you also looked at CUB's DeviceSegmentedRadixSort::SortPairsDescending ? It sounds like it is exactly what you need with no tricks required. It's used by some fast NMS implementations such as TensorRT.

masahi · 2021-01-04T20:26:15Z

@mbrookhart I have a fast-path for one segment case, so the perf is the same between current / new. I'll update the condition to work for dimension other than two.

tvm/src/runtime/contrib/thrust/thrust.cu

Line 57 in 26254f5

if (input->ndim == 1 || (input->ndim == 2 && input->shape[0] == 1)) {

@trevor-m Yes I briefly looked at cub's segmented sort. My impression is that it launches one thread block per segment. This sounds great when there are many segments to sort and each of segment is not so big. I'm not sure if that is a good fit for our use case - I think we are more likely to sort a few, but large segments, and most likely we only have one segment. I'm actually surprised to hear that TRT uses cub's segmented sort.

masahi · 2021-01-12T01:59:04Z

@Laurawly @kazum This is ready to go, please have a look

masahi · 2021-01-13T06:42:21Z

Thanks @mbrookhart @trevor-m

) * remove sort nms * add segmented sort by key impl * bug fix, test pass * updated fast path condition to work for all dims

masahi added 3 commits January 4, 2021 12:04

remove sort nms

8b792a3

add segmented sort by key impl

697956d

bug fix, test pass

26254f5

updated fast path condition to work for all dims

05c4611

ZihengJiang added the status: need review label Jan 6, 2021

mbrookhart approved these changes Jan 12, 2021

View reviewed changes

masahi merged commit 1d07f1a into apache:main Jan 13, 2021

masahi added a commit to masahi/tvm that referenced this pull request Jan 14, 2021

[THRUST] Faster multi dimensional argsort by segmented sort (apache#7195

34e843c

) * remove sort nms * add segmented sort by key impl * bug fix, test pass * updated fast path condition to work for all dims

masahi added a commit to masahi/tvm that referenced this pull request Jan 18, 2021

[THRUST] Faster multi dimensional argsort by segmented sort (apache#7195

8ef7715

) * remove sort nms * add segmented sort by key impl * bug fix, test pass * updated fast path condition to work for all dims

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021

[THRUST] Faster multi dimensional argsort by segmented sort (apache#7195

91636d2

) * remove sort nms * add segmented sort by key impl * bug fix, test pass * updated fast path condition to work for all dims

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[THRUST] Faster multi dimensional argsort by segmented sort #7195

[THRUST] Faster multi dimensional argsort by segmented sort #7195

masahi commented Jan 4, 2021 •

edited

Loading

mbrookhart commented Jan 4, 2021

mbrookhart commented Jan 4, 2021

trevor-m commented Jan 4, 2021

masahi commented Jan 4, 2021 •

edited

Loading

masahi commented Jan 12, 2021

masahi commented Jan 13, 2021

	int n_iter = 1;
	for (int i = 0; i < input->ndim - 1; ++i) {
	n_iter *= input->shape[i];
	}

	thrust::copy(data_ptr, data_ptr + n_iter * n_values, values_ptr);

	for (int i = 0 ; i < n_iter; ++i) {
	n_values = get_sort_len(i);
	thrust::sequence(indices_ptr, indices_ptr + n_values);
	if (is_ascend) {
	thrust::sort_by_key(values_ptr, values_ptr + n_values, indices_ptr);
	} else {
	thrust::sort_by_key(values_ptr, values_ptr + n_values, indices_ptr,
	thrust::greater<DataType>());
	}

[THRUST] Faster multi dimensional argsort by segmented sort #7195

[THRUST] Faster multi dimensional argsort by segmented sort #7195

Conversation

masahi commented Jan 4, 2021 • edited Loading

mbrookhart commented Jan 4, 2021

mbrookhart commented Jan 4, 2021

trevor-m commented Jan 4, 2021

masahi commented Jan 4, 2021 • edited Loading

masahi commented Jan 12, 2021

masahi commented Jan 13, 2021

masahi commented Jan 4, 2021 •

edited

Loading

masahi commented Jan 4, 2021 •

edited

Loading