
Add method to return the k most common items and speed up most_common_*() methods #25

Merged

Merged 3 commits on Jul 16, 2022

Conversation

@clint-white (Contributor) commented on Jul 9, 2022:

Summary

This PR proposes the following changes to improve the flexibility and efficiency of Counter's API for obtaining the most common items in a counter:

  1. Add a dedicated method, k_most_common_ordered(), that returns a vector of the k most common items.
  2. Switch from a stable to an unstable sorting algorithm in most_common_tiebreaker().
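
For illustration, here is a minimal usage sketch of the new method. This example is mine, not taken from the PR; it assumes the return type mirrors the existing most_common_*() methods and that ties are broken by key order, as in most_common_ordered():

    use counter::Counter;

    let counter: Counter<char> = "abracadabra".chars().collect();

    // Ask for just the top 2 items instead of sorting all of them.
    let top2 = counter.k_most_common_ordered(2);

    // 'a' occurs 5 times; 'b' and 'r' both occur twice, and the tie
    // is broken by key order, so 'b' wins.
    assert_eq!(top2, vec![('a', 5), ('b', 2)]);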

Motivation

At present Counter offers no way for a user to specifically ask for the k most common items, where k may be less than the length of the counter n. While it is easy enough to do something like the following, such code would be quite inefficient if k is small compared to n:

let mut most_common = counter.most_common();
most_common.truncate(k);

The above code sorts the entire list of n items, which has complexity O(n * log n), when all we need to do is to select the top k most common items and then sort just those k items. Selecting just the top k items can be accomplished efficiently by using a binary heap. The resulting algorithm can be much more efficient than sorting all n items. For a fixed value of k, this algorithm scales with increasing n as n + O(log n), instead of O(n * log n) when sorting all n items. For the extreme case of k = 1, this requires only n - 1 comparisons. In the limit as k approaches n, this algorithm approaches a heapsort of the n items.
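To make the heap-based selection concrete, here is a hedged sketch of the technique (not the PR's actual code; the function top_k and its signature are invented for illustration). A min-heap of size k holds the k largest items seen so far; each remaining item is compared against the heap's minimum and replaces it only if larger:

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    /// Select the `k` largest items, then sort just those `k` items.
    fn top_k<T: Ord>(items: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
        let mut heap: BinaryHeap<Reverse<T>> = BinaryHeap::with_capacity(k);
        for item in items {
            if heap.len() < k {
                heap.push(Reverse(item));
            } else if let Some(mut min) = heap.peek_mut() {
                // Replace the current minimum only if the new item is
                // larger; the heap re-sifts when `PeekMut` is dropped.
                if min.0 < item {
                    *min = Reverse(item);
                }
            }
        }
        // Ascending order of `Reverse<T>` is descending order of `T`,
        // so the largest item comes first.
        heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
    }

In the counter setting, the heap items would be (count, key) pairs so that the comparison is by count first.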

This is precisely what Python's implementation in collections.Counter.most_common() does, calling heapq.nlargest(). For a more detailed analysis of the complexity of the algorithm, see these notes in the Python source code. The implementation here is fairly similar.

For counter-rs there is an additional advantage to providing a dedicated API for the k most common items. Since the existing most_common_*() functions return owned values of the key type T and therefore have to clone the keys, a method for obtaining only the top k most common items could get away with cloning only k keys instead of all n.

It would be nice to add a method that implements this more efficient algorithm for when only the top k most common items are desired. A check of some dependents of counter-rs shows that 2 of the 9 listed crates use Counter::most_common() solely to find the single most common item: see examples here and here.

Finally, most_common_tiebreaker() uses the stable slice::sort_by() algorithm (based on timsort) instead of slice::sort_unstable_by() (based on pattern-defeating quicksort). There is a definite performance improvement to be had by switching to the unstable sort, and the standard library documentation recommends it in cases where the stability guarantee is not needed. In the case of Counter, there is no reason to preserve the relative pre-sort order of items that compare as equal, since that order is already unspecified due to the hashing function.
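
The change itself is small. A sketch of the kind of call involved (my example, not the crate's exact comparator):

    let mut items: Vec<(char, usize)> = vec![('a', 2), ('b', 5), ('c', 5)];

    // Stable:   items.sort_by(|l, r| r.1.cmp(&l.1));
    // Unstable: typically faster and sorts in place, at the cost of an
    // unspecified relative order for equal elements.
    items.sort_unstable_by(|l, r| r.1.cmp(&l.1));

    // Counts are now descending; the relative order of ('b', 5) and
    // ('c', 5) is unspecified, which is fine for Counter's purposes.
    assert_eq!(items[2], ('a', 2));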

Benchmarks

I made some simple benchmarks to compare the times of most_common_ordered() and k_most_common_ordered() for various values of k, as well as to compare slice::sort_by() and slice::sort_unstable_by() in most_common_ordered() with counters of various lengths.

Some of the benchmarks used counters whose counts were just 0..n, so all of the counts are distinct and there are no ties. These cases used two different key types, usize and String, so that we test both a Copy type and a type with a nontrivial cost to cloning the keys. The string keys all had length 16.

Another set of benchmarks used a more realistic example: counts of the frequencies of all the words appearing in The Complete Works of William Shakespeare. (Here "words" are contiguous runs of ASCII alphabetic characters separated by non-alphabetic characters.) There are a total of 988,852 words, of which 25,469 are distinct. There are of course many ties, that is, words having the same count, and ties are much more common among low-frequency words than among high-frequency words. This situation is another advantage for k_most_common_ordered(): it has to break ties less often than most_common_ordered() does, and tiebreaking is a more expensive comparison operation. This set of benchmarks used counters with both borrowed (&str) and owned (String) keys.

All times are in units of microseconds. All counts are of type usize.

stable vs unstable sorting

Comparison of Counter::<usize>::most_common_ordered() when using slice::sort_by() and slice::sort_unstable_by() for counters of size n:

| n      | stable | unstable | % change |
|--------|--------|----------|----------|
| 100    | 2.8533 | 2.1542   | -25.070  |
| 1000   | 42.303 | 25.604   | -40.206  |
| 10000  | 559.32 | 294.52   | -48.002  |
| 100000 | 6668.8 | 3426.1   | -48.638  |

Similar tests except the keys are of type String and all keys have length 16:

| n      | stable | unstable | % change |
|--------|--------|----------|----------|
| 100    | 5.1853 | 4.8483   | -11.408  |
| 1000   | 64.252 | 58.435   | -11.118  |
| 10000  | 745.59 | 657.32   | -16.144  |
| 100000 | 9100.4 | 6965.4   | -23.460  |

Comparison using the counters containing the frequencies of words in Shakespeare's complete works:

| T      | stable | unstable | % change |
|--------|--------|----------|----------|
| &str   | 4889.7 | 2814.5   | -42.393  |
| String | 5489.4 | 4083.6   | -25.610  |

k most common items

These benchmarks use counters of size 10000 with usize and String keys. We compare the times of k_most_common_ordered() for various values of k against those of most_common_ordered() when using both the stable and unstable sorts.

| T      | k = 1  | k = 10 | k = 100 | k = 1000 | k = 5000 | stable | unstable |
|--------|--------|--------|---------|----------|----------|--------|----------|
| usize  | 31.341 | 32.882 | 45.239  | 136.90   | 366.11   | 559.32 | 294.52   |
| String | 20.259 | 24.331 | 37.139  | 153.75   | 473.45   | 745.59 | 657.32   |

Similar comparisons using counters of the frequencies of words in Shakespeare's complete works (25,469 distinct keys):

| T      | k = 1  | k = 10 | k = 100 | k = 1000 | k = 10000 | stable | unstable |
|--------|--------|--------|---------|----------|-----------|--------|----------|
| &str   | 44.445 | 45.209 | 64.230  | 436.42   | 3929.9    | 4889.7 | 2814.5   |
| String | 44.347 | 45.695 | 66.491  | 475.45   | 4146.6    | 5489.4 | 4083.6   |

Functionally this method is equivalent to the following:

    let mut items = counter.most_common_ordered();
    items.truncate(k);
    items

However, by using a binary heap to keep only the `k` most common items, the implementation can be more efficient, often much more so, than sorting all the items.

There is no reason to preserve the relative order of items that compare equal, since that order is already unspecified due to the hashing used by the `HashMap`. The unstable sorting algorithm (quicksort) is typically faster than the stable sorting algorithm (timsort) and does not require the temporary storage used by the latter.

Benchmarks comparing `Counter::most_common_ordered()` and `Counter::k_most_common_ordered(n)`, where `n` is the length of the counter, found that the heapsort used by `k_most_common_ordered()` was typically faster than the stable sort used by `most_common_ordered()`. However, this is no longer the case after switching to the unstable sort; the quicksort implementation is typically a little faster than the heapsort, so it is now better for `k_most_common_ordered()` to fall back to `most_common_ordered()` when `k >= n`.
@coriolinus (Owner) commented:

Wow! I don't have time right now to look at this, but I will attempt to get a proper review in within one week. If I bust that deadline, please feel free to drop a note here to remind me.

I hadn't previously used BinaryHeap::peek_mut, and wasn't sure about its semantics when changing the value, so I wanted to ensure that we really weren't going to lose any data. With this test, I am pretty confident in this implementation.
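
For reference, a small standalone sketch of the `peek_mut` semantics in question (my example, not the PR's test): writing through the `PeekMut` guard re-heapifies on drop, so no element is lost:

    use std::collections::BinaryHeap;

    fn main() {
        let mut heap = BinaryHeap::from([3, 5, 8]);
        if let Some(mut top) = heap.peek_mut() {
            assert_eq!(*top, 8);
            // Overwrite the current maximum; the heap invariant is
            // restored when the `PeekMut` guard is dropped.
            *top = 1;
        }
        // Nothing was lost: 3 and 5 remain, and 8 was replaced by 1.
        assert_eq!(heap.into_sorted_vec(), vec![1, 3, 5]);
    }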
@coriolinus (Owner) left a comment:

This is really good work! Thank you for contributing it, and in particular for the thorough documentation.


// Step 3. Sort the items in the heap with the second phase of heapsort. The number of
// comparisons is 2 * k * log2(k) + O(k).
heap.into_sorted_vec()
@coriolinus (Owner) commented:

Too bad about rust-lang/rust#76234, huh?

@clint-white (Contributor, Author) replied:


Yeah, exactly. That would be useful here.

Incidentally, I originally used .into_vec() followed by slice::sort_unstable() instead of .into_sorted_vec(), as suggested in the discussion on that PR:

let mut topk = heap.into_vec();
topk.sort_unstable();

In my benchmarks, .into_sorted_vec() is slightly faster for small heaps, but as the heap size grows it becomes faster to take the vector out of the heap and use slice::sort_unstable(). I was on the fence for a while about what to do here. In the end I thought it was a little more direct to just use .into_sorted_vec(). Besides, as k approaches n the algorithm becomes a heapsort, so if nothing else it is a little easier to reason about what is happening, at least for me.

Anyway, it would certainly be an option to take the vector out of the heap and sort it instead of finishing the heapsort in step 3, if you like that approach better. But as long as k is small compared to n it doesn't matter which way we do it; either is much better than sorting the entire counter.

@coriolinus merged commit dab9b14 into coriolinus:master on Jul 16, 2022.
@clint-white (Contributor, Author) commented:

Thanks for reviewing it and for the feedback, as well as for adding the better test.

@clint-white deleted the k-most-common-ordered branch on October 1, 2022.