Add method to return the k most common items and speed up most_common_*() methods #25
Summary

This PR proposes the following changes to improve the flexibility and efficiency of `Counter`'s API for obtaining the most common items in the counter:

- Add a method `k_most_common_ordered()` to return a vector of the k most common items.
- Speed up the `most_common_*()` methods, including `most_common_tiebreaker()`.

Motivation

At present `Counter` offers no way for a user to specifically ask for the k most common items, where k may be less than the length of the counter, n. While it is easy enough to sort all of the items and truncate the result, such code would be quite inefficient if k is small compared to n: it sorts the entire list of n items, which has complexity O(n * log n), when all we need to do is select the top k most common items and then sort just those k items.

Selecting just the top k items can be accomplished efficiently by using a binary heap, and the resulting algorithm can be much more efficient than sorting all n items. For a fixed value of k, this algorithm scales with increasing n as n + O(log n), instead of the O(n * log n) required to sort all n items. In the extreme case of k = 1, it requires only n - 1 comparisons. In the limit as k approaches n, the algorithm approaches a heapsort of the n items.
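The heap-based selection described above can be sketched in plain Rust with `std::collections::BinaryHeap`. This is an illustrative sketch of the technique only, not counter-rs's actual code; the function name, signature, and key type are hypothetical:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Return the k items with the largest counts, ordered most common first.
/// Illustrative sketch of heap-based selection; not counter-rs's
/// actual implementation.
fn k_most_common(counts: &[(char, usize)], k: usize) -> Vec<(char, usize)> {
    // Min-heap (via `Reverse`) holding the k largest counts seen so far.
    let mut heap: BinaryHeap<Reverse<(usize, char)>> = BinaryHeap::with_capacity(k);
    for &(item, count) in counts {
        if heap.len() < k {
            heap.push(Reverse((count, item)));
        } else if let Some(&Reverse((min_count, _))) = heap.peek() {
            // Replace the heap's minimum only when we see a larger count,
            // so most items cost just this one comparison.
            if count > min_count {
                heap.pop();
                heap.push(Reverse((count, item)));
            }
        }
    }
    // Popping yields ascending counts; reverse for most-common-first order.
    let mut result = Vec::with_capacity(heap.len());
    while let Some(Reverse((count, item))) = heap.pop() {
        result.push((item, count));
    }
    result.reverse();
    result
}
```

Keeping a min-heap of only k entries bounds each push or pop at O(log k), and any item that does not beat the heap's current minimum costs a single comparison, which is what makes the scan cheap for small k.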
This is precisely what Python's implementation in `collections.Counter.most_common()` does, calling `heapq.nlargest()`. For a more detailed analysis of the complexity of the algorithm, see these notes in the Python source code. The implementation here is fairly similar.

For counter-rs there is an additional advantage to providing a dedicated API for the k most common items. Since the existing `most_common_*()` functions return owned values of the key type `T`, and therefore have to clone the keys, a method for obtaining only the top k most common items can get away with cloning only k keys instead of all n. It would be nice to add a method which implements this more efficient algorithm for when only the top k most common items are desired. A check of some dependents of counter-rs shows 2 of the 9 listed crates using `Counter::most_common()` for the purpose of finding the single most common item: see examples here and here.

Finally, `most_common_tiebreaker()` uses the stable `slice::sort_by()` algorithm (based on timsort) instead of `slice::sort_unstable_by()` (based on quicksort). There is a definite performance improvement to be had by switching to the unstable sort, which the documentation recommends in cases where the stability guarantee is not needed. In the case of `Counter`, there is no reason to preserve the relative pre-sort order of items which compare as equal, since this order is already unspecified due to the hashing function.

Benchmarks
I made some simple benchmarks to compare the times of `most_common_ordered()` and `k_most_common_ordered()` for various values of `k`, as well as to compare `slice::sort_by()` and `slice::sort_unstable_by()` in `most_common_ordered()` with counters of various lengths.

Some of the benchmarks use counters where the counts were just `0..n`, so all of the counts are distinct and there are no ties. These cases used two different key types, `usize` and `String`, so that we test both with a `Copy` type and a type where there is a nontrivial cost to cloning the keys. The `String` keys all had length 16.

Another set of benchmarks used a more realistic example by counting the frequencies of all the words appearing in The Complete Works of William Shakespeare. (Here "words" are contiguous runs of ASCII alphabetic characters separated by non-alphabetic characters.) There are a total of 988,852 words, of which 25,469 are distinct. There are of course many ties, that is, words having the same counts, and ties are much more common among low-frequency words than among high-frequency words. This situation is another advantage for `k_most_common_ordered()`: it has to break ties less often than does `most_common_ordered()`, and tiebreaking is a more expensive comparison operation. This set of benchmarks used counters with both borrowed (`&str`) and owned (`String`) keys.

All times are in units of microseconds. All counts are of type `usize`.
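The word-splitting rule used for the Shakespeare counters (maximal runs of ASCII alphabetic characters, with everything else treated as a separator) can be sketched as follows; this illustrates the stated rule, not the benchmark's actual code:

```rust
use std::collections::HashMap;

/// Count word frequencies, where a "word" is a maximal run of ASCII
/// alphabetic characters. Illustrative sketch, not the benchmark's code.
fn word_counts(text: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    // Splitting on every non-alphabetic char produces empty pieces
    // between adjacent separators, which we skip.
    for word in text.split(|c: char| !c.is_ascii_alphabetic()) {
        if !word.is_empty() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```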
stable vs unstable sorting
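The change being measured in this section is only the choice of sort function. A minimal sketch of producing the most-common-first ordering with the unstable sort follows; the comparator (descending count, then ascending key) is illustrative, not necessarily counter-rs's exact tiebreak:

```rust
// Sketch: most-common-first ordering via the unstable sort.
// The tiebreak on ascending key is illustrative only.
fn most_common_ordered(mut items: Vec<(String, usize)>) -> Vec<(String, usize)> {
    items.sort_unstable_by(|(key_a, count_a), (key_b, count_b)| {
        // Descending by count; equal counts fall back to key order.
        count_b.cmp(count_a).then_with(|| key_a.cmp(key_b))
    });
    items
}
```

Since the tiebreak comparator already imposes a total order on equal-count items, the stability guarantee of `slice::sort_by()` buys nothing here, and the unstable variant avoids its extra allocation and bookkeeping.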
Comparison of `Counter::<usize>::most_common_ordered()` when using `slice::sort_by()` and `slice::sort_unstable_by()` for counters of size `n`:

*(timing table not reproduced)*
Similar tests except the keys are of type `String` and all keys have length 16:

*(timing table not reproduced)*
Comparison using the counters containing the frequencies of words in Shakespeare's complete works, with key types `&str` and `String`:

*(timing table not reproduced)*
k most common items
These benchmarks use counters of size 10000 with `usize` and `String` keys. We compare the times of `k_most_common_ordered()` for various values of `k` against those of `most_common_ordered()` when using both the stable and unstable sorts.

*(timing tables for key types `usize` and `String` not reproduced)*
Similar comparisons using counters of the frequencies of words in Shakespeare's complete works (25,469 keys), with key types `&str` and `String`:

*(timing tables not reproduced)*