
median action in persistence extensions #4345

Merged
merged 8 commits into openhab:main from mherwege:persistence_ext_median
Aug 24, 2024

Conversation

mherwege
Contributor

Closes #4342

@jimtng FYI The median calculation uses a quickSelect algorithm. Would it make sense to put that somewhere in a utils class to be reusable for group median calculations? I just don't know where to put it.

Signed-off-by: Mark Herwege <[email protected]>
@mherwege mherwege requested a review from a team as a code owner August 12, 2024 17:36
@jimtng
Contributor

jimtng commented Aug 12, 2024

I agree that the group function should use the same code so we don't duplicate code.

I came up with a slightly different implementation. I'm still working on writing a benchmark using JMH; I'll plug in your version too and see. Isn't that what Copilot suggested?

I was also thinking about not having to call quick select twice for even numbered data, but haven't yet tried to figure out how.

@mherwege
Contributor Author

Isn't that what copilot suggested?

No, I didn't use Copilot; I used the links provided and another Python implementation as reference. Mimicking that would have copied the array too many times though, and I wanted to avoid copying.

thinking about not having to call quick select twice for even numbered data, but haven't yet tried to figure out how

I had a similar thought, but couldn't figure it out either. The challenge is that the resulting list is still not sorted, although the number of iterations over it should be smaller.

@jimtng
Contributor

jimtng commented Aug 12, 2024

Very interesting! My VSCode copilot originally suggested almost the exact code that you submitted here. I didn't like it because of all the swaps it's doing. I'm curious to see the benchmark results, but I've gotta go out all day today, so can't play with it yet :(

@jimtng
Contributor

jimtng commented Aug 12, 2024

The challenge is that the resulting list is still not sorted, although the number of iterations over it should be smaller.

Yes, once quickselect has tossed the lower half in looking for k, we only need to look for k+1 in the upper half.

@mherwege
Contributor Author

Yes, once quickselect has tossed the lower half in looking for k, we only need to look for k+1 in the upper half.

I think it's k-1, and you need to look in the lower part. Actually, you can also search for the max value in that lower part as the second value. But another quickSelect on half of the list may be more efficient, as the lower part may already have some sorting. That's not predictable though, while it is for max.
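The idea discussed here can be sketched in Java (an illustrative sketch with a first-element pivot, not the PR code): after quickselect has placed the element at index k, every element to its left is less than or equal to it, so for an even-sized list the second middle value is simply the maximum of the lower part.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EvenMedianSketch {

    // Median of a non-empty list; works on a copy so the input is untouched.
    public static double median(List<Double> input) {
        List<Double> list = new ArrayList<>(input);
        int n = list.size();
        int k = n / 2;
        double upper = quickSelect(list, 0, n - 1, k);
        if (n % 2 == 1) {
            return upper;
        }
        // Even case: no second quickselect pass needed. After quickselect,
        // all elements at indices < k are <= list.get(k), so the other
        // middle value is the max of the (still unsorted) lower part.
        double lower = Collections.max(list.subList(0, k));
        return (lower + upper) / 2.0;
    }

    // Classic iterative quickselect with first-element pivot.
    private static double quickSelect(List<Double> list, int left, int right, int k) {
        while (true) {
            if (left == right) {
                return list.get(left);
            }
            int pivotIndex = partition(list, left, right);
            if (k == pivotIndex) {
                return list.get(k);
            } else if (k < pivotIndex) {
                right = pivotIndex - 1;
            } else {
                left = pivotIndex + 1;
            }
        }
    }

    // Lomuto-style partition: pivot ends up in its final sorted position.
    private static int partition(List<Double> list, int left, int right) {
        double pivot = list.get(left); // first element as pivot
        int store = left;
        for (int i = left + 1; i <= right; i++) {
            if (list.get(i) < pivot) {
                store++;
                Collections.swap(list, store, i);
            }
        }
        Collections.swap(list, left, store);
        return store;
    }
}
```

This keeps the in-place swapping that benchmarked well, while the even case costs only one extra linear scan instead of a second quickselect.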

@mherwege
Contributor Author

mherwege commented Aug 13, 2024

Very interesting! My VSCode copilot originally suggested almost the exact code that you submitted here. I didn't like it because of all the swaps it's doing. I'm curious to see the benchmark results, but I've gotta go out all day today, so can't play with it yet :(

I first coded something without the swaps, but that creates copies of arrays and subarrays (sometimes using streams), making the resulting list immutable in most cases. I think it is much less memory efficient and adds extra overhead for creating and garbage collecting objects.

Copilot probably got this from GitHub. One of the links I provided as reference in the code comments (which I mostly used) was from a GitHub gist.

Signed-off-by: Mark Herwege <[email protected]>
@mherwege
Contributor Author

@jimtng I put in an improvement to limit searching to part of the list for the second value.

@jimtng
Contributor

jimtng commented Aug 13, 2024

I've run some benchmarks, and your quickselect + swap is the winner! Stream/sorted isn't a slouch either, but the quickselect is definitely faster. My version with partitioning did worse than stream/sorted, perhaps because it's doing a lot of list inserts during the partitioning process.

One thing that can be tweaked in your implementation: instead of using a random pivot, just pick left (the first item). This results in a significant improvement.

The result is the throughput (operations/s), the higher the better.

Benchmark                                                    Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectWithSwapEvenFirstPivot   thrpt   25  222721.917 ± 37755.022  ops/s
MedianBenchmark.medianByQuickSelectWithSwapEvenRandomPivot  thrpt   25  131946.607 ±  5721.808  ops/s
MedianBenchmark.medianByQuickSelectWithSwapOddFirstPivot    thrpt   25  209490.898 ± 55048.181  ops/s
MedianBenchmark.medianByQuickSelectWithSwapOddRandomPivot   thrpt   25  124596.605 ±  5639.394  ops/s

Here's that data with the min/avg/max/stddev.

Benchmark                                                    Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectWithSwapEvenFirstPivot   thrpt   25  222721.917 ± 37755.022  ops/s
  (min, avg, max) = (155523.695, 222721.917, 292013.368), stdev = 50401.874
MedianBenchmark.medianByQuickSelectWithSwapEvenRandomPivot  thrpt   25  131946.607 ±  5721.808  ops/s
  (min, avg, max) = (122720.078, 131946.607, 143127.277), stdev = 7638.451
MedianBenchmark.medianByQuickSelectWithSwapOddFirstPivot    thrpt   25  209490.898 ± 55048.181  ops/s
  (min, avg, max) = (134181.737, 209490.898, 330302.120), stdev = 73487.746
MedianBenchmark.medianByQuickSelectWithSwapOddRandomPivot   thrpt   25  124596.605 ±  5639.394  ops/s
  (min, avg, max) = (115113.769, 124596.605, 139453.563), stdev = 7528.429

Suggestions:

  • Change random pivot to simply picking the first/left element.
  • Make a class called Statistics (or some other, better name?) and give it a public BigDecimal median(List<BigDecimal>) method
  • The QuickSelect may be a hidden / private class (or method) of this Statistics. What we need is the median, not QuickSelect, unless you can think of other uses?
  • Put this class in org.openhab.core/utils alongside HexUtils, ColorUtils, so it can be used by the group function too.
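The suggested utility could look roughly like this (a hypothetical sketch: the class and method names follow the suggestion above, and a plain sort stands in for the private quickselect internals):

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical shape of the suggested utility class; the real class,
// package, and internal algorithm are decided in the PR.
public class Statistics {

    // Returns the median of the values, or null if the list is empty.
    public static BigDecimal median(List<BigDecimal> inputList) {
        if (inputList.isEmpty()) {
            return null;
        }
        List<BigDecimal> list = new ArrayList<>(inputList); // defensive copy
        Collections.sort(list); // stand-in for the private quickselect
        int n = list.size();
        if (n % 2 == 1) {
            return list.get(n / 2);
        }
        // Even case: average of the two middle values.
        return list.get(n / 2 - 1).add(list.get(n / 2))
                .divide(BigDecimal.valueOf(2), MathContext.DECIMAL128);
    }
}
```

Callers (persistence extensions and group functions alike) would then only depend on `Statistics.median(...)`, keeping quickselect an implementation detail.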

@mherwege
Contributor Author

mherwege commented Aug 13, 2024

@jimtng Great, I will work on that. Note that I made some further improvements in the meantime to avoid running through quickSelect twice, keeping track of the next value below the median and putting it in the k-1 place when partitioning. There are now also tests specifically testing the algorithm.

  • Change random pivot to simply picking the first/left element.

Makes sense, the random function is probably too expensive.

  • The QuickSelect may be a hidden / private class (or method) of this Statistics. What we need is the median, not QuickSelect, unless you can think of other uses?

The algorithm can be used for percentiles as well, but we can then extend the class if needed. For testing, I may opt to keep the method with package visibility.

@jimtng
Contributor

jimtng commented Aug 13, 2024

Benchmark                                                         Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectForcePreviousOddFirstPivot    thrpt   25  186976.530 ± 49105.006  ops/s
MedianBenchmark.medianByQuickSelectForcePreviouspEvenFirstPivot  thrpt   25  203130.562 ± 29159.844  ops/s
MedianBenchmark.medianByQuickSelectWithSwapEvenFirstPivot        thrpt   25  196479.397 ± 69621.600  ops/s
MedianBenchmark.medianByQuickSelectWithSwapOddFirstPivot         thrpt   25  180868.419 ± 33301.633  ops/s

Here's the benchmark project / code, in case you want to play with it / change it around:
https://github.com/jimtng/MedianBenchmark

Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
@mherwege
Contributor Author

@jimtng The class is now external. The PR checks seem to be messed up at the moment, but I think it should be OK.

@mherwege
Contributor Author

The core build with this PR succeeded. For some reason unknown to me, the CI builds failed due to timeouts and the DCO check never finished (I believe all commits are signed). As the core build succeeds, this looks like a build infrastructure issue rather than a code issue.

@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/feature-request-median/157728/25

* @return median of the values, null if the list is empty
*/
public static @Nullable BigDecimal median(List<BigDecimal> inputList) {
ArrayList<BigDecimal> bdList = new ArrayList<>(inputList); // Make a copy that will get reordered
Contributor

Do you think we could skip this step and document that this method will mutate the inputList? Then, if the caller doesn't want its list mutated, it needs to create a copy of its list before passing it to median(). This avoids having to allocate the list twice.

Contributor Author

Yes, we could. But it would also force us to check that the provided inputList is mutable; copying makes sure of that. Also, what happens if the input list is changed in another thread?
Copying avoids all these problems. There is indeed an overhead, so it is worth considering, but I am not sure at the moment.

Member

Copying locally seems more safe.
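The trade-off discussed in this thread can be illustrated with a small hypothetical example: without the defensive copy, an in-place reordering would fail outright on immutable inputs such as `List.of(...)`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo, not PR code: why the defensive copy matters.
public class DefensiveCopyDemo {

    // Reorders a copy, so callers may pass immutable or shared lists.
    public static int middleAfterReorder(List<Integer> input) {
        List<Integer> work = new ArrayList<>(input); // copy: safe for List.of(...)
        Collections.sort(work); // stand-in for the in-place quickselect swaps
        return work.get(work.size() / 2);
    }

    // Demonstrates that an immutable input is accepted thanks to the copy;
    // sorting List.of(...) directly would throw UnsupportedOperationException.
    public static boolean acceptsImmutable() {
        try {
            middleAfterReorder(List.of(3, 1, 2));
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }
}
```

Documenting mutation instead would save one allocation, at the cost of pushing the immutability and thread-safety concerns onto every caller.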

Member

@J-N-K J-N-K left a comment

As a general question: Quickselect has an average complexity of O(n) and a worst case of O(n²). The standard Java sort algorithm is a merge sort variant with a guaranteed complexity of O(n log n). Do we really expect so many datapoints that the difference between n and n log n matters? This is surely not the case for group functions, where I would expect the number of elements to be between 10 and 20 in nearly all cases. For persistence it might be different, but does a factor of 2 or so on the average case really outweigh the code complexity?


private static @Nullable State internalMedianBetween(Item item, @Nullable ZonedDateTime begin,
@Nullable ZonedDateTime end, @Nullable String serviceId) {
String effectiveServiceId = serviceId == null ? getDefaultServiceId() : serviceId;
Member

Suggested change
String effectiveServiceId = serviceId == null ? getDefaultServiceId() : serviceId;
String effectiveServiceId = Objects.requireNonNullElse(serviceId, getDefaultServiceId());

Contributor Author

This won't work, as getDefaultServiceId() is @Nullable as well.
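A small demo of why the suggestion fails (with hypothetical stand-ins for the PR methods): `Objects.requireNonNullElse` throws a `NullPointerException` when its fallback is also null, whereas the ternary simply evaluates to null.

```java
import java.util.Objects;

// Hypothetical stand-ins illustrating the review discussion above.
public class NullFallbackDemo {

    static String getDefaultServiceId() {
        return null; // the real method is @Nullable too
    }

    // The ternary in the PR: may yield null, but never throws.
    static String ternary(String serviceId) {
        return serviceId == null ? getDefaultServiceId() : serviceId;
    }

    // The suggested replacement: throws NPE when both values are null.
    static boolean requireThrows(String serviceId) {
        try {
            Objects.requireNonNullElse(serviceId, getDefaultServiceId());
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

So `requireNonNullElse` only fits when the fallback is guaranteed non-null, as in the `begin`/`end` cases below where `now` is always set.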

return null;
}
ZonedDateTime now = ZonedDateTime.now();
ZonedDateTime beginTime = begin == null ? now : begin;
Member

Suggested change
ZonedDateTime beginTime = begin == null ? now : begin;
ZonedDateTime beginTime = Objects.requireNonNullElse(begin, now);

Contributor Author

OK

}
ZonedDateTime now = ZonedDateTime.now();
ZonedDateTime beginTime = begin == null ? now : begin;
ZonedDateTime endTime = end == null ? now : end;
Member

Suggested change
ZonedDateTime endTime = end == null ? now : end;
ZonedDateTime endTime = Objects.requireNonNullElse(end, now);

Contributor Author

OK

Comment on lines 1696 to 1699
Item baseItem = item;
if (baseItem instanceof GroupItem groupItem) {
baseItem = groupItem.getBaseItem();
}
Member

Suggested change
Item baseItem = item;
if (baseItem instanceof GroupItem groupItem) {
baseItem = groupItem.getBaseItem();
}
Item baseItem = item instanceof GroupItem groupItem ? groupItem.getBaseItem() : item;

Contributor Author

Changed.

if (baseItem instanceof GroupItem groupItem) {
baseItem = groupItem.getBaseItem();
}
Unit<?> unit = baseItem instanceof NumberItem numberItem ? numberItem.getUnit() : null;
Member

In case baseItem is not a NumberItem: shouldn't the method log a warning and return immediately?

Contributor Author

Then it would not work for DimmerItem, RollershutterItem, ... anymore. In theory, SwitchItem would also work, though probably not very meaningfully.
None of the other persistence extension methods have this check, so I don't see why I would introduce it here.

@NonNullByDefault
public class Statistics {

public static boolean randomQuickSelectSeed = false; // Can be enabled to always create a random pivot index for the
Member

What would be the benefit of that? I can't imagine a use-case where random pivots would be beneficial.

Contributor Author

It made testing the algorithm easy and allowed me to stay close to the textbook algorithm. Removed.

* @return median of the values, null if the list is empty
*/
public static @Nullable BigDecimal median(List<BigDecimal> inputList) {
ArrayList<BigDecimal> bdList = new ArrayList<>(inputList); // Make a copy that will get reordered
Member

Copying locally seems more safe.

Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
@mherwege
Contributor Author

As a general question: Quickselect has an average complexity of O(n) and a worst case of O(n²). The standard Java sort algorithm is a merge sort variant with a guaranteed complexity of O(n log n). Do we really expect so many datapoints that the difference between n and n log n matters? This is surely not the case for group functions, where I would expect the number of elements to be between 10 and 20 in nearly all cases. For persistence it might be different, but does a factor of 2 or so on the average case really outweigh the code complexity?

I agree the difference would be small in the group function, although it is something that has to be recalculated quite often. Updating item states as fast as possible is quite central, so I think time savings there are potentially still relevant.
For the persistence extensions, the impact can potentially be bigger. @jimtng has done some extensive benchmarking, but I don't see the comparison with a straight sort in the discussion anymore. In the last commit (which I can revert when we close the discussion), I inserted some quick and dirty output of timings, comparing stream sort with quickselect. For a list size of 100, I consistently get between 6 and 12 times faster performance with quickselect. The difference becomes bigger with more datapoints. And when retrieving from persistence, the number of datapoints can be much higher than 100.

As median is the only algorithm in the persistence extensions that needs access to the full list (rather than a stream or iterable of values) to calculate, I think the bigger risk could ultimately be memory impact. But avoiding that would again require another type of algorithm (heap based) that starts making approximations beyond a certain number of values. Quickselect is in general faster than merge sort and has a similar memory impact.

Is it worth it? I agree it is debatable. I was concerned about having to do a complex operation on a large list of values and came up with this, which works, has tests, and performs better. I don't think it is a lot of code, but there is of course a maintenance consequence to it.

@jimtng
Contributor

jimtng commented Aug 19, 2024

I wrote a basic benchmark using JMH. I've included the standard stream sort() to compare. It operates on 100 random data points (99 for the odd set).

Benchmark                                                         Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectForcePreviousOddFirstPivot    thrpt   25  237170.760 ± 32426.204  ops/s
MedianBenchmark.medianByQuickSelectForcePreviouspEvenFirstPivot  thrpt   25  192169.056 ± 31881.002  ops/s
MedianBenchmark.medianBySortingEven                              thrpt   25  107702.711 ±  3461.473  ops/s
MedianBenchmark.medianBySortingOdd                               thrpt   25  107778.953 ±  1670.704  ops/s

So quickselect is about 2x more performant, but it also has a bit of "randomness" in its performance (higher error/variance).

Here's more details on quickselect

Result "org.openhab.MedianBenchmark.medianByQuickSelectForcePreviouspEvenFirstPivot":
  192169.056 ±(99.9%) 31881.002 ops/s [Average]
  (min, avg, max) = (135656.987, 192169.056, 234413.231), stdev = 42560.226
  CI (99.9%): [160288.054, 224050.058] (assumes normal distribution)

Result "org.openhab.MedianBenchmark.medianByQuickSelectForcePreviousOddFirstPivot":
  237170.760 ±(99.9%) 32426.204 ops/s [Average]
  (min, avg, max) = (159266.372, 237170.760, 275340.116), stdev = 43288.054
  CI (99.9%): [204744.557, 269596.964] (assumes normal distribution)

Result "org.openhab.MedianBenchmark.medianBySortingEven":
  107702.711 ±(99.9%) 3461.473 ops/s [Average]
  (min, avg, max) = (89381.755, 107702.711, 114202.386), stdev = 4620.967
  CI (99.9%): [104241.238, 111164.183] (assumes normal distribution)

Result "org.openhab.MedianBenchmark.medianBySortingOdd":
  107778.953 ±(99.9%) 1670.704 ops/s [Average]
  (min, avg, max) = (102813.303, 107778.953, 110791.120), stdev = 2230.342
  CI (99.9%): [106108.250, 109449.657] (assumes normal distribution)

As you can see, even in the worst case encountered in the test run, quickselect is still 30-50% faster.
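For reference, the stream/sort baseline measured above can be sketched as follows (assumed shape; the actual benchmark code is in the linked project):

```java
import java.util.List;

// Sketch of the O(n log n) sorting baseline the quickselect variants
// are benchmarked against; method name assumed, not taken from the PR.
public class SortBaseline {

    public static double medianBySorting(List<Double> values) {
        List<Double> sorted = values.stream().sorted().toList(); // full sort
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}
```

Its throughput is much more stable (low error) because sorting does the same work regardless of pivot luck, which matches the tight variance of the medianBySorting rows above.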

@J-N-K J-N-K added the enhancement An enhancement or new feature of the Core label Aug 24, 2024
Member

@J-N-K J-N-K left a comment

Thanks!

@J-N-K J-N-K merged commit b63fa47 into openhab:main Aug 24, 2024
5 checks passed
@J-N-K J-N-K added this to the 4.3 milestone Aug 24, 2024
@mherwege mherwege deleted the persistence_ext_median branch August 24, 2024 09:51
@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/jruby-scripting-official-helper-library/145072/22

Successfully merging this pull request may close these issues.

Calculate the median of persistent data.