
median action in persistence extensions #4345

Merged
merged 8 commits into openhab:main from mherwege:persistence_ext_median
Aug 24, 2024

Conversation

mherwege
Contributor

Closes #4342

@jimtng FYI The median calculation uses a quickSelect algorithm. Would it make sense to put that somewhere in a utils class to be reusable for group median calculations? I just don't know where to put it.

Signed-off-by: Mark Herwege <[email protected]>
@mherwege mherwege requested a review from a team as a code owner August 12, 2024 17:36
@jimtng
Contributor

jimtng commented Aug 12, 2024

I agree that the group function should use the same code so we don't duplicate code.

I came up with a slightly different implementation. I'm still working on writing a benchmark using JMH; I'll plug in your version too and see. Isn't that what Copilot suggested?

I was also thinking about not having to call quick select twice for even numbered data, but haven't yet tried to figure out how.

@mherwege
Contributor Author

Isn't that what copilot suggested?

No, I didn't use Copilot; I used the links provided and another Python implementation as reference. Mimicking that would have copied the array too many times though, and I wanted to avoid copying.

thinking about not having to call quick select twice for even numbered data, but haven't yet tried to figure out how

I had a similar thought, but couldn't figure it out either. The challenge is that the resulting list is still not sorted, although the number of iterations over it should be smaller.

@jimtng
Contributor

jimtng commented Aug 12, 2024

Very interesting! My VSCode copilot originally suggested almost the exact code that you submitted here. I didn't like it because of all the swaps it's doing. I'm curious to see the benchmark results, but I've gotta go out all day today, so can't play with it yet :(

@jimtng
Contributor

jimtng commented Aug 12, 2024

The challenge is that the resulting list is still not sorted, although the number of iterations over it should be smaller.

Yes, once quickselect has tossed the lower half in looking for k, we only need to look for k+1 in the upper half.

@mherwege
Contributor Author

Yes, once quickselect has tossed the lower half in looking for k, we only need to look for k+1 in the upper half.

I think it's k-1, and you need to look in the lower part. Actually, you can also search for the max value in that lower part as the second value. But another quickSelect on half of the list may be more efficient, as the lower part may already have some sorting. That's not predictable though, while it is for max.
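The idea discussed here can be sketched in Java (an illustrative sketch with a first-element pivot, not the PR code): after quickselect has placed the element at index k, every element to its left is less than or equal to it, so for an even-sized list the second middle value is simply the maximum of the lower part.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EvenMedianSketch {

    // Median of a non-empty list; works on a copy so the input is untouched.
    public static double median(List<Double> input) {
        List<Double> list = new ArrayList<>(input);
        int n = list.size();
        int k = n / 2;
        double upper = quickSelect(list, 0, n - 1, k);
        if (n % 2 == 1) {
            return upper;
        }
        // Even case: no second quickselect pass needed. After quickselect,
        // all elements at indices < k are <= list.get(k), so the other
        // middle value is the max of the (still unsorted) lower part.
        double lower = Collections.max(list.subList(0, k));
        return (lower + upper) / 2.0;
    }

    // Classic iterative quickselect with first-element pivot.
    private static double quickSelect(List<Double> list, int left, int right, int k) {
        while (true) {
            if (left == right) {
                return list.get(left);
            }
            int pivotIndex = partition(list, left, right);
            if (k == pivotIndex) {
                return list.get(k);
            } else if (k < pivotIndex) {
                right = pivotIndex - 1;
            } else {
                left = pivotIndex + 1;
            }
        }
    }

    // Lomuto-style partition: pivot ends up in its final sorted position.
    private static int partition(List<Double> list, int left, int right) {
        double pivot = list.get(left); // first element as pivot
        int store = left;
        for (int i = left + 1; i <= right; i++) {
            if (list.get(i) < pivot) {
                store++;
                Collections.swap(list, store, i);
            }
        }
        Collections.swap(list, left, store);
        return store;
    }
}
```

This keeps the in-place swapping that benchmarked well, while the even case costs only one extra linear scan instead of a second quickselect.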

@mherwege
Contributor Author

mherwege commented Aug 13, 2024

Very interesting! My VSCode copilot originally suggested almost the exact code that you submitted here. I didn't like it because of all the swaps it's doing. I'm curious to see the benchmark results, but I've gotta go out all day today, so can't play with it yet :(

I first coded something without the swaps, but that creates copies of arrays and subarrays (sometimes using streams), making the resulting list immutable in most cases. I think it is much less memory efficient and adds extra overhead for creating and garbage collecting objects.

Copilot probably got this from GitHub. One of the links I provided as reference in the code comments (which I mostly used) was from a GitHub gist.

Signed-off-by: Mark Herwege <[email protected]>
@mherwege
Contributor Author

@jimtng I put in an improvement to limit searching to part of the list for the second value.

@jimtng
Contributor

jimtng commented Aug 13, 2024

I've run some benchmarks, and your quickselect + swap is the winner! Stream/sorted isn't a slouch either, but the quickselect is definitely faster. My version with partitioning did worse than stream/sorted, perhaps because it's doing a lot of list inserts during the partitioning process.

One thing that can be tweaked in your implementation: instead of using a random pivot, just pick left (the first item). This results in a significant improvement.

The result is the throughput (operations/s), the higher the better.

Benchmark                                                    Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectWithSwapEvenFirstPivot   thrpt   25  222721.917 ± 37755.022  ops/s
MedianBenchmark.medianByQuickSelectWithSwapEvenRandomPivot  thrpt   25  131946.607 ±  5721.808  ops/s
MedianBenchmark.medianByQuickSelectWithSwapOddFirstPivot    thrpt   25  209490.898 ± 55048.181  ops/s
MedianBenchmark.medianByQuickSelectWithSwapOddRandomPivot   thrpt   25  124596.605 ±  5639.394  ops/s

Here's that data with the min/avg/max/stddev.

Benchmark                                                    Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectWithSwapEvenFirstPivot   thrpt   25  222721.917 ± 37755.022  ops/s
  (min, avg, max) = (155523.695, 222721.917, 292013.368), stdev = 50401.874
MedianBenchmark.medianByQuickSelectWithSwapEvenRandomPivot  thrpt   25  131946.607 ±  5721.808  ops/s
  (min, avg, max) = (122720.078, 131946.607, 143127.277), stdev = 7638.451
MedianBenchmark.medianByQuickSelectWithSwapOddFirstPivot    thrpt   25  209490.898 ± 55048.181  ops/s
  (min, avg, max) = (134181.737, 209490.898, 330302.120), stdev = 73487.746
MedianBenchmark.medianByQuickSelectWithSwapOddRandomPivot   thrpt   25  124596.605 ±  5639.394  ops/s
  (min, avg, max) = (115113.769, 124596.605, 139453.563), stdev = 7528.429

Suggestions:

  • Change random pivot to simply picking the first/left element.
  • Make a class called Statistics (or some other, better name?) and give it a public BigDecimal median(List<BigDecimal>) method
  • The QuickSelect may be a hidden / private class (or method) of this Statistics. What we need is the median, not QuickSelect, unless you can think of other uses?
  • Put this class in org.openhab.core/utils alongside HexUtils, ColorUtils, so it can be used by the group function too.
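The suggested utility could look roughly like this (a hypothetical sketch: the class and method names follow the suggestion above, and a plain sort stands in for the private quickselect internals):

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical shape of the suggested utility class; the real class,
// package, and internal algorithm are decided in the PR.
public class Statistics {

    // Returns the median of the values, or null if the list is empty.
    public static BigDecimal median(List<BigDecimal> inputList) {
        if (inputList.isEmpty()) {
            return null;
        }
        List<BigDecimal> list = new ArrayList<>(inputList); // defensive copy
        Collections.sort(list); // stand-in for the private quickselect
        int n = list.size();
        if (n % 2 == 1) {
            return list.get(n / 2);
        }
        // Even case: average of the two middle values.
        return list.get(n / 2 - 1).add(list.get(n / 2))
                .divide(BigDecimal.valueOf(2), MathContext.DECIMAL128);
    }
}
```

Callers (persistence extensions and group functions alike) would then only depend on `Statistics.median(...)`, keeping quickselect an implementation detail.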

@mherwege
Contributor Author

mherwege commented Aug 13, 2024

@jimtng Great, I will work on that. Note that I made some further improvements in the meantime to avoid running through quickSelect twice, keeping track of the next value below the median and putting it in the k-1 place when partitioning. There are now also tests specifically testing the algorithm.

  • Change random pivot to simply picking the first/left element.

Makes sense, the random function is probably too expensive.

  • The QuickSelect may be a hidden / private class (or method) of this Statistics. What we need is the median, not QuickSelect, unless you can think of other uses?

The algorithm can be used for percentiles as well, but we can then extend the class if needed. For testing, I may opt to keep the method with package visibility.

@jimtng
Contributor

jimtng commented Aug 13, 2024

Benchmark                                                         Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectForcePreviousOddFirstPivot    thrpt   25  186976.530 ± 49105.006  ops/s
MedianBenchmark.medianByQuickSelectForcePreviouspEvenFirstPivot  thrpt   25  203130.562 ± 29159.844  ops/s
MedianBenchmark.medianByQuickSelectWithSwapEvenFirstPivot        thrpt   25  196479.397 ± 69621.600  ops/s
MedianBenchmark.medianByQuickSelectWithSwapOddFirstPivot         thrpt   25  180868.419 ± 33301.633  ops/s

Here's the benchmark project / code, in case you want to play with it / change it around:
https://github.com/jimtng/MedianBenchmark

Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
@mherwege
Contributor Author

@jimtng The class is now external. The PR checks seem to be messed up at the moment, but I think it should be OK.

@mherwege
Contributor Author

The core build with this PR succeeded. For some reason unknown to me, the CI builds failed due to timeouts and the DCO check never finished (I believe all commits are signed). As the core build succeeds, this looks like a build infrastructure issue rather than a code issue.

@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/feature-request-median/157728/25

* @return median of the values, null if the list is empty
*/
public static @Nullable BigDecimal median(List<BigDecimal> inputList) {
ArrayList<BigDecimal> bdList = new ArrayList<>(inputList); // Make a copy that will get reordered
Contributor

Do you think we could skip this step and document that this method will mutate the inputList? Then, if the caller doesn't want its list mutated, it needs to create a copy of its list before passing it to median(). This avoids having to allocate the list twice.

Contributor Author

Yes, we could. But it would also force us to check that the provided inputList is mutable; copying makes sure of that. Also, what happens if the input list is changed in another thread?
Copying avoids all these problems. There is indeed an overhead, so it is worth considering, but I am not sure at the moment.

Member

Copying locally seems more safe.
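The trade-off discussed in this thread can be illustrated with a small hypothetical example: without the defensive copy, an in-place reordering would fail outright on immutable inputs such as `List.of(...)`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo, not PR code: why the defensive copy matters.
public class DefensiveCopyDemo {

    // Reorders a copy, so callers may pass immutable or shared lists.
    public static int middleAfterReorder(List<Integer> input) {
        List<Integer> work = new ArrayList<>(input); // copy: safe for List.of(...)
        Collections.sort(work); // stand-in for the in-place quickselect swaps
        return work.get(work.size() / 2);
    }

    // Demonstrates that an immutable input is accepted thanks to the copy;
    // sorting List.of(...) directly would throw UnsupportedOperationException.
    public static boolean acceptsImmutable() {
        try {
            middleAfterReorder(List.of(3, 1, 2));
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }
}
```

Documenting mutation instead would save one allocation, at the cost of pushing the immutability and thread-safety concerns onto every caller.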

Member

@J-N-K J-N-K left a comment

As a general question: Quickselect has an average complexity of O(n) and a worst case of O(n²). The standard Java sort algorithm is a merge sort variant with a guaranteed complexity of O(n log n). Do we really expect so many datapoints that the difference between n and n log n matters? This is surely not the case for group functions, where I would expect the number of elements to be between 10 and 20 in nearly all cases. For persistence it might be different, but does a factor of 2 or so on the average case really outweigh the code complexity?


private static @Nullable State internalMedianBetween(Item item, @Nullable ZonedDateTime begin,
@Nullable ZonedDateTime end, @Nullable String serviceId) {
String effectiveServiceId = serviceId == null ? getDefaultServiceId() : serviceId;
Member

Suggested change
String effectiveServiceId = serviceId == null ? getDefaultServiceId() : serviceId;
String effectiveServiceId = Objects.requireNonNullElse(serviceId, getDefaultServiceId());

Contributor Author

This won't work, as getDefaultServiceId() is @Nullable as well.
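A small demo of why the suggestion fails (with hypothetical stand-ins for the PR methods): `Objects.requireNonNullElse` throws a `NullPointerException` when its fallback is also null, whereas the ternary simply evaluates to null.

```java
import java.util.Objects;

// Hypothetical stand-ins illustrating the review discussion above.
public class NullFallbackDemo {

    static String getDefaultServiceId() {
        return null; // the real method is @Nullable too
    }

    // The ternary in the PR: may yield null, but never throws.
    static String ternary(String serviceId) {
        return serviceId == null ? getDefaultServiceId() : serviceId;
    }

    // The suggested replacement: throws NPE when both values are null.
    static boolean requireThrows(String serviceId) {
        try {
            Objects.requireNonNullElse(serviceId, getDefaultServiceId());
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

So `requireNonNullElse` only fits when the fallback is guaranteed non-null, as in the `begin`/`end` cases below where `now` is always set.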

return null;
}
ZonedDateTime now = ZonedDateTime.now();
ZonedDateTime beginTime = begin == null ? now : begin;
Member

Suggested change
ZonedDateTime beginTime = begin == null ? now : begin;
ZonedDateTime beginTime = Objects.requireNonNullElse(begin, now);

Contributor Author

OK

}
ZonedDateTime now = ZonedDateTime.now();
ZonedDateTime beginTime = begin == null ? now : begin;
ZonedDateTime endTime = end == null ? now : end;
Member

Suggested change
ZonedDateTime endTime = end == null ? now : end;
ZonedDateTime endTime = Objects.requireNonNullElse(end, now);

Contributor Author

OK

Comment on lines 1696 to 1699
Item baseItem = item;
if (baseItem instanceof GroupItem groupItem) {
baseItem = groupItem.getBaseItem();
}
Member

Suggested change
Item baseItem = item;
if (baseItem instanceof GroupItem groupItem) {
baseItem = groupItem.getBaseItem();
}
Item baseItem = item instanceof GroupItem groupItem ? groupItem.getBaseItem() : item;

Contributor Author

Changed.

if (baseItem instanceof GroupItem groupItem) {
baseItem = groupItem.getBaseItem();
}
Unit<?> unit = baseItem instanceof NumberItem numberItem ? numberItem.getUnit() : null;
Member

In case baseItem is not a NumberItem: shouldn't the method log a warning and return immediately?

Contributor Author

Then it would not work for DimmerItem, RollershutterItem, ... anymore. In theory, SwitchItem would also work, though probably not very meaningfully.
None of the other persistence extension methods have this check, so I don't see why I would introduce it here.

@NonNullByDefault
public class Statistics {

public static boolean randomQuickSelectSeed = false; // Can be enabled to always create a random pivot index for the
Member

What would be the benefit of that? I can't imagine a use-case where random pivots would be beneficial.

Contributor Author

It made testing the algorithm easy and allowed me to stay close to the textbook algorithm. Removed.

* @return median of the values, null if the list is empty
*/
public static @Nullable BigDecimal median(List<BigDecimal> inputList) {
ArrayList<BigDecimal> bdList = new ArrayList<>(inputList); // Make a copy that will get reordered
Member

Copying locally seems more safe.

Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
@mherwege
Contributor Author

As a general question: Quickselect has an average complexity of O(n) and a worst case of O(n²). The standard Java sort algorithm is a merge sort variant with a guaranteed complexity of O(n log n). Do we really expect so many datapoints that the difference between n and n log n matters? This is surely not the case for group functions, where I would expect the number of elements to be between 10 and 20 in nearly all cases. For persistence it might be different, but does a factor of 2 or so on the average case really outweigh the code complexity?

I agree the difference would be small in the group function, although it is something that has to be recalculated quite often. Updating item states as fast as possible is quite central, so I think time savings there are potentially still relevant.
For the persistence extensions, the impact can potentially be bigger. @jimtng has done some extensive benchmarking, but I don't see the comparison with a straight sort in the discussion anymore. In the last commit (which I can revert when we close the discussion), I inserted some quick and dirty output of timings, comparing stream sort with quickselect. For a list size of 100, I consistently get between 6 and 12 times faster performance with quickselect. The difference becomes bigger with more datapoints. And when retrieving from persistence, the number of datapoints can be much higher than 100.

As median is the only algorithm in the persistence extensions that needs access to the full list (rather than a stream or iterable of values) to calculate, I think the bigger risk could ultimately be memory impact. But avoiding that would again require another type of algorithm (heap based) that starts making approximations beyond a certain number of values. Quickselect is in general faster than merge sort and has a similar memory impact.

Is it worth it? I agree it is debatable. I was concerned about having to do a complex operation on a large list of values and came up with this, which works, has tests, and performs better. I don't think it is a lot of code, but there is of course a maintenance consequence to it.

@jimtng
Contributor

jimtng commented Aug 19, 2024

I wrote a basic benchmark using JMH. I've included the standard stream sort() to compare. It operates on 100 random data points (99 for the odd set).

Benchmark                                                         Mode  Cnt       Score       Error  Units
MedianBenchmark.medianByQuickSelectForcePreviousOddFirstPivot    thrpt   25  237170.760 ± 32426.204  ops/s
MedianBenchmark.medianByQuickSelectForcePreviouspEvenFirstPivot  thrpt   25  192169.056 ± 31881.002  ops/s
MedianBenchmark.medianBySortingEven                              thrpt   25  107702.711 ±  3461.473  ops/s
MedianBenchmark.medianBySortingOdd                               thrpt   25  107778.953 ±  1670.704  ops/s

So quickselect is about 2x more performant, but it also has a bit of "randomness" in its performance (higher error/variance).

Here's more details on quickselect

Result "org.openhab.MedianBenchmark.medianByQuickSelectForcePreviouspEvenFirstPivot":
  192169.056 ±(99.9%) 31881.002 ops/s [Average]
  (min, avg, max) = (135656.987, 192169.056, 234413.231), stdev = 42560.226
  CI (99.9%): [160288.054, 224050.058] (assumes normal distribution)

Result "org.openhab.MedianBenchmark.medianByQuickSelectForcePreviousOddFirstPivot":
  237170.760 ±(99.9%) 32426.204 ops/s [Average]
  (min, avg, max) = (159266.372, 237170.760, 275340.116), stdev = 43288.054
  CI (99.9%): [204744.557, 269596.964] (assumes normal distribution)

Result "org.openhab.MedianBenchmark.medianBySortingEven":
  107702.711 ±(99.9%) 3461.473 ops/s [Average]
  (min, avg, max) = (89381.755, 107702.711, 114202.386), stdev = 4620.967
  CI (99.9%): [104241.238, 111164.183] (assumes normal distribution)

Result "org.openhab.MedianBenchmark.medianBySortingOdd":
  107778.953 ±(99.9%) 1670.704 ops/s [Average]
  (min, avg, max) = (102813.303, 107778.953, 110791.120), stdev = 2230.342
  CI (99.9%): [106108.250, 109449.657] (assumes normal distribution)

As you can see, even in the worst case encountered in the test run, quickselect is still 30-50% faster.
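For reference, the stream/sort baseline measured above can be sketched as follows (assumed shape; the actual benchmark code is in the linked project):

```java
import java.util.List;

// Sketch of the O(n log n) sorting baseline the quickselect variants
// are benchmarked against; method name assumed, not taken from the PR.
public class SortBaseline {

    public static double medianBySorting(List<Double> values) {
        List<Double> sorted = values.stream().sorted().toList(); // full sort
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}
```

Its throughput is much more stable (low error) because sorting does the same work regardless of pivot luck, which matches the tight variance of the medianBySorting rows above.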

@J-N-K J-N-K added the enhancement An enhancement or new feature of the Core label Aug 24, 2024
Member

@J-N-K J-N-K left a comment

Thanks!

@J-N-K J-N-K merged commit b63fa47 into openhab:main Aug 24, 2024
5 checks passed
@J-N-K J-N-K added this to the 4.3 milestone Aug 24, 2024
@mherwege mherwege deleted the persistence_ext_median branch August 24, 2024 09:51
@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/jruby-scripting-official-helper-library/145072/22

Successfully merging this pull request may close these issues.

Calculate the median of persistent data.