-
-
Notifications
You must be signed in to change notification settings - Fork 422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
median action in persistence extensions #4345
Conversation
Signed-off-by: Mark Herwege <[email protected]>
I agree that the group function should use the same code so we don't duplicate code. I came up with a slightly different implementation. I'm still working on writing a benchmark using jmh. I'll plug in your version too and see. Isn't that what copilot suggested? I was also thinking about not having to call quick select twice for even numbered data, but haven't yet tried to figure out how. |
No, didn’t use copilot, but used the links provided and another Python implementation as reference. Mimicking that would copy the array too many times though and I wanted to avoid copying.
I had a similar thought, but could’t figure it out either. The challenge is the resulting list is still not sorted, although the number of iterations over it should be less. |
Very interesting! My VSCode copilot originally suggested almost the exact code that you submitted here. I didn't like it because of all the swaps it's doing. I'm curious to see the benchmark results, but I've gotta go out all day today, so can't play with it yet :( |
Yes, once quickselect has tossed the lower half in looking for k, we only need to look for k+1 in the upper half. |
It’s k-1 and in the lower part you need to look I think. Actually, you can also search for the max value in that lower part as the second value. But another quickSelect on half of the list may be more efficient as the lower part may already have some sorting. That’s not predictable though, while it is for max. |
I first coded something without the swaps, but then it creates copies of arrays and subarrays (sometimes using streams, making the resulting list immutable in most cases. I think it is much less memory efficient and adds extra overhead creating and garbage collecting objects. Copilot probably got this from github. One of the links I provided as reference in the code comments (which I mostly used) was from a github gist. |
Signed-off-by: Mark Herwege <[email protected]>
@jimtng I put in an improvement to limit searching to part of the list for the second value. |
I've run some benchmarks, and your quickselect + swap is the winner! Stream/sorted isn't a slouch either, but the quickselect is definitely faster. My version with partitioning did worse than stream/sorted, perhaps it's doing a lot of list insert during the partitioning process. One thing that can be tweaked in your implementation is instead of using a random pivot, just pick The result is the throughput (operations/s), the higher the better.
Here's that data with the min/avg/max/stddev.
Suggestions:
|
Signed-off-by: Mark Herwege <[email protected]>
@jimtng Great. I will work on that. Notice I did some further improvements in the mean time to avoid running through quickSelect twice, keeping track of the one but value and putting it in the k-1 place when partioning. There are now also tests specifically testing the algorithm.
Makes sense, the random function is probably too expensive.
The algorithm can be used for percentiles as well, but we can then extend the class if needed. For testing, I may opt to keep the method with package visibility. |
Here's the benchmark project / code, in case you want to play with it / change it around: |
Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
@jimtng The class is now external. The PR checks seem to be messed up at the moment, but I think it should be OK. |
Core build with this PR succeeded. For some reason unknown to me, the CI builds failed due to timeouts and the DCO check never finished (I believe all commits are signed). As the core build succeeds, this looks like a build infrastructure issue rather than a code issue. |
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/feature-request-median/157728/25 |
* @return median of the values, null if the list is empty | ||
*/ | ||
public static @Nullable BigDecimal median(List<BigDecimal> inputList) { | ||
ArrayList<BigDecimal> bdList = new ArrayList<>(inputList); // Make a copy that will get reordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we could skip this step, and document that this method will mutate the inputList? Then if the caller doesn't want its list mutated, the caller needs to create a copy of their list before passing to median()? This avoid having to allocate the list twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could. But it would also force us to check the provided inputList is mutable. Copying it makes sure of that. Also what happens if the input list is changed in another thread?
Copying it avoids all these problems. There is indeed a overhead, so it is worth considering, but I am not sure at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying locally seems more safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general question: Quickselect has an average complexity of O(n) and a worst case of O(n²). The standard java sort algorithm is merge sort with a constant complexity of O(n log n). Do we really expect so many datapoints that the difference between n and n log n matters? This is surely not the case for group functions, I would expect the number of elements to be between 10 and 20 in nearly all cases. For persistence it might be different, but does a factor of 2 or so on the average case really outweigh code complexity?
|
||
private static @Nullable State internalMedianBetween(Item item, @Nullable ZonedDateTime begin, | ||
@Nullable ZonedDateTime end, @Nullable String serviceId) { | ||
String effectiveServiceId = serviceId == null ? getDefaultServiceId() : serviceId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String effectiveServiceId = serviceId == null ? getDefaultServiceId() : serviceId; | |
String effectiveServiceId = Objects.requireNonNullElse(serviceId, getDefaultServiceId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work, getDefaultServiceId() is Nullable as well.
return null; | ||
} | ||
ZonedDateTime now = ZonedDateTime.now(); | ||
ZonedDateTime beginTime = begin == null ? now : begin; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ZonedDateTime beginTime = begin == null ? now : begin; | |
ZonedDateTime beginTime = Objects.requireNonNullElse(begin, now); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
} | ||
ZonedDateTime now = ZonedDateTime.now(); | ||
ZonedDateTime beginTime = begin == null ? now : begin; | ||
ZonedDateTime endTime = end == null ? now : end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ZonedDateTime endTime = end == null ? now : end; | |
ZonedDateTime endTime = Objects.requireNonNullElse(end, now); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Item baseItem = item; | ||
if (baseItem instanceof GroupItem groupItem) { | ||
baseItem = groupItem.getBaseItem(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Item baseItem = item; | |
if (baseItem instanceof GroupItem groupItem) { | |
baseItem = groupItem.getBaseItem(); | |
} | |
Item baseItem = item instanceof GroupItem groupItem ? groupItem.getBaseItem() : item; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
if (baseItem instanceof GroupItem groupItem) { | ||
baseItem = groupItem.getBaseItem(); | ||
} | ||
Unit<?> unit = baseItem instanceof NumberItem numberItem ? numberItem.getUnit() : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case baseItem
is not NumberItem
: shouldn't the method log a warning and return immediuately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it will not work for DimmerItem, RollershutterItem ... anymore. In theory, also SwitchItem would work, but probably not very meaningful.
None of the other persistence extension methods have this check, so I don't see why I would introduce it here.
@NonNullByDefault | ||
public class Statistics { | ||
|
||
public static boolean randomQuickSelectSeed = false; // Can be enabled to always create a random pivot index for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the benefit if that? I can't imagine a use-case for where random pivots would be beneficial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was easy for testing the algorithm, and allowed to stay close to the textbook algorithm. Removed.
* @return median of the values, null if the list is empty | ||
*/ | ||
public static @Nullable BigDecimal median(List<BigDecimal> inputList) { | ||
ArrayList<BigDecimal> bdList = new ArrayList<>(inputList); // Make a copy that will get reordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying locally seems more safe.
Signed-off-by: Mark Herwege <[email protected]>
Signed-off-by: Mark Herwege <[email protected]>
I agree the difference would be small in the group function, although it is something that has to be recalculated quite often. Updating item states as fast as possible is quite central, so I think time savings there are potentially still relevant. As median is the only algorithm in the persistence extensions that needs access to the full list (and not a stream of values or iterable) to be able to calculate, I think, ultimately the bigger risk could become memory impact. But avoiding that would again require another type of algorithm (heap based), that starts making approximations beyond a certain number of values. Quickselect in general is faster than merge sort, and has a similar memory impact as merge sort. Is it worth it? I agree it is debateable. I was concerned about having to do a complex operation on a large list of values and came up with this, which works, has tests and performs better. I don't think it is a lot of code, but there is of course a maintenance consequence to it. |
I wrote a basic benchmark using jmh. I've included the standard stream sort() to compare. It operates on 100 random data points (99 for odd set)
So the Quick select is about 2x more performant, but it also has a bit of "randomness" in its performance (higher error / variance) Here's more details on quickselect
As you can see, the worst case encountered in the test run, quickselect is still 30-50% faster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/jruby-scripting-official-helper-library/145072/22 |
Closes #4342
@jimtng FYI The median calculation uses a quickSelect algorithm. Would it make sense to put that somewhere in a utils class to be reusable for group median calculations? I just don't know where to put it.