
Adds a new auto-interval date histogram #26659

Closed · wants to merge 7 commits
Conversation

@colings86 (Contributor) commented Sep 15, 2017

This change adds a new type of histogram aggregation called `auto_date_histogram` where you specify the target number of buckets you require and it finds an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; once it has created more than the target number of buckets, it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support `min_doc_count`, `offset`, and `extended_bounds`, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. `min_doc_count` is effectively hard-coded to zero, meaning we will insert empty buckets where necessary.

This is also the first aggregation to merge buckets at collection time; having it will open the door to other aggregations that merge buckets at collection time.
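For illustration, here is a minimal, self-contained sketch of the merge-up idea described above. It is plain Java with a `HashMap` standing in for the PR's `Rounding`/`LongHash`/`BigArrays` machinery; the class name and the interval list are assumptions, not the PR's code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Sketch only: collect at the finest interval and, when the bucket count
// exceeds the target, re-round every existing key with the next coarser
// interval. The real aggregation works with Rounding, LongHash and BigArrays.
class AutoIntervalSketch {
    // Coarsening steps in millis: second, minute, hour, day (the PR goes up to years).
    private static final long[] INTERVALS = {
            TimeUnit.SECONDS.toMillis(1), TimeUnit.MINUTES.toMillis(1),
            TimeUnit.HOURS.toMillis(1), TimeUnit.DAYS.toMillis(1)};

    private final int targetBuckets;
    private int roundingIdx = 0;
    private Map<Long, Integer> docCounts = new HashMap<>(); // rounded key -> doc count

    AutoIntervalSketch(int targetBuckets) {
        this.targetBuckets = targetBuckets;
    }

    void collect(long timestampMillis) {
        docCounts.merge(round(timestampMillis), 1, Integer::sum);
        // Over the target: merge existing buckets into the next coarser interval,
        // stopping once the coarsest interval is reached.
        while (docCounts.size() > targetBuckets && roundingIdx < INTERVALS.length - 1) {
            roundingIdx++;
            Map<Long, Integer> merged = new HashMap<>();
            docCounts.forEach((key, count) -> merged.merge(round(key), count, Integer::sum));
            docCounts = merged;
        }
    }

    private long round(long millis) {
        return millis - Math.floorMod(millis, INTERVALS[roundingIdx]);
    }
}
```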

Things still to do:

  • Write documentation on the new aggregation and its parameters
  • Write more tests, specifically around having sub-aggregations and checking they are merged correctly in the reduce step
  • Investigate whether we could collect 10 * target buckets on the shards and then use the extra buckets to build buckets representing multiples of the interval, for the cases where one rounding produces too many buckets but the next rounding up collapses them to only one (or a few). For example, with a target of 10 buckets and 30 minute-level buckets collected, increasing the rounding would yield a single hour-level bucket; instead we could merge every 3 minute-level buckets to get 10 x 3-minute buckets. This will probably be moved out into a separate issue to be tackled once this first pass is merged.
  • Potentially optimise the logic a bit, as I'm sure there is room for improvement.
  • Add an equivalent auto_histogram aggregation to work on numeric (rather than date) fields - again this might be moved into a separate issue
  • Add a test to ensure that if a user requests more than 10k buckets, we return an error.
  • Fix doc tests

Closes #9572

@colings86 (Author) commented:

@jpountz this is still WIP but I think most of the main bits are there, so it would be great to get a review to make sure it's going in the right direction.

colings86 removed the WIP label Sep 26, 2017
@jpountz (Contributor) left a comment:

It looks good to me overall. One thought I had when reviewing this PR is that you used deferring in order not to have to implement merging on all aggregations, but since histograms return all buckets all the time anyway, deferring is more likely to increase memory usage than to decrease it? Also should it be experimental for now?

public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
    try (IntArray oldDocCounts = docCounts) {
        docCounts = bigArrays.newIntArray(newNumBuckets, true);
        docCounts.fill(0, newNumBuckets, 0);
jpountz: this is not needed since you passed `true` on the previous line.
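Applied, the hunk would reduce to something like this sketch (the remapping tail of the method is elided, as in the quoted hunk):

```java
public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
    try (IntArray oldDocCounts = docCounts) {
        // Passing true already requests a cleared array (per the review
        // comment above), so the explicit fill(0, newNumBuckets, 0) can be dropped.
        docCounts = bigArrays.newIntArray(newNumBuckets, true);
        // ... remap oldDocCounts into docCounts via mergeMap ...
    }
}
```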

            throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName() + " must be greater than 0 for [" + name + "]");
        }
        this.numBuckets = numBuckets;
        return this;
jpountz: it should probably be > 1 actually, no? A single bucket is not useful for a histogram.

jpountz: maybe add a soft limit for the number of buckets too, something like 1000?
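Taken together, the two suggestions would make the validation look roughly like this. A sketch only: the enclosing builder class name and the exact 1000 cap are assumptions, not settled API:

```java
public AutoDateHistogramAggregationBuilder setNumBuckets(int numBuckets) {
    // A single bucket is not a useful histogram, so require at least 2,
    // and apply a soft cap (1000 here) on how many buckets may be requested.
    if (numBuckets <= 1) {
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName()
                + " must be greater than 1 for [" + name + "]");
    }
    if (numBuckets > 1000) {
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName()
                + " must be less than or equal to 1000 for [" + name + "]");
    }
    this.numBuckets = numBuckets;
    return this;
}
```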

private final Rounding[] roundings;
private int roundingIdx = 0;

private LongHash bucketOrds;
jpountz: since the number of buckets is contained anyway, we might want to use a LongToIntHashMap instead, which has less overhead.
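Assuming the reviewer means something like hppc's `LongIntHashMap` (a primitive-specialized map Elasticsearch already bundles), the ordinal bookkeeping would look roughly like this sketch:

```java
import com.carrotsearch.hppc.LongIntHashMap;

// Sketch: a primitive long -> int map in place of LongHash. Since the bucket
// count stays close to targetBuckets, dense int ordinals are cheap to hand out.
private final LongIntHashMap bucketOrds = new LongIntHashMap();

private int ordFor(long roundedKey) {
    int ord = bucketOrds.getOrDefault(roundedKey, -1);
    if (ord < 0) {
        ord = bucketOrds.size(); // next dense ordinal
        bucketOrds.put(roundedKey, ord);
    }
    return ord;
}
```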

        collectExistingBucket(sub, doc, bucketOrd);
    } else {
        collectBucket(sub, doc, bucketOrd);
        while (bucketOrds.size() > targetBuckets) {
jpountz: I think we also need to make sure we are not on the last level of rounding already? Otherwise passing numBuckets = 2 and 3 dates that are in 3 different centuries could cause out-of-bounds exceptions?
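A guard along these lines (a sketch using the field names from the hunk above) would stop the loop at the coarsest rounding:

```java
// Only merge up while a coarser rounding is still available; with
// numBuckets = 2 and dates spread over 3 different centuries we would
// otherwise run past the end of the roundings array.
while (bucketOrds.size() > targetBuckets && roundingIdx < roundings.length - 1) {
    increaseRounding();
}
```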

    } else {
        collectBucket(sub, doc, bucketOrd);
        while (bucketOrds.size() > targetBuckets) {
            increaseRounding();
jpountz: this operation does two things:

  • compute the merge map (cheap, since the number of buckets is small)
  • recompute the buffered buckets (potentially expensive, since the number of docs is not bounded)

Yet the merge map alone is enough to know how many buckets we will have in the end, so I think it should do something like this instead. I'm unsure how often it would be applied in practice, but it doesn't make things more complex and would be safer?

if (bucketOrds.size() > targetBuckets) {
  long[] mergeMap;
  do {
    mergeMap = increaseRounding();
  } while (mergeMap.length > targetBuckets);
  // now renumber buckets
}
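For illustration, computing the merge map without rewriting any buffered documents could look roughly like this. `computeMergeMap` is a hypothetical helper, and it relies on `LongHash.add()` returning `-1 - ord` when the key already exists:

```java
// Hypothetical helper: map every existing bucket ordinal to its ordinal
// under the coarser rounding. No buffered document is touched here; the
// single renumbering pass happens once, after the loop above settles.
private long[] computeMergeMap(Rounding coarserRounding, LongHash newBucketOrds) {
    long[] mergeMap = new long[(int) bucketOrds.size()];
    for (int i = 0; i < mergeMap.length; i++) {
        long newKey = coarserRounding.round(bucketOrds.get(i));
        long ord = newBucketOrds.add(newKey);
        mergeMap[i] = ord < 0 ? -1 - ord : ord; // add() returns -1 - ord for existing keys
    }
    return mergeMap;
}
```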

@colings86 (Author) commented:

Closing in favour of #28993 (same code, but with the branch living on origin so @pcsanwald and I can work on it together).

colings86 closed this Mar 12, 2018
malept added a commit to malept/elasticsearch that referenced this pull request Jun 27, 2018
elastic#26659 was closed, not merged. (The followup PR didn't make it.)
colings86 removed the v6.3.0 label Jun 28, 2018
colings86 removed the v7.0.0 label Jun 28, 2018
colings86 pushed a commit that referenced this pull request Jun 28, 2018
#26659 was closed, not merged. (The followup PR didn't make it.)
colings86 pushed a commit that referenced this pull request Jun 28, 2018
#26659 was closed, not merged. (The followup PR didn't make it.)