Cardinality aggregation missing from response inconsistently #64142

Closed
noamkush opened this issue Oct 26, 2020 · 4 comments · Fixed by #64214
Assignees
nik9000
Labels
:Analytics/Aggregations >bug Team:Analytics

Comments

@noamkush

Elasticsearch version: 7.9.1 (reproducible on demo.elastic.co)

Description of the problem including expected versus actual behavior:
The response to the exact same search request sometimes contains the cardinality aggregation and sometimes it doesn't; the issue is transient.

Steps to reproduce:
The following request (on demo.elastic.co):

POST _search?request_cache=false
{
  "aggs": {
    "date_range": {
      "aggs": {
        "unique_users": {
          "cardinality": {
            "field": "agent.name"
          }
        }
      },
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          {
            "from": "2020-09-01T00:00:00.000000+00:00"
          }
        ]
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "agent.id": []
          }
        }
      ]
    }
  }
}

Sometimes results in:

{
  "took" : 962,
  "timed_out" : false,
  "_shards" : {
    "total" : 480,
    "successful" : 480,
    "skipped" : 477,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "date_range" : {
      "buckets" : [
        {
          "key" : "2020-09-01T00:00:00.000Z-*",
          "from" : 1.5989184E12,
          "from_as_string" : "2020-09-01T00:00:00.000Z",
          "doc_count" : 0,
          "unique_users" : {
            "value" : 0
          }
        }
      ]
    }
  }
}

And other times in:

{
  "took" : 444,
  "timed_out" : false,
  "_shards" : {
    "total" : 480,
    "successful" : 480,
    "skipped" : 478,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "date_range" : {
      "buckets" : [
        {
          "key" : "2020-09-01T00:00:00.000Z-*",
          "from" : 1.5989184E12,
          "from_as_string" : "2020-09-01T00:00:00.000Z",
          "doc_count" : 0
        }
      ]
    }
  }
}

Notice the unique_users metric missing from the second response. It doesn't take more than a few tries to get both responses.

noamkush added >bug and needs:triage labels Oct 26, 2020
iverase added :Analytics/Aggregations and removed needs:triage labels Oct 26, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

elasticmachine added the Team:Analytics label Oct 26, 2020
@nik9000
Member

nik9000 commented Oct 26, 2020

This sounds like something I may have broken when I was reworking the timing of aggs. It may be fixed by #64016, which extends a fix I made for a similar-looking bug.

nik9000 self-assigned this Oct 27, 2020
@nik9000
Member

nik9000 commented Oct 27, 2020

I've created a local reproduction and confirmed that it still hits master. This happens when a few things line up:

  1. Some shards map the field that you are running the range on and some do not.
  2. You are querying enough shards to kick in the can_match phase (128).
  3. Your query matches no docs and we recognize this during the can_match phase.
  4. The shards that don't map the field for the range sort before those that do map the field for the range.

All of that causes us to run the "Range.Unmapped" special aggregator, which mistakenly doesn't include its sub-aggs. I'll open a fix for it.
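
One way to check which aggregator implementation a shard actually ran is the search profile API. Here's a sketch reusing the request from the report; the exact type string reported for the unmapped range implementation is an assumption and may vary by version:

# same search as in the report, with profiling enabled
POST _search?request_cache=false
{
  "profile": true,
  "aggs": {
    "date_range": {
      "aggs": {
        "unique_users": {
          "cardinality": {
            "field": "agent.name"
          }
        }
      },
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          {
            "from": "2020-09-01T00:00:00.000000+00:00"
          }
        ]
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "agent.id": []
          }
        }
      ]
    }
  }
}

The aggregator class for each shard that executed shows up under profile.shards[*].aggregations[*].type, so a shard that fell back to the unmapped range implementation should be visible there.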

@nik9000
Copy link
Member

nik9000 commented Oct 27, 2020

It's a bit simpler than that, I just discovered: all you need is for the date_range to target an index that doesn't map the target field. That'll get you the same results. can_match comes up on demo.elastic.co because it has lots of shards, and it makes the results more confusing because, well, some shards do map that field, so you'd expect not to use the unmapped implementation. But can_match kicks in here to help, and we confuse things with a bad aggregator impl.
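
Based on that, a minimal local reproduction could look like the following, where repro-unmapped is a hypothetical index whose mapping deliberately never defines @timestamp, forcing the date_range agg onto the unmapped implementation:

# hypothetical index; @timestamp is deliberately not mapped
PUT repro-unmapped
{
  "mappings": {
    "properties": {
      "message": { "type": "keyword" }
    }
  }
}

# date_range targets the unmapped @timestamp field
POST repro-unmapped/_search
{
  "aggs": {
    "date_range": {
      "aggs": {
        "unique_users": {
          "cardinality": {
            "field": "agent.name"
          }
        }
      },
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          { "from": "2020-09-01T00:00:00.000000+00:00" }
        ]
      }
    }
  }
}

Before the fix, the single range bucket comes back with doc_count 0 and no unique_users entry at all; after it, the sub-agg should be present with value 0.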

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Oct 27, 2020
Now that we're consistently using `can_match` to filter which shards we
run on, we can get this confusing case:
1. You have a search with, say, a range and a sub-agg.
2. That search has a query that `can_match` can recognize will match no
   docs. On *any* shard.
3. So we dutifully run it on a single shard so it can produce the
   "empty" aggs.
4. The shard we pick happens to not have the target of the range mapped.
5. This kicks in the special range aggregator that doesn't collect any
   documents.
6. Before this commit, that range aggregator *also* never produced any
   sub-aggs.

So, without this change, it was quite possible for a search that
happened to match no documents to "throw away" the sub-aggs of a range
and a few other aggs.

We've had this problem for a long, long time but it is more confusing
now because `can_match` is really kicking in and causing us to see cases
where it looks like you are targeting a lot of shards but you really are
only targeting a couple. It used to be that to get the "no sub-aggs"
behavior you had to explicitly target only shards that didn't map the
target field of the `range` agg. And, like, in that case it isn't too
bad because you targeted a sort of degenerate shard. But now that
`can_match` is doing its thing you can end up with the confusing steps
above. It took me several hours to track down what was happening, even
though I know how the individual pieces of all of this work. It took
four hours to figure out how they fit together in this case....

Anyway! This replaces all the aggregator implementations that throw out
the sub-aggregators with ones that keep them. I think this'll be less
confusing in the future.

Closes elastic#64142
nik9000 added a commit that referenced this issue Oct 27, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Oct 27, 2020
…4214)

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Oct 27, 2020
…4214)

nik9000 added a commit that referenced this issue Oct 28, 2020
…64247)

nik9000 added a commit that referenced this issue Oct 29, 2020
…64244)
