Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic tie-breaking for sorted queries within a PIT #65450

Closed
wants to merge 4 commits into from

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Nov 24, 2020

This change generates a tiebreaker automatically for sorted queries that are executed
under a PIT (point in time reader). This allows to paginate consistently over the matching documents without
requiring to provide a sort criteria that is unique per document.
The tiebreaker is automatically added as the last sort values of the search hits in the response.
It is then used by search_after to ensure that pagination will not miss any documents and that each document
will appear only once.
This commit also allows queries sorted by internal Lucene id (_doc) to be optimized if they are executed
under a PIT the same way than scroll queries.

Closes #56828

This change generates a tiebreaker automatically for sorted queries that are executed
under a PIT (point in time reader). This allows to paginate consistently over the matching documents without
requiring to provide a sort criteria that is unique per document.
The tiebreaker is automatically added as the last sort values of the search hits in the response.
It is then used by `search_after` to ensure that pagination will not miss any documents and that each document
will appear only once.
This commit also allows queries sorted by internal Lucene id (`_doc`) to be optimized if they are executed
under a PIT the same way than scroll queries.

Closes elastic#56828
@jimczi jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.11.0 labels Nov 24, 2020
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Nov 24, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@@ -71,9 +71,9 @@ To get the first page of results, submit a search request with a `sort`
argument. If using a PIT, specify the PIT ID in the `pit.id` parameter and omit
the target data stream or index from the request path.

IMPORTANT: We recommend you include a tiebreaker field in your `sort`. This
tiebreaker field should contain a unique value for each document. If you don't
include a tiebreaker field, your paged results could miss or duplicate hits.
Copy link
Contributor

@mayya-sharipova mayya-sharipova Nov 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a user doesn't use PIT in a sort request? For this case should we still keep this recommendation of adding a tie-breaking field to sort fields?
Or is our official recommendation always use PIT with sort?

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jimczi. A great addition to sort!!! I have 2 main questions just to better understand how things work:

  1. Does PIT always have the same shards and in the same order? Even if PID changes?
  2. What is our story about _asc sort VS _desc sort? Usually when we search in _desc sort we get docs in the opposite order from what we search in _asc sort. With this new tie breaking field this does not happen: docs with equal sort fields come up in the same increasing order of shard/_docId regardless _asc or _desc sort. And this looks unusual. This could matter for example for a backwards pagination, as we don't have search_before, we suggested users to use search_after with a reverse sort; but this strategy will not work for them anymore in tie breaking search as they would be getting the same documents.

private boolean canReturnNullResponseIfMatchNoDocs;
private SearchSortValuesAndFormats bottomSortValues;

private int shardIndex = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make shardIndex field final and always required?

a PIT, the response's `pit_id` parameter contains an updated PIT ID.
a PIT, the response's `pit_id` parameter contains an updated PIT ID and a tiebreaker
is included as the last `sort` values for each hit. This tiebreaker is a unique value for each document that allows
consistent pagination within a `pit_id`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth to add more details about what this tie_breaker is made of, e.g. "tie_breaker is a combination of shard index and internal _doc id." ?

InternalAggregations aggregations, SearchProfileShardResults shardResults, SortedTopDocs sortedTopDocs,
DocValueFormat[] sortValueFormats, int numReducePhases, int size, int from, boolean isEmptyResult) {
// <code>true</code> if the search request uses a point in time reader
final boolean hasPIT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space

@@ -52,6 +52,8 @@
private final TimeValue searchContextKeepAlive;
private final PlainIterator<String> targetNodesIterator;

private int searchShardIndex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the default value be -1 similar to ShardSearchRequest?

if (Sort.RELEVANCE.equals(sort) || shardIndex == fieldDocShard) {
fieldDoc.doc = decodeDocID(tiebreaker);
} else if (shardIndex < fieldDocShard) {
fieldDoc.doc = Integer.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be worth to add more comments explaining the meaning of a certain value for a field doc. e.g.
fieldDoc.doc = Integer.MAX_VALUE; // skipping docs of this shard with equal field values

@jimczi
Copy link
Contributor Author

jimczi commented Dec 1, 2020

Thanks for looking @mayya-sharipova .

Does PIT always have the same shards and in the same order? Even if PID changes?

The order depends on the primary sort so it's consistent for the same sort order. We don't need to use the same ordering for the execution of the shards and the tiebreaker so I opened #65706. That will make the shard index that we use for tie-breaking consistent between requests that target the same shards (even if they change the sort order).

What is our story about _asc sort VS _desc sort? Usually when we search in _desc sort we get docs in the opposite order from what we search in _asc sort. With this new tie breaking field this does not happen: docs with equal sort fields come up in the same increasing order of shard/_docId regardless _asc or _desc sort. And this looks unusual. This could matter for example for a backwards pagination, as we don't have search_before, we suggested users to use search_after with a reverse sort; but this strategy will not work for them anymore in tie breaking search as they would be getting the same documents.

Good catch, I opened #65706 to create a consistent shard index so I am going to close this PR. I'll open a new one that will allow to set the _tiebreaker in the sort criteria directly when #65706 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Virtual Sort field for automatic tie-breaking
4 participants