[DISCUSS - Segment Replication] SegRep consistency limitations #8700
Comments
I know why you're discussing remote store and system indexes here, but really these are just specific cases where the subtly different behavior of segrep compared to docrep causes problems. I believe our long term goal should be the removal of docrep and using segrep everywhere instead, and there will be users that run into the same problems as plugins if they have use cases that rely on the read-after-write behavior of get/mget and WAIT_UNTIL.
Thanks for the feedback @andrross. You are right, this is really an overall limitation of segrep and not specific to remote store / system indices. I've trimmed this a bit to reflect that. For the immediate future I think primary-based routing is the simplest solution until we have a streaming API. Will close this issue.
Background:
Segment Replication (SegRep) currently has a known limitation: it does not support the strong read-after-write mechanisms available to Document Replication (DocRep) indices. These mechanisms are a read request with get/multi-get by ID, and writes with RefreshPolicy.WAIT_UNTIL followed by a search. Currently, the only way to achieve a strong read with Segment Replication is to use Preference.Prefer_Primary to route requests to primary shards.
The Problem:
The issue with these mechanisms under SegRep is that they hold resources in memory. GET requires a “version map” to be maintained in the engine that maps doc IDs to their translog locations, while writes with WAIT_UNTIL hold open refresh listeners. With DocRep these resources can be cleared locally with a refresh when a limit is reached. With SegRep only primaries can issue a local refresh to clear these resources, because replicas only refresh after receiving copied segments. This means these structures will continue to fill without bound until segment copy completes.
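To make the asymmetry concrete, here is a minimal sketch (class and method names are illustrative, not OpenSearch internals) modeling why a replica's version map grows without bound under SegRep: the primary can free it with a local refresh, but a replica can only free it once copied segments arrive.

```python
class ShardModel:
    """Toy model of a shard engine's doc-ID -> translog-location version map."""

    def __init__(self, is_primary):
        self.is_primary = is_primary
        self.version_map = {}  # doc ID -> translog location

    def index_doc(self, doc_id, translog_loc):
        self.version_map[doc_id] = translog_loc

    def local_refresh(self):
        # DocRep: any shard may refresh locally to free the map.
        # SegRep: only the primary refreshes on its own; replicas must wait.
        if self.is_primary:
            self.version_map.clear()

    def apply_copied_segments(self):
        # SegRep replica: only a completed segment copy lets it drop the map.
        self.version_map.clear()


primary = ShardModel(is_primary=True)
replica = ShardModel(is_primary=False)
for i in range(1000):
    loc = f"translog:{i}"
    primary.index_doc(i, loc)
    replica.index_doc(i, loc)

primary.local_refresh()
replica.local_refresh()
assert len(primary.version_map) == 0      # primary freed its memory
assert len(replica.version_map) == 1000   # replica still holds everything

replica.apply_copied_segments()
assert len(replica.version_map) == 0      # cleared only after segment copy
```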
Issues where we explored supporting these existing mechanisms for context:
WAIT_UNTIL - #6045
GET/MGET - #8536
A streaming Index API could resolve this limitation by acknowledging a write request only once a certain consistency level has been reached. However, until that exists, and for get requests, I’d like to list some ideas and start the discussion on how to deal with this. Please comment if there is any option I’ve missed. I think Option 1 provides the best shorter-term solution until we have the streaming API for search.
1. Primary shard based reads
Internally route all get/mget requests to primary shards only, if SegRep is enabled and the realtime param is true (the default). More on this option is provided in #8536. We would clearly document that this could hurt performance in read-heavy cases or if the primary is overloaded. Users would need to update any search that currently follows a WAIT_UNTIL write to prefer _primary.
Pros:
Cons:
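The routing rule in Option 1 can be sketched as follows (a minimal sketch with illustrative names; the real change would live in the coordinating node's shard-routing logic): realtime get/mget on a SegRep-enabled index is steered to the primary copy, while everything else keeps the usual copy selection.

```python
def select_shard_copy(copies, segrep_enabled, realtime):
    """Pick which shard copy serves a get/mget request.

    copies: list of dicts like {"id": ..., "primary": bool}, one per
    active shard copy. Under SegRep with realtime=true (the default),
    only the primary is guaranteed a strong read, so route there.
    """
    if segrep_enabled and realtime:
        return next(c for c in copies if c["primary"])
    # Stand-in for the usual round-robin / adaptive replica selection.
    return copies[0]


copies = [{"id": "r0", "primary": False}, {"id": "p0", "primary": True}]
assert select_shard_copy(copies, segrep_enabled=True, realtime=True)["id"] == "p0"
assert select_shard_copy(copies, segrep_enabled=True, realtime=False)["id"] == "r0"
assert select_shard_copy(copies, segrep_enabled=False, realtime=True)["id"] == "r0"
```

The trade-off named above falls out directly: with many replicas, every realtime get concentrates on the single primary copy.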
2. Do nothing
All requests requiring strong reads would require a client-side update to prefer primary shards when segment replication is enabled.
Pros:
Cons:
3. Constrained GET and WAIT_UNTIL requests.
In this approach we would update GET and WAIT_UNTIL by enforcing hard caps to safeguard against memory issues.
Get - For get/mget this means we would need to throttle writes until replicas catch up, so that the memory footprint of the doc-ID-to-translog-location map does not grow unbounded. This is similar to our SegRep backpressure mechanism today, which applies pressure when a replica falls too far behind based on replication lag and checkpoint count. However, it would also include a primary-computed memory threshold.
WAIT_UNTIL - We would put a hard cap on the number of open wait_until writes per shard. In this case we would still support the refresh policy, but rather than solely triggering a local refresh to clear requests, we would track the number of wait_until requests open to replicas and reject writes if this count exceeds a limit.
Pros:
Cons:
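The two caps in Option 3 can be sketched together (a minimal sketch; the class, thresholds, and accounting are illustrative assumptions, not actual OpenSearch settings): writes are rejected when the estimated version-map footprint or the count of open wait_until listeners would exceed its limit, and both are released once replicas catch up via segment copy.

```python
class SegRepWriteLimiter:
    """Toy model of Option 3's hard caps on a SegRep primary."""

    def __init__(self, max_version_map_bytes, max_open_wait_until):
        self.max_version_map_bytes = max_version_map_bytes
        self.max_open_wait_until = max_open_wait_until
        self.version_map_bytes = 0  # estimated doc-ID -> translog map size
        self.open_wait_until = 0    # listeners still waiting on replicas

    def try_write(self, estimated_bytes, wait_until):
        """Return True if the write is admitted, False if throttled/rejected."""
        if self.version_map_bytes + estimated_bytes > self.max_version_map_bytes:
            return False  # memory threshold hit: replicas too far behind
        if wait_until and self.open_wait_until >= self.max_open_wait_until:
            return False  # hard cap on open wait_until requests per shard
        self.version_map_bytes += estimated_bytes
        if wait_until:
            self.open_wait_until += 1
        return True

    def on_replicas_caught_up(self):
        # Segment copy completed: replicas refreshed, resources released.
        self.version_map_bytes = 0
        self.open_wait_until = 0


lim = SegRepWriteLimiter(max_version_map_bytes=100, max_open_wait_until=2)
assert lim.try_write(60, wait_until=True)
assert lim.try_write(30, wait_until=True)
assert not lim.try_write(30, wait_until=False)  # 90 + 30 > 100: throttled
assert not lim.try_write(5, wait_until=True)    # 2 open listeners: rejected
lim.on_replicas_caught_up()
assert lim.try_write(30, wait_until=True)       # admitted again after catch-up
```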