[Segment Replication & Remote Translog] Back-pressure and Recovery for lagging replica copies #4478

Closed · Tracked by #2194 ... · Fixed by #6674
Bukhtawar opened this issue Sep 10, 2022 · 7 comments
Labels: distributed framework, enhancement

Comments

@Bukhtawar (Collaborator)

Is your feature request related to a problem? Please describe.
Once we enable segment-based replication for an index, we won't be indexing any operations on the replica (it only writes to the translog for durability). A successful write to the translog alone would lead us to assume that the replica is caught up. However, since no indexing operations are applied on replicas other than copying segments on checkpoint refresh, a replica that has not successfully processed a checkpoint for a while (due to shard overload or slow I/O) would still be serving reads.
Currently there are no additional mechanisms (once the translog has been written on the replica) to apply back-pressure on the primary if the replica is slow in processing checkpoints. This is aggravated with remote translog, since there would be no I/O on the replica at all: remote translog writes on the primary handle durability altogether.

Describe the solution you'd like
Need to support mechanisms to apply back-pressure and, as a last resort, fail the replica copy if it is unable to process any further checkpoints beyond a threshold.


@mch2 (Member) commented Feb 1, 2023

@Bukhtawar I've put together a POC that applies some backpressure as replicas fall behind, based on an arbitrary doc count limit. Would like your thoughts:

At a high level, we can update the write path to return data from each replica on its current status. I'm leaning toward a seqNo indicating a replica's "visible" checkpoint. The primary can then store this data and use it to determine whether backpressure should be applied, based on a configurable setting. We also include another setting to configure when pressure is applied based on the % of replicas that are 'stale.' I'm treating a replica as 'stale' if its latest visible seqNo (or, if it is actively syncing, the seqNo it is syncing to) is more than n checkpoints behind the primary. The primary then updates its local state after each copy event completes. I prefer n checkpoints behind to an arbitrary doc count, given that the size of checkpoints can vary.
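
A minimal sketch of that tracking and decision logic, assuming hypothetical class and method names (SegmentReplicationPressureSketch, updateReplicaLag, shouldRejectWrites are illustrations, not the actual OpenSearch implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (illustrative names only, not the actual OpenSearch classes):
 * the primary records how many checkpoints behind each replica's visible state is
 * and rejects new writes once too large a fraction of replicas is stale.
 */
public class SegmentReplicationPressureSketch {

    private final int maxCheckpointsBehind;     // "n": allowed lag before a replica counts as stale
    private final double staleReplicaFraction;  // fraction of stale replicas that triggers rejection

    // checkpoints-behind per replica allocation id, refreshed from write-path responses / copy events
    private final Map<String, Long> checkpointsBehind = new ConcurrentHashMap<>();

    public SegmentReplicationPressureSketch(int maxCheckpointsBehind, double staleReplicaFraction) {
        this.maxCheckpointsBehind = maxCheckpointsBehind;
        this.staleReplicaFraction = staleReplicaFraction;
    }

    // Called when a replica reports its visible checkpoint so the primary can record its lag.
    public void updateReplicaLag(String allocationId, long lagInCheckpoints) {
        checkpointsBehind.put(allocationId, lagInCheckpoints);
    }

    // True if enough replicas are stale that the primary should reject incoming writes.
    public boolean shouldRejectWrites() {
        if (checkpointsBehind.isEmpty()) {
            return false;
        }
        long stale = checkpointsBehind.values().stream()
            .filter(lag -> lag > maxCheckpointsBehind)
            .count();
        return (double) stale / checkpointsBehind.size() >= staleReplicaFraction;
    }
}
```

In this shape, the write path would call updateReplicaLag as replica responses arrive and consult shouldRejectWrites before accepting new operations.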

Some additional thoughts:

  • We should also consider marking shards as stale if replication times out. Right now we fail the replica and recover again if a timeout occurs, but this could hit an infinite loop given the timeout setting & checkpoint size, leaving the shard initializing over and over. At a minimum here, I think we need to ensure that files which have already been copied remain on disk so we can stop & resume recovery/replication.

If going the seqNo route, some detailed steps/tasks (a rough sketch of the rejection hook follows this list):

  • Update IndexShard's ReplicationTracker to store what I'm calling a "visibleCheckpoint" to capture the latest seqNo that a replica has received.
  • Update the replicaResponse on the write path to return the latest visible seqNo.
  • Update ReplicationOperation#updateCheckPoints to update the returned visible seqNo in the primary's replicationTracker.
  • Update IndexShard to store the last published replication checkpoint, and update PublishCheckpointAction to set this value.
  • Create a new field in ShardIndexingPressure, similar to ShardIndexingPressureMemoryManager, which I've called SegmentReplicationIndexingPressureManager; it is responsible for determining whether a request should be rejected on the primary.
  • Implement SegmentReplicationIndexingPressureManager with logic to reject requests if any replica is behind, comparing the latestPublishedCheckpoint on the primary to the visible seqNos available on the other shards. Add a new setting that is used to determine this threshold.
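
A rough sketch of that rejection hook, under the assumption that the primary exposes the seqNo covered by its last published checkpoint and the visible seqNos reported by replicas; the class and method names here are placeholders, not the final implementation:

```java
import java.util.Collection;

/**
 * Hypothetical sketch of the rejection hook described in the last two tasks above;
 * names are placeholders, not the final implementation.
 */
public class SegmentReplicationIndexingPressureManagerSketch {

    private final long maxSeqNoLag; // threshold taken from the proposed new setting

    public SegmentReplicationIndexingPressureManagerSketch(long maxSeqNoLag) {
        this.maxSeqNoLag = maxSeqNoLag;
    }

    /**
     * Called on the primary before accepting a write: latestPublishedSeqNo is the seqNo covered by
     * the last published replication checkpoint, visibleSeqNos are the values replicas reported.
     */
    public void validateWrite(long latestPublishedSeqNo, Collection<Long> visibleSeqNos) {
        for (long visible : visibleSeqNos) {
            if (latestPublishedSeqNo - visible > maxSeqNoLag) {
                throw new IllegalStateException(
                    "rejecting write: replica visible seqNo [" + visible + "] is more than ["
                        + maxSeqNoLag + "] operations behind the last published checkpoint ["
                        + latestPublishedSeqNo + "]");
            }
        }
    }
}
```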

@mch2 (Member) commented Mar 1, 2023

I've put up #6520 to start capturing metrics we can use for applying pressure. This includes checkpoints behind, bytes behind, and some average replication times. I'm leaning toward a combination of breaching thresholds for all three.

I'm debating whether we should apply pressure globally to a node or to individual replication groups. I think rejecting on the entire node could be useful if driven by the sum of current bytes replicas are behind, given that all of them will consume shared resources to copy segments. Though this would mean heavier indices would impact those that are lighter.

I'm leaning toward rejecting within a replication group, given that the purpose of rejection is to allow replicas to catch up, and relying on existing throttle settings & index backpressure to preserve node resources. @dreamer-89, @Bukhtawar curious what you two think here.
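
As a rough illustration of the metrics mentioned above, a hypothetical per-replica stats holder with a "breach all three thresholds" check; the names and the exact combination rule are assumptions, not the final design:

```java
/**
 * Hypothetical sketch of per-replica replication stats combined with a "breach all three" check.
 */
public final class ReplicaReplicationStatsSketch {

    final long checkpointsBehind;
    final long bytesBehind;
    final long currentReplicationTimeMillis;

    ReplicaReplicationStatsSketch(long checkpointsBehind, long bytesBehind, long currentReplicationTimeMillis) {
        this.checkpointsBehind = checkpointsBehind;
        this.bytesBehind = bytesBehind;
        this.currentReplicationTimeMillis = currentReplicationTimeMillis;
    }

    // Only treat this copy as a reason to apply pressure when all three limits are breached.
    boolean breachesAll(long maxCheckpoints, long maxBytes, long maxTimeMillis) {
        return checkpointsBehind > maxCheckpoints
            && bytesBehind > maxBytes
            && currentReplicationTimeMillis > maxTimeMillis;
    }
}
```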

@mch2 linked a pull request Mar 2, 2023 that will close this issue
@mch2 (Member) commented Mar 2, 2023

I'm thinking of this simple algorithm for applying pressure in SR based on replica state (a compact sketch follows below):

  1. On an incoming request to a primary, fetch the SegmentReplicationStats introduced with #6520 (Compute Segment Replication stats for SR backpressure).
  2. Compute the number of stale shards from the fetched data. A shard is stale if it is behind by n checkpoints and has been syncing to a checkpoint for over y mins. Both n and y are configurable as settings; I'm thinking a default n = 4 and a default y = 5 mins. This would be lenient enough not to block when the primary is frequently publishing small checkpoints, and strict enough to prevent replicas from falling too far behind based on time. I think a reasonable addition is to also include total bytes behind as an expert setting.
  3. If an individual shard has been syncing for over 2*y, we mark it for failure so that it is no longer serving reads, meaning that applying our pressure did not help the shard catch up in a reasonable timeframe.
  4. If more than a certain % of the replication group is stale, we block the request only for that replication group. I'm thinking this would default to 50%.

I think this is a reasonable best effort to prevent replicas from falling too far behind until we have a streaming API where we can control our ingestion rate based on these metrics.
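
A compact sketch of the four steps above, using the defaults from the comment (n = 4, y = 5 mins, 50% stale limit); the class, record, and method names are hypothetical:

```java
import java.util.List;

/**
 * Hypothetical sketch of the algorithm above; defaults come from the comment,
 * the names are illustrative only.
 */
public class GroupPressureDecisionSketch {

    static final int MAX_CHECKPOINTS_BEHIND = 4;           // n
    static final long MAX_SYNC_TIME_MILLIS = 5 * 60_000L;  // y = 5 minutes
    static final double STALE_GROUP_LIMIT = 0.5;           // 50% of the replication group

    record ReplicaState(long checkpointsBehind, long currentSyncTimeMillis) {}

    // Step 2: a shard is stale if it is n checkpoints behind AND has been syncing for over y.
    static boolean isStale(ReplicaState s) {
        return s.checkpointsBehind() >= MAX_CHECKPOINTS_BEHIND
            && s.currentSyncTimeMillis() > MAX_SYNC_TIME_MILLIS;
    }

    // Step 3: a shard syncing for over 2*y is marked for failure so it stops serving reads.
    static boolean shouldFail(ReplicaState s) {
        return s.currentSyncTimeMillis() > 2 * MAX_SYNC_TIME_MILLIS;
    }

    // Step 4: block writes to this replication group once too many of its replicas are stale.
    static boolean shouldBlockWrites(List<ReplicaState> group) {
        if (group.isEmpty()) {
            return false;
        }
        long stale = group.stream().filter(GroupPressureDecisionSketch::isStale).count();
        return (double) stale / group.size() > STALE_GROUP_LIMIT;
    }
}
```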

@Bukhtawar (Collaborator, Author)

From what I understand, we are starting with failing the shard first rather than applying backpressure, i.e. disallowing primaries from taking in more write requests and allowing lagging replicas a cool-off. If backpressure isn't helping to alleviate the pain within a bounded interval, we can fail the shard copy, considering it to be the reason for the bottleneck and knowing that we cannot allow replicas to fall too far behind without blocking incoming writes.

@mch2 (Member) commented Mar 6, 2023

> From what I understand, we are starting with failing the shard first rather than applying backpressure, i.e. disallowing primaries from taking in more write requests and allowing lagging replicas a cool-off. If backpressure isn't helping to alleviate the pain within a bounded interval, we can fail the shard copy, considering it to be the reason for the bottleneck and knowing that we cannot allow replicas to fall too far behind without blocking incoming writes.

@Bukhtawar I think we are on the same page here, though your first sentence has it the other way around: we will apply pressure first, and if the replica is not able to cool off & catch up within a bounded interval, it will be failed.

@mch2 (Member) commented Mar 9, 2023

Will add one more issue linked to this one to actually fail replicas. I suggest we have a background task that periodically (every 30s or so) fetches the stats introduced with #6520 and fails any replicas in the replication group that are behind. @Bukhtawar wdyt?
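
A minimal sketch of such a background failer, assuming stand-in interfaces for the stats source and the shard-failure hook (neither matches the actual OpenSearch APIs):

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the suggested background task: every 30s, pull the replication stats and
 * fail any replica in the group that has lagged beyond the failure limit.
 */
public class LaggingReplicaFailerSketch {

    interface StatsSource {          // stand-in for the stats introduced with #6520
        Map<String, Long> currentSyncTimeMillisByReplica();
    }

    interface ReplicaFailer {        // stand-in for the shard-failure hook
        void failReplica(String allocationId, String reason);
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(StatsSource stats, ReplicaFailer failer, long failureLimitMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            stats.currentSyncTimeMillisByReplica().forEach((allocationId, syncTimeMillis) -> {
                if (syncTimeMillis > failureLimitMillis) {
                    failer.failReplica(allocationId, "replica lagging beyond the configured limit");
                }
            });
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```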

@Rishikesh1159 (Member)

Closing this issue, as the last pending task of failing lagging/stale replicas has been merged with this PR: #6850
