OpenSearch Segment Replication [RFC] #1694
Comments
This is a great starting point for segment replication.
A major cost of indexing is merging, and I am sure that the performance gains seen are also due to savings on merge costs. Additionally, merging is a write operation, and if you want to segregate readers and writers at some point, merging belongs with the writers. Agreed that there are ways to do merging on every node, but that defeats the purpose of segment replication. The only downside to merging on the primary would be where the network cost to copy increasingly bigger segments after merging turns out to be more than the CPU cost to merge them; that seems like a poor trade-off to me given current hardware costs. The only exception is cases where replicas are situated in different geographical regions; for such clusters even translog copy would be expensive.
I may have clusters that have a lower read throughput requirement but keep replicas only for redundancy. For such clusters, I would not block refreshes on the primary; I would keep refreshes going faster and read from the primary only. Hence, I would expose point-in-time consistency as an optional setting, and would not even keep it as the default.
Can you elaborate a bit more on the potential issues here? I don't think segment replication breaks any existing guarantee here. Even with logical replication, there is no guarantee of atomicity or ordering during reads. With writes, ordering is maintained since the ordering of translog operations is always determined by the primary and can be easily verified by checking the primary term + sequence number combination. With segment replication as well, the primary assigns a sequence number to each operation, writes it to the Lucene buffer and translog, and then sends the operation to the replica, which then writes it to its translog. Even with logical replication, there is no guarantee that a particular read would contain all documents in order, since Lucene writes to different thread-local buffer objects and each buffer object is flushed independently and then read; hence no ordering guarantee here either. Agreed that asynchronous segments received from the primary could be behind or ahead of translog operations on the replica. However, each commit received from the primary could contain metadata (sequence number and term, both max continuous and max) about the operations contained up to that commit, which could potentially help with determining which translog operations to redrive during failovers. Wondering if you would ever provide hooks to use an external store implementation as a replication mechanism.
Great start indeed @CEHENKLE , just a comment regarding segment merging:
This is not entirely true. Lucene does support parallel search over segments, and on the OpenSearch side we are working on experimental support of that functionality as well (#1500).
Looks like a good first step. Thanks for the doc
Are they based on the assumption that with segment-based replication, we would not replicate operations (transient errors would still be present as long as we replicate)? I guess we would still need to replicate translog operations unless there is a centralised durable store, but that's a separate discussion.
I guess we have sequence ids for total ordering on index operations, which should still be applicable for segment-based replication. The translogs are only used for recovering uncommitted operations, denoted by their sequence ids.
Operations in translogs are committed before the request (containing one or more operations) can be acknowledged back.
Just to add, requests landing independently on the primary and replicas can also see a read skew. Note that this is still possible in document-based replication.
Ooo, cool! :) We were writing about the current state of the released software, but that's a good point.
Yes, that's what we were referring to as "ordering of operations". Here, the two operations are writing to the replica's translog, and copying of segments to the replica. Today, we are guaranteed to write to the translog only after the operation has been successfully processed. This guarantee will not hold with segment replication since copying of segments is asynchronous and decoupled from the translog replication code path.
I think this falls into an area of overlap between this proposal and #1968 :) We'll make sure to take this into consideration when we finalize the design and define the API contracts
Answering the first part of your question: the assertion around availability drops is based on the fact that with segment replication, replicas do not need to re-process the operation, removing the possibility of transient errors. The latter point on the translog was conceptual; with segment replication, we only need a translog per replication group rather than per shard. As you pointed out, this would need a centralized store, which is a separate discussion. For our default implementation, we will continue to use the current translog structure, i.e. a translog stored with each shard.
Covered by my reply to @itiyamas. Hopefully that clears up the point being covered here :)
Agreed. We didn't call it out since this is a possibility with document-based replication as well.
Thanks everyone for your feedback, here is the summary of callouts:
We will keep this RFC issue open for one more week (until 02/18/2022) and then close this discussion. A new issue will be opened next, which would cover the design aspects.
Segment replication should benefit the current snapshot mechanism. Today, in the case of failover, a snapshot is taken of the entire data of the new primary. Since the segment files would be the same, we can continue using the incremental approach. Thoughts?
Hi @sachinpkale. I think this is a reasonable strategy with pushing incremental updates to a remote store as part of #1968. On the replica side we will need to support fetching the incremental diffs directly from the primary or optionally a remote store.
Design Proposal - #2229
@CEHENKLE it looks like the experimental version of this is going into 2.2. Should I create a card in the roadmap for that? |
I'm down. @mch2 do you want to give me some text to explain exactly what's coming in for this release? We could salt in additional cards for future releases. |
RFC - OpenSearch Segment Replication
Overview
The purpose of this document is to propose a new shard replication strategy (segment replication) and outline its tradeoffs with the existing approach (document replication). This is not a design document; it is a high level proposal to garner feedback from the OpenSearch community. It is also pertinent to note that the aim of this change is not to replace document replication with segment replication.
This RFC is a part of ongoing changes to revamp OpenSearch’s storage and data retrieval architecture; as such, it will focus only on replication strategies. Subsequently, we would also like to cover updates to the transaction log (translog) and storage abstraction. As always, if you have ideas on how to improve these components or any others, we welcome your RFCs.
Document Replication
All operations that affect an OpenSearch index (i.e., adding, updating, or removing documents) are routed to one of the index’s primary shards. The primary shard is then responsible for validating the operation and subsequently executing it locally. Once the operation has been completed successfully on the primary shard, the operation is forwarded to each of its replica shards in parallel. Each replica shard executes the operation, duplicating the processing performed on the primary shard. We refer to this replication strategy as Document Replication.
When the operation has completed (either successfully or with a failure) on every replica and a response has been received by the primary shard, the primary shard then responds to the client. This response includes information about replication success/failure to replica shards.
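The write path described above can be summarized in a toy sketch. This is an illustrative Python simulation under assumed names (`Shard`, `replicate_document` are invented for this example); OpenSearch's actual implementation is in Java and far more involved.

```python
# Hypothetical sketch of the document-replication write path: the primary
# executes the operation, then every replica repeats the same work.
class Shard:
    def __init__(self):
        self.docs = {}

    def execute(self, op):
        # Each shard fully re-processes the operation (parse, analyze, index).
        action, doc_id, body = op
        if action == "index":
            self.docs[doc_id] = body
        elif action == "delete":
            self.docs.pop(doc_id, None)

def replicate_document(primary, replicas, op):
    primary.execute(op)                       # 1. validate + execute locally
    results = []
    for replica in replicas:                  # 2. forward to every replica
        try:
            replica.execute(op)               #    redundant processing per replica
            results.append(("ok", replica))
        except Exception as exc:              # transient replica errors surface here
            results.append(("failed", exc))
    return results                            # 3. respond with per-replica status
```

Note how the same CPU work runs once per copy of the shard, which is the redundancy that segment replication removes.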
Segment Replication
In segment replication, we copy Lucene segment files from the primary shard to its replicas instead of having replicas re-execute the operation. Because Lucene uses a write-once segmented architecture (see FAQ # 1), only new segment files need to be copied; the existing ones never change. This mechanism will adapt the near-real-time (NRT) replication capabilities introduced in Lucene 6.0. We can also ensure that segment merging (see FAQ # 2) is only performed by the primary shard. Merged segments can then be copied over to replicas through Lucene’s warming APIs.
Note that this does not remove the need for a transaction log (see FAQ # 3) since segment replication does not force a commit to disk. We will retain the existing translog implementation, though this may present some caveats (see FAQ # 4). As noted in the overview, updates to the translog architecture will be covered by an upcoming RFC.
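To make the "only copy new files" idea concrete, here is a toy sketch in which a segment set is modeled as a filename-to-bytes mapping. The protocol and names (`publish_checkpoint`, `sync_replica`) are invented for illustration; real OpenSearch transfers Lucene files via NRT replication, not this simplified diff.

```python
# Because segment files are write-once, a replica only ever needs to fetch
# files it is missing; files absent from the primary's checkpoint were
# merged away and can be deleted locally.
def publish_checkpoint(primary_files):
    """Primary advertises the set of file names in its latest commit point."""
    return set(primary_files)

def sync_replica(replica_files, primary_files):
    checkpoint = publish_checkpoint(primary_files)
    missing = checkpoint - set(replica_files)
    for name in missing:
        replica_files[name] = primary_files[name]   # simulated network copy
    for name in set(replica_files) - checkpoint:    # dropped by a merge
        del replica_files[name]
    return sorted(missing)                          # what crossed the wire
```

The return value shows the network cost of a sync: only the new (often freshly merged, hence large) files travel, which is the bandwidth trade-off discussed later in this RFC.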
Feedback Requested (examples):
Tradeoffs
Performance
In Document Replication, all replicas must do the same redundant processing as the primary node. This can present a bottleneck, especially when searching runs concurrently with intensive background operations such as segment merging. This is exacerbated on clusters that need many replicas to support high query rates.
Segment Replication avoids this redundant processing since replicas do not need to execute the operation, reducing CPU and memory utilization and improving performance. As an illustrative example, Yelp reported a 30% improvement in throughput after moving to a segment replication based approach.
That said, this improvement in throughput may come at the cost of increased network bandwidth usage. This is because in Segment Replication, we need to copy the segment files from the primary node to all replicas over the wire in addition to managing the translog. Merged segments will also need to be copied from the primary node to all replicas. This presents an increase in data sent over the network between shards compared to Document Replication, where only the original request is forwarded from the primary to replicas. Each replica is then responsible for its own operation processing and segment merging, with no further interaction between shards across the network.
In Segment Replication, there may also be increased refresh latency, i.e., a delay before documents are visible for searching. This is because segment files must first be copied to replicas before the replicas can open new searchers.
As part of the design, we will validate these hypotheses and explore ways to offset any additional increase in bandwidth.
Recovery
Recovery using Document Replication can be slow. When a replica goes down for a while and then comes back up, it must either replay (reindex) all newly indexed documents that arrived while it was down, which can easily be a very large number and take a long time to catch up, or fall back to a costly full index copy. In comparison, Segment Replication typically only needs to copy the new index files.
Failover
Failover (promotion of a replica to a primary) in Segment Replication can be slower than in Document Replication in the worst-case scenario, i.e., when the primary shard fails before segments have been copied over to replicas. In this situation, the replicas will need to fall back to replaying operations from the translog.
Shard Allocation
With Segment Replication, primary nodes perform more work than the replicas. The shard allocation algorithm would need to change to ensure that the cluster does not become imbalanced. On failover, replica nodes may suddenly do more work (if the translog were replayed), causing hotspots. In contrast, Document Replication does not have such complications since all shards perform an identical amount of work, though redundant.
Consistency
Both replication strategies prioritize consistency over availability. Document Replication is more susceptible to availability drops due to transient errors, i.e., when a replica operation throws an exception after the primary shard has successfully processed the operation. However, Segment Replication will require more guardrails to ensure file validity when segment files are copied over. This is because only the primary shard performs indexing and merge operations; if the primary somehow writes a broken file, it would be copied to all replicas. Further, replicas will need mechanisms to verify the consistency of segment files transferred over the network.
Segment Replication also offers point-in-time consistency since it ensures that Lucene indexes (i.e., OpenSearch shards) are identical throughout a replication group (primary shard and all its replicas). This ensures that performance across shards remains consistent. In Document Replication, the primary and replicas operate on and merge segment files independently, so they may be searching slightly different and incomparable views of the index (though the data contained within is identical).
Next steps
Based on the feedback from this RFC, we’ll be drafting a design (link to be added). We look forward to collaborating with the community.
FAQs
1. How is OpenSearch Indexing structured?
An OpenSearch index is composed of multiple shards, each of which is a Lucene index. Each Lucene index is, in turn, made up of one or more immutable index segments. Each segment is essentially a "mini-index" in itself. When you perform a search on OpenSearch, each shard uses Lucene to search across its segments, filter out any deletions, and merge the results from all segments.
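The per-shard search flow above can be sketched as a toy model in Python. Segments are plain lists of documents and deletions are a set of tombstoned ids; all names here are invented for illustration, not Lucene's API.

```python
# A shard-level search: visit each immutable segment, skip tombstoned
# (deleted) documents, and merge the per-segment hits into one result list.
def search_shard(segments, deleted_ids, predicate):
    hits = []
    for segment in segments:            # Lucene examines segments one by one
        for doc in segment:
            if doc["id"] in deleted_ids:    # deletes are only marked, so
                continue                    # they must be filtered at read time
            if predicate(doc):
                hits.append(doc)
    return hits
```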
2. What is segment merging?
Due to Lucene's "write-once" semantics, more and more segment files are created as more and more data is indexed. Also, deleting a document only marks the document as deleted and does not immediately free up the space it occupies. Searching through a large number of segments is inefficient because Lucene must search them in sequence, not in parallel. Thus, from time to time, Lucene automatically merges smaller segments into larger ones to keep search efficient, and to permanently remove deleted documents. Merging is a performance intensive operation.
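A merge can be pictured with a minimal sketch: several small segments become one larger segment, and documents marked deleted are physically dropped in the process. This is a toy illustration; real Lucene merge policies (size tiers, deletes-percentage heuristics) are far more elaborate.

```python
# Merging smaller segments into one larger segment; tombstoned documents
# are permanently removed, reclaiming their space.
def merge(segments, deleted_ids):
    merged = []
    for segment in segments:
        for doc in segment:
            if doc["id"] not in deleted_ids:   # deletes vanish for good here
                merged.append(doc)
    return merged                              # one segment replaces many
```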
3. What is the translog (transaction log)?
OpenSearch uses a Transaction Log (a.k.a. translog) to store all operations that have been performed on the shard but have not yet been committed to the file system. Each shard, whether primary or replica, has its own translog, which is committed to disk every time an operation is recorded.
An OpenSearch flush is the process of performing a Lucene commit and starting a new translog generation. In the event of a crash, recent operations that are not yet included in the last Lucene commit are replayed from the translog when the shard recovers to bring it up to date.
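The translog/flush interplay described above amounts to a small write-ahead log. Below is an in-memory sketch with invented names (`Translog`, `flush`, `recover`); a real translog is fsynced to disk per operation, which this toy version does not model.

```python
# Minimal write-ahead-log model: operations are appended to the translog
# before being acknowledged; a flush performs the "Lucene commit" and starts
# a new translog generation; recovery replays anything after the last commit.
class Translog:
    def __init__(self):
        self.ops = []                   # operations not yet in a commit

    def append(self, seq_no, op):
        self.ops.append((seq_no, op))   # recorded before the request is acked

def flush(translog, committed):
    committed.extend(op for _, op in translog.ops)  # commit to "disk"
    translog.ops = []                               # new translog generation

def recover(translog, committed):
    # After a crash, replay operations recorded after the last commit.
    return committed + [op for _, op in translog.ops]
```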
4. Is there a caveat of keeping the existing translog implementation when Segment Replication is in use?
Yes. Since we now perform two separate operations on replicas (segment replication and translog updates), we cannot guarantee the ordering of these operations together. Thus, there is a possibility that the translog and the segment files on a replica will be out of sync with each other.
Note that with segment replication, there is no need for a translog on replica shards since they receive segment files directly from their primary shards. However, there is still a need for a translog per replication group in case the primary shard becomes unavailable. Strategies to address this will be covered by an upcoming RFC.
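One idea raised in the discussion above is that if each copied commit carries the maximum sequence number of the operations it contains, a promoted replica can replay only the translog operations beyond that point. The helper below is a hypothetical sketch of that selection step; the function name and fields are invented, not part of OpenSearch.

```python
# Given the replica's translog and the max sequence number contained in the
# last segment checkpoint copied from the primary, pick the operations that
# must be replayed on failover (everything newer than the checkpoint).
def ops_to_replay(translog_ops, commit_max_seq_no):
    """translog_ops: list of (seq_no, op) pairs; returns ops past the commit."""
    return [op for seq_no, op in sorted(translog_ops)
            if seq_no > commit_max_seq_no]
```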
5. Will it be possible to change an existing index’s replication strategy?
No, it is not possible to switch an existing index’s replication strategy between Document Replication and Segment Replication. Segment Replication assumes that all shards in a replication group contain identical segment files. This is not a requirement for document replication since each shard creates segments and merges independently.
References