
OpenSearch Segment Replication [RFC] #1694

Closed
CEHENKLE opened this issue Dec 10, 2021 · 13 comments
Labels: discuss · distributed framework · enhancement · RFC

Comments

@CEHENKLE (Member) commented Dec 10, 2021

RFC - OpenSearch Segment Replication

Overview

The purpose of this document is to propose a new shard replication strategy (segment replication) and outline its tradeoffs against the existing approach (document replication). This is not a design document; it is a high-level proposal intended to gather feedback from the OpenSearch community. It is also pertinent to note that the aim of this change is not to replace document replication with segment replication.

This RFC is part of ongoing changes to revamp OpenSearch’s storage and data retrieval architecture, but it focuses only on replication strategies. We would subsequently also like to cover updates to the transaction log (translog) and the storage abstraction. As always, if you have ideas on how to improve these components or any others, we welcome your RFCs.

Document Replication

All operations that affect an OpenSearch index (i.e., adding, updating, or removing documents) are routed to one of the index’s primary shards. The primary shard is then responsible for validating the operation and subsequently executing it locally. Once the operation has been completed successfully on the primary shard, the operation is forwarded to each of its replica shards in parallel. Each replica shard executes the operation, duplicating the processing performed on the primary shard. We refer to this replication strategy as Document Replication.

When the operation has completed (successfully or with a failure) on every replica and the primary shard has received each response, the primary responds to the client. This response includes information about replication success or failure on the replica shards.
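
To make the flow above concrete, here is a minimal, hypothetical sketch of the document-replication write path in Java; the `Shard` and `DocumentReplicationGroup` names are illustrative and are not OpenSearch's internal APIs.

```java
// Hypothetical sketch of the document-replication write path described above.
// Names are illustrative; this is not OpenSearch's transport/replication code.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

interface Shard {
    boolean index(String docId, String source); // returns true if the operation succeeded
}

final class DocumentReplicationGroup {
    private final Shard primary;
    private final List<Shard> replicas;

    DocumentReplicationGroup(Shard primary, List<Shard> replicas) {
        this.primary = primary;
        this.replicas = replicas;
    }

    /** Executes the operation on the primary, then re-executes it on every replica in parallel. */
    CompletableFuture<Boolean> index(String docId, String source) {
        if (!primary.index(docId, source)) {
            return CompletableFuture.completedFuture(false); // primary failed: fail the request
        }
        // Every replica repeats the full indexing work (the redundant processing noted above).
        List<CompletableFuture<Boolean>> acks = replicas.stream()
            .map(r -> CompletableFuture.supplyAsync(() -> r.index(docId, source)))
            .collect(Collectors.toList());
        // Respond to the client only after every replica has reported success or failure.
        return CompletableFuture.allOf(acks.toArray(new CompletableFuture[0]))
            .thenApply(ignored -> acks.stream().allMatch(CompletableFuture::join));
    }
}
```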

Segment Replication

In segment replication, we copy Lucene segment files from the primary shard to its replicas instead of having replicas re-execute the operation. Because Lucene uses a write-once segmented architecture (see FAQ 1), only new segment files need to be copied; existing ones never change. This mechanism will adapt the near-real-time (NRT) replication capabilities introduced in Lucene 6.0. We can also ensure that segment merging (see FAQ 2) is performed only by the primary shard. Merged segments can then be copied over to replicas through Lucene’s warming APIs.

Note that this does not remove the need for a transaction log (see FAQ 3), since segment replication does not force a commit to disk. We will retain the existing translog implementation, though this presents some caveats (see FAQ 4). As noted in the overview, updates to the translog architecture will be covered by an upcoming RFC.
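
As a rough illustration of the write-once property that makes this possible (a sketch using plain Lucene Directory APIs, not the Lucene NRT replicator in org.apache.lucene.replicator.nrt that the actual design would adapt), the snippet below copies to a replica only the files from the primary's latest commit that the replica does not already have:

```java
// Sketch only: copy the segment files referenced by the primary's latest commit
// that the replica is missing. Existing files are never re-copied because Lucene
// segment files are write-once.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

final class SegmentCopySketch {

    static void copyMissingSegmentFiles(Directory primary, Directory replica) throws IOException {
        Set<String> presentOnReplica = new HashSet<>(Arrays.asList(replica.listAll()));
        SegmentInfos latestCommit = SegmentInfos.readLatestCommit(primary);
        for (String file : latestCommit.files(true)) {          // true: include the segments_N file
            if (!presentOnReplica.contains(file)) {
                replica.copyFrom(primary, file, file, IOContext.DEFAULT);
            }
        }
        // A real implementation would also verify checksums and publish the new
        // commit point to the replica's searcher; that is omitted here.
    }
}
```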

Feedback Requested (examples):

  1. Are there use cases for which segment replication cannot work? Or, alternatively, do you have use cases where segment replication is the ideal solution?
  2. What consistency and reliability properties would you like to see?
  3. How would you prefer to configure the replication strategy: per index or for the cluster as a whole? (A hypothetical sketch follows this list.)
  4. Does point-in-time consistency matter to you? What use-cases benefit from this? What are the tradeoffs of using segment replication but having each shard perform segment merging independently?
  5. Are there any areas that are not captured by the tradeoffs discussed below?
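
For question 3, here is a purely hypothetical illustration of what per-index versus cluster-wide configuration could look like; neither setting key below exists today, and the final names and granularity would come out of the design discussion.

```java
// Hypothetical settings sketch only; "index.replication.type" and
// "cluster.indices.replication.type" are illustrative keys, not real settings.
import org.opensearch.common.settings.Settings;

final class ReplicationSettingsSketch {

    /** Per-index choice: this index uses segment replication. */
    static Settings perIndexSegmentReplication() {
        return Settings.builder()
            .put("index.replication.type", "SEGMENT")
            .build();
    }

    /** Cluster-wide default: indices use document replication unless overridden. */
    static Settings clusterWideDocumentReplication() {
        return Settings.builder()
            .put("cluster.indices.replication.type", "DOCUMENT")
            .build();
    }
}
```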

Tradeoffs

Performance

In Document Replication, all replicas must do the same redundant processing as the primary node. This can present a bottleneck, especially when searching runs concurrently with intensive background operations such as segment merging. This is exacerbated on clusters that need many replicas to support high query rates.

Segment Replication avoids this redundant processing since replicas do not need to execute the operation, reducing CPU and memory utilization and improving performance. As an illustrative example, Yelp reported a 30% improvement in throughput after moving to a segment replication based approach.

That said, this improvement in throughput may come at the cost of increased network bandwidth usage. This is because in Segment Replication, we need to copy the segment files from the primary node to all replicas over the wire in addition to managing the translog. Merged segments will also need to be copied from the primary node to all replicas. This presents an increase in data sent over the network between shards compared to Document Replication, where only the original request is forwarded from the primary to replicas. Each replica is then responsible for its own operation processing and segment merging, with no further interaction between shards across the network.

In Segment Replication, there may also be an increased refresh latency i.e. delay before documents are visible for searching. This is because segment files must first be copied to replicas before the replicas can open new searchers.

As part of the design, we will validate these hypotheses and explore ways to offset any additional increase in bandwidth.

Recovery

Recovery using Document Replication can be slow. When a replica goes down for a while and then comes back up, it must replay (reindex) all newly indexed documents that arrived while it was down. This can easily be a very large number, so catching up takes a long time, or the replica must fall back to a costly full index copy. In comparison, Segment Replication typically only needs to copy the new index files.

Failover

Failover (promotion of a replica to a primary) in Segment Replication can be slower than in Document Replication in the worst-case scenario, i.e., when the primary shard fails before segments have been copied over to replicas. In this situation, the replicas will need to fall back to replaying operations from the translog.
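
A conceptual sketch (not OpenSearch code) of that worst-case path: the promoted replica replays only the translog operations newer than what its copied segments already contain. The types and names here are illustrative.

```java
// Sketch of translog replay during promotion; all types are illustrative.
import java.util.Comparator;
import java.util.List;

final class FailoverSketch {

    record TranslogOp(long seqNo, String docId, String source) {}

    interface Indexer {
        void index(String docId, String source);
    }

    /** Replays every translog operation above the highest sequence number already covered by copied segments. */
    static void promoteToPrimary(List<TranslogOp> translog, long maxSeqNoInSegments, Indexer indexer) {
        translog.stream()
            .filter(op -> op.seqNo() > maxSeqNoInSegments)   // only ops missing from the last copied commit
            .sorted(Comparator.comparingLong(TranslogOp::seqNo))
            .forEach(op -> indexer.index(op.docId(), op.source()));
    }
}
```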

Shard Allocation

With Segment Replication, primary nodes perform more work than the replicas. The shard allocation algorithm would need to change to ensure that the cluster does not become imbalanced. On failover, replica nodes may suddenly do more work (if the translog is replayed), causing hotspots. In contrast, Document Replication does not have such complications, since all shards perform an identical, albeit redundant, amount of work.

Consistency

Both replication strategies prioritize consistency over availability. Document Replication is more susceptible to availability drops due to transient errors when a replica operation throws an exception after the primary shard has successfully processed the operation. However, Segment Replication will require more guardrails to ensure file validity when segment files are copied over, because only the primary shard performs indexing and merge operations; if the primary somehow writes a broken file, it would be copied to all replicas. Further, replicas will need mechanisms to verify the consistency of segment files transferred over the network.

Segment Replication also offers point-in-time consistency, since it ensures that Lucene indexes (i.e., OpenSearch shards) are identical throughout a replication group (the primary shard and all its replicas). This keeps performance across shards consistent. In Document Replication, the primary and replicas operate on and merge segment files independently, so they may be searching slightly different and incomparable views of the index (though the data contained within is identical).

Next steps

Based on the feedback from this RFC, we’ll be drafting a design (link to be added). We look forward to collaborating with the community.

FAQs

1. How is OpenSearch Indexing structured?
An OpenSearch index is composed of multiple shards, each of which is a Lucene index. Each Lucene index is, in turn, made up of one or more immutable index segments. Each segment is essentially a "mini-index" in itself. When you perform a search on OpenSearch, each shard uses Lucene to search across its segments, filter out any deletions, and merge the results from all segments.
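
To make this structure concrete, here is a small Lucene-only example that opens one shard's index directory and walks its segments (Lucene calls them "leaves"); the path is illustrative.

```java
// Illustration of FAQ 1 using plain Lucene: a shard's index is a set of immutable segments.
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

final class ShardAnatomy {
    public static void main(String[] args) throws IOException {
        // args[0]: path to one shard's Lucene index directory (illustrative).
        try (Directory dir = FSDirectory.open(Path.of(args[0]));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            // Each leaf is one immutable segment; a search runs over all of them
            // and the per-segment results are merged.
            for (LeafReaderContext leaf : reader.leaves()) {
                int liveDocs = leaf.reader().numDocs();
                int deletedDocs = leaf.reader().maxDoc() - liveDocs;
                System.out.printf("segment: %d live docs, %d deleted docs%n", liveDocs, deletedDocs);
            }
        }
    }
}
```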

2. What is segment merging?
Due to Lucene's "write-once" semantics, segment files accumulate as more data is indexed. Also, deleting a document only marks it as deleted and does not immediately free up the space it occupies. Searching through a large number of segments is inefficient because Lucene must search them in sequence, not in parallel. Thus, from time to time, Lucene automatically merges smaller segments into larger ones to keep search efficient and to permanently remove deleted documents. Merging is a performance-intensive operation.
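
For illustration, merging is configured at the Lucene level through a merge policy on the IndexWriter and can also be forced explicitly; under the proposal above, only the primary shard would perform this work. The values below are arbitrary examples.

```java
// Lucene merge configuration sketch; values are arbitrary examples.
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class MergeSketch {
    public static void main(String[] args) throws IOException {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setMaxMergedSegmentMB(512);              // cap the size of merged segments
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
            .setMergePolicy(mergePolicy);                    // background merges follow this policy
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, config)) {
            // ... index documents here ...
            writer.forceMerge(1);                            // explicit merge down to one segment (expensive)
        }
    }
}
```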

3. What is the translog (transaction log)?
OpenSearch uses a transaction log (a.k.a. translog) to store all operations that have been performed on the shard but have not yet been committed to the file system. Each shard, whether primary or replica, has its own translog, which is committed to disk every time an operation is recorded.

An OpenSearch flush is the process of performing a Lucene commit and starting a new translog generation. In the event of a crash, recent operations that are not yet included in the last Lucene commit are replayed from the translog when the shard recovers to bring it up to date.
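
Below is a deliberately simplified, conceptual write-ahead log that mirrors the behaviour described here (append an operation and force it to disk before acknowledging); it is not OpenSearch's real translog implementation.

```java
// Conceptual mini-translog sketch; not OpenSearch's actual translog class.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MiniTranslog implements AutoCloseable {
    private final FileChannel channel;

    MiniTranslog(Path file) throws IOException {
        this.channel = FileChannel.open(file,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Appends one operation and forces it to disk before returning, so it survives a crash. */
    void append(long seqNo, String operation) throws IOException {
        byte[] entry = (seqNo + "\t" + operation + "\n").getBytes(StandardCharsets.UTF_8);
        channel.write(ByteBuffer.wrap(entry));
        channel.force(true);   // durable before the request is acknowledged
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```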

4. Is there a caveat of keeping the existing translog implementation when Segment Replication is in use?
Yes. Since we now perform two separate operations on replicas (segment replication and translog updates), we cannot guarantee the ordering of these operations together. Thus, there is a possibility that the translog and the segment files on a replica will be out of sync with each other.

Note that with segment replication, there is no need for a translog on replica shards since they receive segment files directly from their primary shards. However, there is still a need for a translog per replication group in case the primary shard becomes unavailable. Strategies to address this will be covered by an upcoming RFC.
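
One possible way to reason about reconciling the two, shown only as a sketch and not a committed design: the primary could record the highest sequence number contained in each commit as Lucene commit user data, and a replica could compare that value against its translog to decide which operations still need replaying. The "max_seq_no" key below is illustrative.

```java
// Sketch: read a primary-supplied sequence-number marker from the latest copied commit.
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;

final class TranslogReconciliationSketch {

    /** Returns the "max_seq_no" recorded in the latest copied commit, or -1 if absent. */
    static long maxSeqNoInCopiedSegments(Directory replicaDir) throws IOException {
        Map<String, String> userData = SegmentInfos.readLatestCommit(replicaDir).getUserData();
        String maxSeqNo = userData.get("max_seq_no");   // illustrative key, written by the primary at commit time
        return maxSeqNo == null ? -1L : Long.parseLong(maxSeqNo);
    }
    // Translog operations with a higher sequence number than this value would still need to be
    // replayed (or retained) on the replica; anything at or below it is covered by the segments.
}
```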

5. Will it be possible to change an existing index’s replication strategy?
No, it is not possible to switch an existing index’s replication strategy between Document Replication and Segment Replication. Segment Replication assumes that all shards in a replication group contain identical segment files; this is not a requirement for Document Replication, since each shard creates and merges segments independently.

@itiyamas (Contributor)

This is a great starting point for segment replication.

What are the tradeoffs of using segment replication but having each shard perform segment merging independently?

A major cost of indexing is merging, and I am sure that the performance gains seen are also due to savings on merge costs. Additionally, merging is a write operation, and if you want to segregate readers and writers at some point, merging belongs with the writers. Agreed that there are ways to do merging on every node, but that defeats the purpose of segment replication. The only downside to merging on the primary would be cases where the network cost to copy increasingly bigger segments after merging turns out to be more than the CPU cost to merge the segments; that seems like a poor trade-off to me given the current hardware costs. The only exception is cases where replicas are situated in different geographical regions; for such clusters, even translog copy would be expensive.

Point in time consistency

I may have clusters that have lower read-throughput requirements but keep replicas only for redundancy. For such clusters, I would not block refreshes on the primary; I would keep refreshes going fast and read from the primary only. Hence, I would expose point-in-time consistency as an optional setting, and would not even keep it as the default.

Since we now perform two separate operations on replicas (segment replication and translog updates), we cannot guarantee the ordering of these operations together. Thus, there is a possibility that the translog and the segment files on a replica will be out of sync with each other.

Can you elaborate a bit more on the potential issues here? I don't think segment replication breaks any existing guarantee. Even with logical replication, there is no guarantee of atomicity or ordering during reads. With writes, ordering is maintained, since the ordering of translog operations is always determined by the primary and can easily be verified by checking the primary term + sequence number combination. With segment replication as well, the primary assigns a sequence number to each operation, writes it to the Lucene buffer and the translog, and then sends the operation to the replica, which writes it to its own translog. Even with logical replication there is no guarantee that a particular read contains all documents in order, since Lucene writes to different thread-local buffer objects and each buffer is flushed independently and then read; hence there is no ordering guarantee here.

Agree that segments received asynchronously from the primary could be behind or ahead of the translog operations on the replica. However, each commit received from the primary could contain metadata (sequence number and term, both max continuous and max) about the operations contained up to that commit, which could potentially help with determining which translog operations to redrive during failovers.

Wondering if you would ever provide hooks to use an external store implementation as a replication mechanism.

@reta (Collaborator) commented Dec 13, 2021

Great start indeed @CEHENKLE, just a comment regarding segment merging:

Searching through a large number of segments is inefficient because Lucene must search them in sequence, not in parallel.

This is not entirely true: Lucene does support parallel search over segments, and on the OpenSearch side we are working on experimental support for that functionality as well (#1500).

@Bukhtawar (Collaborator)

Looks like a good first step. Thanks for the doc.

Document Replication is more susceptible to availability drops due to transient errors when a replica operation throws an exception after the primary shard has successfully processed the operation

Note that with segment replication, there is no need for a translog on replica shards since they receive segment files directly from their primary shards. However, there is still a need for a translog per replication group in case the primary shard becomes unavailable

Are they based on the assumption that with segment-based replication we would not replicate operations (transient errors would still be present as long as we replicate)? I guess we would still need to replicate translog operations, unless there is a centralised durable store, but that's for a separate discussion.

Yes. Since we now perform two separate operations on replicas (segment replication and translog updates), we cannot guarantee the ordering of these operations together.

I guess we have sequence IDs for total ordering of index operations, which should still be applicable for segment-based replication. The translogs are only used for recovering uncommitted operations, denoted by their sequence IDs.

Each shard, whether primary or replica, has its own translog, which is committed to disk every time an operation is recorded.

Operations in translogs are committed before the request (containing one or more operations) can be acknowledged.

In Segment Replication, there may also be an increased refresh latency i.e. delay before documents are visible for searching. This is because segment files must first be copied to replicas before the replicas can open new searchers.

Just to add, requests landing independently on the primary and replicas can also see a read skew. Note that this is still possible in document-based replication.

@CEHENKLE (Member, Author)

Great start indeed @CEHENKLE, just a comment regarding segment merging:

Searching through a large number of segments is inefficient because Lucene must search them in sequence, not in parallel.

This is not entirely true: Lucene does support parallel search over segments, and on the OpenSearch side we are working on experimental support for that functionality as well (#1500).

Ooo, cool! :) We were writing about the current state of the released software, but that's a good point.

@kartg (Member) commented Jan 24, 2022

@itiyamas

Agree that segments received asynchronously from the primary could be behind or ahead of the translog operations on the replica.

Yes, that's what we were referring to as "ordering of operations". Here, the two operations are writing to the replica's translog and copying segments to the replica. Today, we are guaranteed to write to the translog only after the operation has been successfully processed. This guarantee will not hold with segment replication, since copying of segments is asynchronous and decoupled from the translog replication code path.

Wondering if you would ever provide hooks to use an external store implementation as a replication mechanism.

I think this falls into an area of overlap between this proposal and #1968 :) We'll make sure to take this into consideration when we finalize the design and define the API contracts.


@Bukhtawar

Are they based on the assumption that with segment-based replication we would not replicate operations (transient errors would still be present as long as we replicate)? I guess we would still need to replicate translog operations, unless there is a centralised durable store, but that's for a separate discussion.

Answering the first part of your question: the assertion around availability drops is based on the fact that with segment replication, replicas do not need to re-process the operation, removing the possibility of transient errors. The latter point on the translog was conceptual: with segment replication, we only need a translog per replication group rather than per shard. As you pointed out, this would need a centralized store, which is a separate discussion. For our default implementation, we will continue to use the current translog structure, i.e., a translog stored with each shard.

I guess we have sequence IDs for total ordering of index operations, which should still be applicable for segment-based replication. The translogs are only used for recovering uncommitted operations, denoted by their sequence IDs.

Covered by my reply to @itiyamas. Hopefully that clears up the point being covered here :)

Just to add, requests landing independently on the primary and replicas can also see a read skew. Note that this is still possible in document-based replication.

Agreed. We didn't call it out since this is a possibility with document-based replication as well.

@anasalkouz (Member)

Thanks, everyone, for your feedback. Here is a summary of the callouts:

  • Merged segments should be in the scope of Segment Replication, since merging contributes a major part of the indexing cost, and this will be a further step toward separating reader and writer JVMs.
  • Point-in-time consistency could be an optional setting, or we could also explore the option of serving critical queries from primary shards only.
  • Segment replication needs a translog per replication group rather than per shard, unless we are using remote storage; there is another proposal to discuss this ([Feature Proposal] Add Remote Storage Options for Improved Durability #1968). We'll make sure to take this into consideration when we finalize the design and define the API contracts.

We will keep this RFC issue open for one more week (until 02/18/2022) and then close this discussion. A new issue covering the design aspects will be opened next.

@kartg (Member) commented Feb 24, 2022

Issue tracking development of design proposal - #2229

Meta issue on POC development - #2194

@sachinpkale (Member)

Segment replication should benefit the current snapshot mechanism. Today, in the case of failover, a snapshot is taken of the entire data of the new primary. As the segment files would be the same, we could continue using the incremental approach. Thoughts?

@mch2 (Member) commented Mar 24, 2022

Hi @sachinpkale. I think this is a reasonable strategy when pushing incremental updates to a remote store as part of #1968. On the replica side, we will need to support fetching the incremental diffs directly from the primary or, optionally, from a remote store.

@kartg (Member) commented Mar 29, 2022

Design Proposal - #2229

@elfisher

@CEHENKLE it looks like the experimental version of this is going into 2.2. Should I create a card in the roadmap for that?

@CEHENKLE (Member, Author)

I'm down. @mch2 do you want to give me some text to explain exactly what's coming in for this release? We could salt in additional cards for future releases.

@elfisher

@mch2 beat us to making the experimental release. 🎉 #3969

I've added it to the roadmap.
