-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Review Cross Cluster Replication compatibility with Segment Replication Enabled #3823
Comments
Tested running CCR plugin (Gradle check/integTests) against all segment replication changes. All the testing to verify this is done on local machine. Setup: Findings: Next Steps: |
/cc : @ankitkala |
Yep. Overall testing looks good to me. Segment replication for replica doesn't break logical cross-cluster-replication. |
Hey @Rishikesh1159 , I had few followup questions.
|
Hey @ankitkala, sorry for late response. Thanks, for point out few things. Actually I realized when running integ test suite of CCR plugin, I didn't have segment replication enabled. There is no setting for segment replication to set it on for all indices by default. We have to manually add index setting There is no way for us now to just enable segrep in one place and run all of integTest suite of CCR plugin. But instead we have to add enable setting in each of integTest/integTest class manually before we run it and then we can succesfully test that CCR works fine with seg-rep enabled. Sorry, for missing this part before, I will start testing with the 2 scenarios you mentioned in above comment. |
Manually testing each integTest is no longer needed after trying Anikit's suggestion of enabling seg-rep as default index setting. |
I think you build OS locally such that Feature flag |
Thanks @ankitkala for suggestion. I tried setting seg-rep as default index type and ran integTests of CCR. There are lot of test cases that are failing and most of test cases are failing with same failures. Here are few failures:
|
Can you also check the cluster logs. |
So I was trying to manually test the CCR with Segment replication on leader. I was able to test out the happy case for the 2 scenarios:
However, there were 2 issues that i observed which I've called out below. But i was able to test out the end to end flow irregardless of these issues. As a next step, I'll verify that all CCR Integration test should pass with SegRep enabled indices. Issue 1. SegRep on system index Since we were creating segment replication enabled indices by default for testing, CCR's system index had issues where read after writes were inconsistent. I was able to get around that by enforcing logical replication on the system index. But hopefully this issues would be resolved by the time we configure Segment as default replication type.
Issue 2. Segment Replication failing on the follower index These are the Segment Replication failures I observed on the follower cluster. I don't think these are side effect of doing cross cluster replication though.
|
Thanks @ankitkala for sharing the error stack trace. I am able to reproduce the errors above by running integ tests I noticed both primary & replica uses ReplicationEngine (extends InternalEngine from OpenSearch) though NRTReplicationEngine is needed on replica shards for SegmentReplication to work. |
Tried to update the engine factory lambda opensearch-project/cross-cluster-replication#486
|
The tests were failing because of Code Branches UsedOpenSearch: dreamer-89@14c48e2 (feature_flag on engine, hard-coding index type to SEGMENT) |
I re-ran the tests with fix above; along with changes captured in above branches. The tests now passed successfully.
@ankitkala : As existing tests seem to work, I will be closing this issue. Please let me know if there are any open issues. |
Great! Thanks @dreamer-89 . Can you also share a snapshot of the test result? |
Thanks @ankitkala. Uploaded the report below. Instance: c5.24xlarge For visibility pasting the report snapshot as well. |
Thanks. Looks good. |
@ankitkala : From segment replication compatibility perspective, I have only ran the integ tests ( Also, I would suggest you want to try out integ and manual tests on your end before we close out this issue. CC @mch2 |
@ankitkala : Just wanted to check if you get a chance to verify CCR and segment replication are compatible and working as expected. |
Hey, apologies for the delayed response. I was able to test out the changes from following branches and CCR tests did run successfully. However I did observed that the replicas on the follower didn't load
|
Thanks @ankitkala for sharing this. As discussed separately, I will try to run Basic Integration Test to repro above. I will update with my findings. |
I think the issue is that when we stop the replication for CCR, we close and reopen the index to reload the Engine. This is leading to the stacktrace above where |
What is the state of the shard during the refresh listener's publish? We should be blocking publish here if the shard is closed? Maybe we need another check on the shard state in the action? |
Apologies for delay on this task; got occupied with Error trace
|
This error appears because when index is closed the NoOpEngine is returned inside IndicesService.java. TransportNodesListGatewayStartedShards transport (used to fetch shard infos) recently updated to fetch ReplicationCheckpoint when index has segment replication enabled. This checkpoint is currently fetched from IndexShard.getLatestReplicationCheckpoint, which asserts engine should be InternalEngine which doesn’t hold true for closed indices. Added a fix after which all tests pass. There are test failures but they are not reproducible when run in isolation. Used latest changes from 2.x. Branches used for repro
Segment replication specific Below test fails on
|
I tried running the tests using OpenSearch |
@mch2 I again tested the compatibility between segrep & CCR.
Changes done to test: |
Since the new changes to follower shard are sent by one of the leader shard randomly (between primary and replicas), I also verified that leader's replica shards are also able to provide the translog operations. We can close this issue now. Few additional changes we'll need to do in future:
|
Is your feature request related to a problem? Please describe.
As part of this issue, we want to identify if Segment Replication enabled cluster have any issues with Cross Cluster Replication. In case there are issues, identify changes needed to make it compatible.
Describe the solution you'd like
-> Run clusters in local with segment replication enabled and test if CCR (Cross Cluster Replication) is working as expected and did not break.
-> Add an integration test with segment replication enabled and CCR, make sure this test passes.
The text was updated successfully, but these errors were encountered: