Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Revisit network timeout settings to avoid timeout exceptions #6027

Closed
dreamer-89 opened this issue Jan 26, 2023 · 4 comments
Assignees
Labels
discuss Issues intended to help drive brainstorming and decision making distributed framework enhancement Enhancement or improvement to existing feature or request

Comments

@dreamer-89
Copy link
Member

dreamer-89 commented Jan 26, 2023

Coming from discussion, as part of peer recovery a round of segment replication is performed to ensure the target is upto date with primary. For long running recovery process, it may be possible that segment replication request may time out. This is applicable for both primary and replica recovery process.

Related

#5242
#5313

@mch2
Copy link
Member

mch2 commented Feb 9, 2023

I think all settings in RecoverySettings should be looked at, but a few other node level settings in particular where I think Segrep should have its own config & defaults:

INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING - current default is 40MB/s
INDICES_RECOVERY_MAX_CONCURRENT_FILE_CHUNKS_SETTING - current default is 2.

We currently reuse these from recovery given the recovery code to send files was refactored to be reusable for segrep.

@dreamer-89 dreamer-89 self-assigned this Feb 27, 2023
@dreamer-89 dreamer-89 added discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request and removed bug Something isn't working labels Mar 1, 2023
@dreamer-89
Copy link
Member Author

Segment Replication uses the recovery settings for its operations. It is difficult to segregate these settings from Recovery because existing recovery settings are defined at NodeScope level and some are used to ensure restricted usage of Node's resources. E.g. max_bytes_per_sec (INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING) is shared b/w recovery and segment replication (default 40MB/s) via a RateLimiter object. I think adding a similar new setting for segment replication will be problematic as:

  1. Combined node level max_bytes_per_sec value will be higher than 40MB/s. It is also true the other way round where both continue to use the same setting and complete 40MB/s wouldn't be reserved for recovery alone. This needs to be called out in documentation issue [DOC] Segment Replication - GA release documentation-website#2255
  2. It is confusing to have separate settings controlling same underlying resource (Node here).

The other settings can be refactored into a separate SegmentReplicationSettings class but due to Recovery & SegRep both using common classes & pattern, using single settings makes sense.

Previous Issue opened for timeout: #4392

@dreamer-89
Copy link
Member Author

Addressed concern of using hard-coded constant for identifying the request timeout for fetching segment files in #6523.

This issue was opened to ensure segrep timeouts doesn't result in persistent recovery failures. Setting the max time for fetching files should address the timeout concern.

@dreamer-89
Copy link
Member Author

dreamer-89 commented Mar 2, 2023

Closing this issue as known timeout related issue is addressed in #6523
We can come back to this issue if there are any issues identified during benchmarking to be done post feature complete, tracked in #5147

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issues intended to help drive brainstorming and decision making distributed framework enhancement Enhancement or improvement to existing feature or request
Projects
Status: Done
Development

No branches or pull requests

4 participants