
[Segment Replication] Update shard allocation to evenly distribute primaries. #5240

Closed
3 of 4 tasks
Tracked by #5147
mch2 opened this issue Nov 14, 2022 · 9 comments
Labels: distributed framework, enhancement, v2.6.0

Comments

@mch2
Member

mch2 commented Nov 14, 2022

With segment replication, primary shards will use more node resources than replicas. While this is still a net reduction compared to docrep, it will lead to uneven resource utilization across the cluster.

We need to explore updating allocation, when segrep is enabled, to evenly balance primary shards across the cluster.

@dreamer-89
Member

Looking into it

@dreamer-89
Member

dreamer-89 commented Jan 4, 2023

Tried a quick exercise to observe the existing shard allocation behavior. All shards were distributed evenly, but primaries alone were not. This seemed due to the existing EnableAllocationDecider, but even after enabling rebalancing based on primaries the result did not change. Digging more into the existing shard allocation mechanism.

Created 6 indices with SEGMENT as the replication type:

{
  "settings": {
    "index": {
      "number_of_shards": 5,  
      "number_of_replicas": 1,
      "replication.type": "SEGMENT" 
    }
  }
}

Primary shard allocation across nodes remains uneven, even after using the cluster.routing.rebalance.enable setting.

➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-1" | grep " p " | wc -l
      13
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-2" | grep " p " | wc -l
      13
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-0" | grep " p " | wc -l
       9

➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-0"  | wc -l     
      24
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-1"  | wc -l
      23
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-2"  | wc -l
      23

I retried the exercise after enabling rebalancing on primaries but that did not change the result.

{
  "persistent" : {
    "cluster.routing.rebalance.enable": "primaries"
  }
}

dreamer-89 self-assigned this Jan 4, 2023
@dreamer-89
Member

cluster.routing.rebalance.enable gates which shard types can be rebalanced but doesn't govern the actual allocation. The existing allocation is based on shard count (irrespective of primary/replica) and is governed by the WeightFunction inside BalancedShardAllocator.
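
For context, a minimal self-contained sketch of that shard-count-based weighting (simplified, with illustrative names rather than the actual BalancedShardAllocator code). It shows that two nodes with the same total shard count get the same weight regardless of how many of those shards are primaries:

// Simplified sketch of the existing shard-count-based weight; names are illustrative.
public class ExistingWeightSketch {
    static float weight(float theta0, float theta1,
                        int nodeShards, float avgShardsPerNode,
                        int nodeIndexShards, float avgIndexShardsPerNode) {
        final float weightShard = nodeShards - avgShardsPerNode;           // total-shard balance term
        final float weightIndex = nodeIndexShards - avgIndexShardsPerNode; // per-index balance term
        return theta0 * weightShard + theta1 * weightIndex;                // no primary/replica distinction
    }

    public static void main(String[] args) {
        // Two nodes with 23 shards each get the same weight,
        // even if one holds 13 primaries and the other holds 9.
        float wA = weight(0.45f, 0.55f, 23, 23.3f, 4, 3.3f);
        float wB = weight(0.45f, 0.55f, 23, 23.3f, 4, 3.3f);
        System.out.println("weight(A) = " + wA + ", weight(B) = " + wB);
    }
}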

@dreamer-89
Member

dreamer-89 commented Jan 6, 2023

There are a couple of approaches to solve this:

1. Handle Segment replication enabled shards differently

Add new parameters, primaryBalance and primaryIndexBalance, to WeightFunction; they are used in the weight calculation for segment-replication-enabled shards. A rough sketch follows the pros/cons below.

primaryBalance -> number of primary shards on a node
primaryIndexBalance -> number of primary shards of a specific index on a node

Pros

  • Does not change the existing balancing logic and is thus less intrusive

Cons

  • Two different weight calculations in the cluster may not result in an even balance.
  • Complicated logic to collate the balancing of different shard types
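
The sketch mentioned above, purely as an illustration of approach 1 (parameter names and theta values are made up, not actual code): the segrep-only terms make docrep and segrep shards score on different scales, which is the collation problem noted in the cons.

// Hypothetical approach-1 weight: primary-balance terms applied only to
// segment-replication-enabled shards. All names and values are illustrative.
public class ApproachOneSketch {
    static float weight(boolean segRepIndex, float theta0, float theta1, float primaryBalance,
                        float shardDelta, float indexDelta, float primaryDelta) {
        float w = theta0 * shardDelta + theta1 * indexDelta;
        if (segRepIndex) {
            // Extra term only for segrep shards; docrep shards never see it.
            w += primaryBalance * primaryDelta;
        }
        return w;
    }

    public static void main(String[] args) {
        // Identical node state, different index type => different weights to reconcile.
        System.out.println("segrep: " + weight(true, 0.45f, 0.55f, 0.5f, 1.0f, 0.5f, 2.0f));
        System.out.println("docrep: " + weight(false, 0.45f, 0.55f, 0.5f, 1.0f, 0.5f, 2.0f));
    }
}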

2. Change allocation logic for all types of shards

Change the existing WeightFunction for all types of shards. The algorithm introduces another factor, primaryBalance, to ensure even primary shard distribution first, followed by replicas.

Pros:

  • Unifies allocation logic not just for SegRep but also for RemoteStore and more.

Cons:

  • Changing the core logic may cause unintended regressions and end-user behavior changes

[Edit]: Based on the above, approach 2 seems promising as it is simpler and does not handle SegRep indices separately, thus avoiding the collation of two different balancing algorithms. This approach can also be made less intrusive by not enabling the new algorithm by default.

@Bukhtawar
Collaborator

I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards due to custom routing logic.
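
To make this concrete, a rough sketch of the idea (purely illustrative; the dimensions, multipliers, and names are assumptions, not a proposed API): each shard gets a weight composed from several resource dimensions, with coefficients that could later be tuned dynamically.

// Illustrative only: a per-shard weight composed from multiple resource
// dimensions, with coefficients that could be adjusted over time
// (e.g. when an index turns read-only or traffic shifts between shards).
public class ShardWeightSketch {
    record ResourceVector(float compute, float memory, float io, float disk) {}

    static float shardWeight(boolean primary, boolean remoteStore, ResourceVector v,
                             float cCompute, float cMemory, float cIo, float cDisk) {
        float base = cCompute * v.compute() + cMemory * v.memory()
                   + cIo * v.io() + cDisk * v.disk();
        // Hypothetical multipliers for shard role and storage type.
        if (primary) base *= 1.5f;
        if (remoteStore) base *= 0.8f;
        return base;
    }

    public static void main(String[] args) {
        ResourceVector hot = new ResourceVector(0.7f, 0.5f, 0.6f, 0.4f);
        System.out.println("primary, local:  " + shardWeight(true, false, hot, 1f, 1f, 1f, 1f));
        System.out.println("replica, local:  " + shardWeight(false, false, hot, 1f, 1f, 1f, 1f));
    }
}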

@dreamer-89
Member

dreamer-89 commented Jan 11, 2023

I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards due to custom routing logic.

Thanks @Bukhtawar for the feedback and the suggestion. This is definitely useful in the larger scheme of things and something we may need to implement. It will need more thought, discussion, and probably an overhaul of the existing allocation. I am thinking of starting by updating the existing BalancedShardAllocator and tuning the weight function to incorporate segment replication semantics. We can iterate and make the model more robust in a follow-up. Please feel free to open an issue (with more details?) and we can discuss further.

@dreamer-89
Member

dreamer-89 commented Jan 12, 2023

Requirement

Ensure even distribution of primaries (and replicas) of the same index. Replicas also need even distribution because the primaries of an index may do varying amounts of work (hot shards), and so do the corresponding replicas.

Goals

As part of phase 1, targeting the goals below for this task.

  1. Shard distribution resulting in overall balance
  2. Minimal side effects on rebalancing
  3. Ensure no negative impact, such as:
  • Transient or permanently unassigned shards (red cluster).
  • Comparatively higher shard movement post allocation and rebalancing (node join, node drop, etc.).

Assumption

Balancing primary shards evenly for docrep indices does not impact the cluster performance.

Approach

Since the overall goal is uniform resource utilization among nodes containing various types of shards, we need to use the same scale to measure different shard types. Using separate allocation logic for different shard types, working in isolation, will NOT work. I think it makes sense to update the existing weightFunction to factor in primary shards, which solves the use case here and is simpler to start with. This will be the initial work to evolve the existing weight function and incorporate different factors, which might eventually need an allocation logic overhaul. As discussed here, proceeding with approach 2, which introduces a new primary shard balance factor.

Future Improvement

The weight function can be updated to cater to future needs, where different shards carry a built-in weight against an attribute (or weighing factor). E.g. a primary with more replicas (higher fan-out) should have a higher weight compared to other primaries, so that the weight function prioritizes even distribution of those primaries FIRST.
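
As a hedged illustration of that future improvement (the scaling choice and names are assumptions, not part of the POC), a primary's contribution to the node weight could simply be scaled by its replica fan-out:

// Illustrative only: scale a primary shard's contribution by its replica count,
// so high-fan-out primaries are balanced first.
public class FanOutWeightSketch {
    static float primaryContribution(int replicaCount) {
        // 1 + replicaCount is an arbitrary monotonic choice; any increasing
        // function of fan-out expresses the same preference.
        return 1.0f + replicaCount;
    }

    public static void main(String[] args) {
        System.out.println("primary with 1 replica:  " + primaryContribution(1));
        System.out.println("primary with 4 replicas: " + primaryContribution(4));
    }
}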

POC

Tried a POC that introduces a new setting to balance shards based on primary shard count, along with a corresponding update to WeightFunction.

[Edit]: This can still result in skew among segrep shards, i.e. unbalanced primary segrep shards on a node, even though the overall primary shard balance is within the threshold. Thanks @mch2 for pointing this out.

There are two already-existing balance factors. This POC adds a third, PRIMARY_BALANCE_FACTOR_SETTING, whose contribution appears as primaryWeightShard in the code below.

  1. INDEX_BALANCE_FACTOR_SETTING weighs the shard count per node for a given index. This ensures that shards belonging to an index are spread across nodes.
  2. SHARD_BALANCE_FACTOR_SETTING weighs the total shard count per node. This ensures that the total number of shards per node is balanced across the cluster.
  3. PRIMARY_BALANCE_FACTOR_SETTING weighs the primary shard count per node. This ensures that the number of primary shards per node is balanced.
float weight(ShardsBalancer balancer, ModelNode node, String index) {
    final float weightShard = node.numShards() - balancer.avgShardsPerNode();
    final float weightIndex = node.numShards(index) - balancer.avgShardsPerNode(index);
    final float primaryWeightShard = node.numPrimaryShards() - balancer.avgPrimaryShardsPerNode(); // primary balance factor

    return theta0 * weightShard + theta1 * weightIndex + theta2 * primaryWeightShard;
}
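
A quick worked example of the formula above (theta values are illustrative, not actual defaults): with equal total and per-index shard counts, the primary term breaks the tie toward the node holding fewer primaries, since the allocator prefers the node with the lower weight.

// Worked example of the POC weight formula with illustrative theta values.
public class PrimaryWeightExample {
    static float weight(float theta0, float theta1, float theta2,
                        float shardDelta, float indexDelta, float primaryDelta) {
        return theta0 * shardDelta + theta1 * indexDelta + theta2 * primaryDelta;
    }

    public static void main(String[] args) {
        float theta0 = 0.45f, theta1 = 0.55f, theta2 = 0.40f; // illustrative values only
        // Node A: 23 shards, 13 primaries; Node B: 23 shards, 9 primaries; average ~11.7 primaries/node.
        float wA = weight(theta0, theta1, theta2, 0f, 0f, 13 - 11.7f);
        float wB = weight(theta0, theta1, theta2, 0f, 0f, 9 - 11.7f);
        // wA > wB, so node B (fewer primaries) is preferred for the next primary shard.
        System.out.println("weight(A) = " + wA + ", weight(B) = " + wB);
    }
}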

There is no single value of PRIMARY_BALANCE_FACTOR_SETTING that distributes shards satisfying all three balance factors for every possible cluster configuration. This is true even today with only INDEX_BALANCE_FACTOR_SETTING and SHARD_BALANCE_FACTOR_SETTING. A higher value for the primary balance factor may result in primary balance but not necessarily equal shard counts per node. This probably needs some analysis of shard distribution under different constant weight factors. Added a subtask in the description.

@nknize @Bukhtawar @kotwanikunal @mch2 : Request for feedback

@dreamer-89
Member

dreamer-89 commented Feb 4, 2023

For subtask 2 mentioned here

The previous change in #6017 introduces a primary shard balance factor but doesn't differentiate between shard types (docrep vs segrep); the end result is an overall balanced primary shard distribution, but not necessarily a balanced distribution for each shard type. For example, if node removal results in 4 unassigned primary shards (2 docrep and 2 segrep), there is a chance of one node getting both docrep shards while the other gets both segrep shards. Changing this logic to accommodate segrep indices is not required because:

  1. LocalShardsBalancer prioritises primaries first for allocation, which means a balanced distribution across nodes. Also, it is not possible to create multiple indices at once, so it always deals with a single index at a time.
  2. The LocalShardsBalancer logic is meant for newly created indices only. A failed shard is handled by RoutingNodes#failShard, which promotes an in-sync replica copy (if any), while the existing shard allocator (GatewayAllocator) promotes an up-to-date copy of the shard as primary.
  3. The rebalancing logic is still applied post failover (in step 2) to keep the primary shard distribution balanced after [Segment Replication] Introduce primary weight factor for primary shards distribution #6017.

For logging purposes, added an unsuccessful test to mimic a scenario where the LocalShardsBalancer allocation logic is applied here.

@dreamer-89
Member

dreamer-89 commented Feb 6, 2023

Tracking the remaining work in #6210: benchmarking, guidance on the default value, and a single sane default (if possible). Closing this issue.
