
[Segment Replication] Update shard allocation to evenly distribute primaries. #5240

Closed
3 of 4 tasks
Tracked by #5147
mch2 opened this issue Nov 14, 2022 · 9 comments
Labels: distributed framework, enhancement, v2.6.0

Comments

@mch2
Member

mch2 commented Nov 14, 2022

With segment replication, primary shards will use more node resources than replicas. While this is still a net reduction compared to docrep, it will lead to uneven resource utilization across the cluster.

We need to explore updating allocation, when segrep is enabled, to evenly balance primary shards across the cluster.

@dreamer-89
Member

Looking into it

@dreamer-89
Member

dreamer-89 commented Jan 4, 2023

Tried a quick exercise to observe the existing shard allocation behavior. All shards were distributed evenly, but primaries alone were not. This seemed due to the existing EnableAllocationDecider, but even after enabling rebalancing based on primaries the result did not change. Digging more into the existing shard allocation mechanism.

Created 6 indices with SEGMENT as the replication type:

{
  "settings": {
    "index": {
      "number_of_shards": 5,  
      "number_of_replicas": 1,
      "replication.type": "SEGMENT" 
    }
  }
}

Primary shard allocation across nodes remains uneven, even after using the cluster.routing.rebalance.enable setting.

➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-1" | grep " p " | wc -l
      13
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-2" | grep " p " | wc -l
      13
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-0" | grep " p " | wc -l
       9

➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-0"  | wc -l     
      24
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-1"  | wc -l
      23
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-2"  | wc -l
      23

I retried the exercise after enabling rebalancing on primaries but that did not change the result.

{
  "persistent" : {
    "cluster.routing.rebalance.enable": "primaries"
  }
}

dreamer-89 self-assigned this Jan 4, 2023
@dreamer-89
Member

cluster.routing.rebalance.enable gates which shard types can be rebalanced but doesn't govern the actual allocation. The existing allocation is based on shard count (irrespective of primary/replica) and is governed by the WeightFunction inside BalancedShardAllocator.
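
For context, a minimal self-contained sketch of that shard-count-based weighting (simplified, with illustrative names rather than the actual BalancedShardAllocator code). It shows that two nodes with the same total shard count get the same weight regardless of how many of those shards are primaries:

// Simplified sketch of the existing shard-count-based weight; names are illustrative.
public class ExistingWeightSketch {
    static float weight(float theta0, float theta1,
                        int nodeShards, float avgShardsPerNode,
                        int nodeIndexShards, float avgIndexShardsPerNode) {
        final float weightShard = nodeShards - avgShardsPerNode;           // total-shard balance term
        final float weightIndex = nodeIndexShards - avgIndexShardsPerNode; // per-index balance term
        return theta0 * weightShard + theta1 * weightIndex;                // no primary/replica distinction
    }

    public static void main(String[] args) {
        // Two nodes with 23 shards each get the same weight,
        // even if one holds 13 primaries and the other holds 9.
        float wA = weight(0.45f, 0.55f, 23, 23.3f, 4, 3.3f);
        float wB = weight(0.45f, 0.55f, 23, 23.3f, 4, 3.3f);
        System.out.println("weight(A) = " + wA + ", weight(B) = " + wB);
    }
}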

@dreamer-89
Member

dreamer-89 commented Jan 6, 2023

There are a couple of approaches to solve this:

1. Handle Segment replication enabled shards differently

Add new parameters, primaryBalance and primaryIndexBalance, to WeightFunction; they are used in the weight calculation for segment-replication-enabled shards. A rough sketch follows the pros/cons below.

primaryBalance -> number of primary shards on a node
primaryIndexBalance -> number of primary shards of a specific index on a node

Pros

  • Does not change the existing balancing logic and is thus less intrusive

Cons

  • Two different weight calculations in the cluster may not result in an even balance.
  • Complicated logic to collate the balancing of different shard types
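
The sketch mentioned above, purely as an illustration of approach 1 (parameter names and theta values are made up, not actual code): the segrep-only terms make docrep and segrep shards score on different scales, which is the collation problem noted in the cons.

// Hypothetical approach-1 weight: primary-balance terms applied only to
// segment-replication-enabled shards. All names and values are illustrative.
public class ApproachOneSketch {
    static float weight(boolean segRepIndex, float theta0, float theta1, float primaryBalance,
                        float shardDelta, float indexDelta, float primaryDelta) {
        float w = theta0 * shardDelta + theta1 * indexDelta;
        if (segRepIndex) {
            // Extra term only for segrep shards; docrep shards never see it.
            w += primaryBalance * primaryDelta;
        }
        return w;
    }

    public static void main(String[] args) {
        // Identical node state, different index type => different weights to reconcile.
        System.out.println("segrep: " + weight(true, 0.45f, 0.55f, 0.5f, 1.0f, 0.5f, 2.0f));
        System.out.println("docrep: " + weight(false, 0.45f, 0.55f, 0.5f, 1.0f, 0.5f, 2.0f));
    }
}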

2. Change allocation logic for all types of shards

Change the existing WeightFunction for all types of shards. The algorithm introduces another factor, primaryBalance, to ensure even primary shard distribution first, followed by replicas.

Pros:

  • Unifies allocation logic not just for SegRep but also for RemoteStore and more.

Cons:

  • Changing the core logic may cause unintended regressions and end-user behavior changes

[Edit]: Based on the above, approach 2 seems promising as it is simpler and does not handle SegRep indices separately, thus avoiding the collation of two different balancing algorithms. This approach can also be made less intrusive by not enabling the new algorithm by default.

@Bukhtawar
Collaborator

I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards due to custom routing logic.
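
To make this concrete, a rough sketch of the idea (purely illustrative; the dimensions, multipliers, and names are assumptions, not a proposed API): each shard gets a weight composed from several resource dimensions, with coefficients that could later be tuned dynamically.

// Illustrative only: a per-shard weight composed from multiple resource
// dimensions, with coefficients that could be adjusted over time
// (e.g. when an index turns read-only or traffic shifts between shards).
public class ShardWeightSketch {
    record ResourceVector(float compute, float memory, float io, float disk) {}

    static float shardWeight(boolean primary, boolean remoteStore, ResourceVector v,
                             float cCompute, float cMemory, float cIo, float cDisk) {
        float base = cCompute * v.compute() + cMemory * v.memory()
                   + cIo * v.io() + cDisk * v.disk();
        // Hypothetical multipliers for shard role and storage type.
        if (primary) base *= 1.5f;
        if (remoteStore) base *= 0.8f;
        return base;
    }

    public static void main(String[] args) {
        ResourceVector hot = new ResourceVector(0.7f, 0.5f, 0.6f, 0.4f);
        System.out.println("primary, local:  " + shardWeight(true, false, hot, 1f, 1f, 1f, 1f));
        System.out.println("replica, local:  " + shardWeight(false, false, hot, 1f, 1f, 1f, 1f));
    }
}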

@dreamer-89
Member

dreamer-89 commented Jan 11, 2023

I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards due to custom routing logic.

Thanks @Bukhtawar for the feedback and the suggestion. This is definitely useful in the larger scheme of things and something we may need to implement. It will need more thought, discussion, and probably an overhaul of the existing allocation. I am thinking of starting by updating the existing BalancedShardAllocator and tuning the weight function to incorporate segment replication semantics. We can iterate and make the model more robust in a follow-up. Please feel free to open an issue (with more details?) and we can discuss further.

@dreamer-89
Member

dreamer-89 commented Jan 12, 2023

Requirement

Ensure even distribution of primaries (and replicas) of the same index. Replicas also need even distribution because the primaries of an index may do varying amounts of work (hot shards), and so do the corresponding replicas.

Goals

As part of phase 1, targeting the goals below for this task.

  1. Shard distribution resulting in overall balance
  2. Minimal side effects on rebalancing
  3. Ensure no negative impact, such as:
  • Transient or permanently unassigned shards (red cluster).
  • Comparatively higher shard movement post allocation and rebalancing (node join, node drop, etc.).

Assumption

Balancing primary shards evenly for docrep indices does not impact the cluster performance.

Approach

Since the overall goal is uniform resource utilization among nodes containing various types of shards, we need to use the same scale to measure different shard types. Using separate allocation logic for different shard types, working in isolation, will NOT work. I think it makes sense to update the existing weightFunction to factor in primary shards, which solves the use case here and is simpler to start with. This will be the initial work to evolve the existing weight function and incorporate different factors, which might eventually need an allocation logic overhaul. As discussed here, proceeding with approach 2, which introduces a new primary shard balance factor.

Future Improvement

The weight function can be updated to cater to future needs, where different shards carry a built-in weight against an attribute (or weighing factor). E.g. a primary with more replicas (higher fan-out) should have a higher weight compared to other primaries, so that the weight function prioritizes even distribution of those primaries FIRST.
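
As a hedged illustration of that future improvement (the scaling choice and names are assumptions, not part of the POC), a primary's contribution to the node weight could simply be scaled by its replica fan-out:

// Illustrative only: scale a primary shard's contribution by its replica count,
// so high-fan-out primaries are balanced first.
public class FanOutWeightSketch {
    static float primaryContribution(int replicaCount) {
        // 1 + replicaCount is an arbitrary monotonic choice; any increasing
        // function of fan-out expresses the same preference.
        return 1.0f + replicaCount;
    }

    public static void main(String[] args) {
        System.out.println("primary with 1 replica:  " + primaryContribution(1));
        System.out.println("primary with 4 replicas: " + primaryContribution(4));
    }
}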

POC

Tried a POC that introduces a new setting to balance shards based on primary shard count, along with a corresponding update to WeightFunction.

[Edit]: This can still result in skew among segrep shards, i.e. unbalanced primary segrep shards on a node, even though the overall primary shard balance is within the threshold. Thanks @mch2 for pointing this out.

There are two already-existing balance factors. This POC adds a third, PRIMARY_BALANCE_FACTOR_SETTING, whose contribution appears as primaryWeightShard in the code below.

  1. INDEX_BALANCE_FACTOR_SETTING weighs the shard count per node for a given index. This ensures that shards belonging to an index are spread across nodes.
  2. SHARD_BALANCE_FACTOR_SETTING weighs the total shard count per node. This ensures that the total number of shards per node is balanced across the cluster.
  3. PRIMARY_BALANCE_FACTOR_SETTING weighs the primary shard count per node. This ensures that the number of primary shards per node is balanced.
float weight(ShardsBalancer balancer, ModelNode node, String index) {
    final float weightShard = node.numShards() - balancer.avgShardsPerNode();
    final float weightIndex = node.numShards(index) - balancer.avgShardsPerNode(index);
    final float primaryWeightShard = node.numPrimaryShards() - balancer.avgPrimaryShardsPerNode(); // primary balance factor

    return theta0 * weightShard + theta1 * weightIndex + theta2 * primaryWeightShard;
}
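
A quick worked example of the formula above (theta values are illustrative, not actual defaults): with equal total and per-index shard counts, the primary term breaks the tie toward the node holding fewer primaries, since the allocator prefers the node with the lower weight.

// Worked example of the POC weight formula with illustrative theta values.
public class PrimaryWeightExample {
    static float weight(float theta0, float theta1, float theta2,
                        float shardDelta, float indexDelta, float primaryDelta) {
        return theta0 * shardDelta + theta1 * indexDelta + theta2 * primaryDelta;
    }

    public static void main(String[] args) {
        float theta0 = 0.45f, theta1 = 0.55f, theta2 = 0.40f; // illustrative values only
        // Node A: 23 shards, 13 primaries; Node B: 23 shards, 9 primaries; average ~11.7 primaries/node.
        float wA = weight(theta0, theta1, theta2, 0f, 0f, 13 - 11.7f);
        float wB = weight(theta0, theta1, theta2, 0f, 0f, 9 - 11.7f);
        // wA > wB, so node B (fewer primaries) is preferred for the next primary shard.
        System.out.println("weight(A) = " + wA + ", weight(B) = " + wB);
    }
}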

There is no single value of PRIMARY_BALANCE_FACTOR_SETTING that distributes shards satisfying all three balance factors for every possible cluster configuration. This is true even today with only INDEX_BALANCE_FACTOR_SETTING and SHARD_BALANCE_FACTOR_SETTING. A higher value for the primary balance factor may result in primary balance but not necessarily equal shard counts per node. This probably needs some analysis of shard distribution under different constant weight factors. Added a subtask in the description.

@nknize @Bukhtawar @kotwanikunal @mch2 : Request for feedback

@dreamer-89
Member

dreamer-89 commented Feb 4, 2023

For subtask 2 mentioned here

The previous change in #6017 introduces a primary shard balance factor but doesn't differentiate between shard types (docrep vs segrep); the end result is an overall balanced primary shard distribution, but not necessarily a balanced distribution for each shard type. For example, if node removal results in 4 unassigned primary shards (2 docrep and 2 segrep), there is a chance of one node getting both docrep shards while the other gets both segrep shards. Changing this logic to accommodate segrep indices is not required because:

  1. LocalShardsBalancer prioritises primaries first for allocation, which means a balanced distribution across nodes. Also, it is not possible to create multiple indices at once, so it always deals with a single index at a time.
  2. The LocalShardsBalancer logic is meant for newly created indices only. A failed shard is handled by RoutingNodes#failShard, which promotes an in-sync replica copy (if any), while the existing shard allocator (GatewayAllocator) promotes an up-to-date copy of the shard as primary.
  3. The rebalancing logic is still applied post failover (in step 2) to keep the primary shard distribution balanced after [Segment Replication] Introduce primary weight factor for primary shards distribution #6017.

For logging purposes, added an unsuccessful test to mimic a scenario where the LocalShardsBalancer allocation logic is applied here.

@dreamer-89
Member

dreamer-89 commented Feb 6, 2023

Tracking the remaining work in #6210: benchmarking, guidance on the default value, and a single sane default (if possible). Closing this issue.
