
aws_elasticache_replication_group logic does not permit automatic_failover_enabled when num_of_clusters = 1 #18522

Closed
DoctorPolski opened this issue Apr 1, 2021 · 9 comments · Fixed by #18635
Labels
service/elasticache Issues and PRs that pertain to the elasticache service.

@DoctorPolski

DoctorPolski commented Apr 1, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform v0.14.8

  • provider registry.terraform.io/hashicorp/aws v3.34.0

Affected Resource(s)

  • aws_elasticache_replication_group

Terraform Configuration Files

resource "aws_elasticache_replication_group" "elasticache_redis" {
  replication_group_id          = "${var.environment}-redis"
  replication_group_description = var.redis_replication_group_description
  node_type                     = var.redis_instance_type
  automatic_failover_enabled    = var.redis_automatic_failover_enabled
  multi_az_enabled              = var.redis_multi_az_enabled
  auto_minor_version_upgrade    = var.redis_auto_minor_version_upgrade
  availability_zones            = data.aws_subnet.redis[*].availability_zone
  at_rest_encryption_enabled    = var.redis_at_rest_encryption_enabled
  transit_encryption_enabled    = var.redis_transit_encryption_enabled
  engine_version                = var.redis_engine_version
  parameter_group_name          = var.redis_parameter_group_name
  port                          = var.redis_port
  subnet_group_name             = aws_elasticache_subnet_group.mds_elasticache_redis_subnet_group.name
  maintenance_window            = var.redis_maintenance_window
  snapshot_window               = var.redis_snapshot_window
  snapshot_retention_limit      = var.redis_snapshot_retention_limit

  security_group_ids = [
    aws_security_group.elasticache_redis_sg.id,
  ]

  cluster_mode {
    replicas_per_node_group = var.redis_replicas_per_node_group
    num_node_groups         = var.redis_num_node_groups
  }
}

Expected Behavior

automatic_failover_enabled = true
multi_az_enabled           = false

cluster_mode {
  replicas_per_node_group = 0
  num_node_groups         = 1
}

With the above config, terraform plan should succeed.

Actual Behavior

Error: if automatic_failover_enabled is true, number_cache_clusters must be greater than 1

Steps to Reproduce

  1. terraform apply

Important Factoids

  1. This working Redis config is currently in place on AWS and has been for well over a year. It was configured via Terraform v0.12 with a much earlier version of the provider.
  2. Ever since the release of provider v3.26.0, Terraform will no longer plan, despite this config not having changed.
  3. The config suggestion made by @gdavison in "aws_elasticache_replication_group setting multi-az to false when automatic_failover_enabled is true" #17376 (comment) is already in place, as illustrated above, so it is of no help here.
  4. As mentioned in #17376 (comment) and again in #17376 (comment), the issue remains that Terraform is insisting on this:

If automatic_failover_enabled is enabled, number_cache_clusters must be greater than 1

Source: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/elasticache_replication_group#automatic_failover_enabled

Which does not agree with the working config I currently have on AWS:
[screenshot: small_env]

So I have to question whether the assertion made by Terraform is correct. My belief is that the number of clusters DOES NOT need to be greater than 1 for automatic_failover_enabled to be TRUE.

I agree with @gdavison here #17605 (comment) that it makes no logical sense to set automatic_failover_enabled to true for a single node, since indeed, where would it fail over to? But AWS permits this, and therefore it is incorrect for Terraform to enforce otherwise.

I have Serverless code dependencies on cluster mode being set, so I cannot specify my single-node environments without cluster mode enabled.

References

@ghost ghost added the service/elasticache Issues and PRs that pertain to the elasticache service. label Apr 1, 2021
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Apr 1, 2021
@choikijin-techcross

I have the same problem.

#17605 (comment)
#17376 (comment)

I saw these two comments in related issues, so it seems the developer of this code expects the following Terraform config to work:

automatic_failover_enabled = true
multi_az_enabled           = false

cluster_mode {
  replicas_per_node_group = 0
  num_node_groups         = 1
}

Unfortunately, I can't create the smallest Redis cluster setup that @gdavison mentioned. I hope it'll be fixed soon.

@anGie44 anGie44 added thinking and removed needs-triage Waiting for first response or review from a maintainer. labels Apr 1, 2021
@gdavison gdavison self-assigned this Apr 1, 2021
@gdavison
Contributor

gdavison commented Apr 1, 2021

For clarification, is part of the issue that you haven't directly configured number_cache_clusters, yet the error message references it? If so, I can try to improve the error message. We currently allow configuring non-cluster mode replication groups using either number_cache_clusters or cluster_mode to make development and production configurations easier. number_cache_clusters == cluster_mode.num_node_groups * (1 + cluster_mode.replicas_per_node_group).
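To illustrate that equivalence, here is a hypothetical minimal sketch (resource names and values are invented, and required arguments such as subnet and security groups are omitted) of two configurations that describe the same one-shard, two-node replication group:

```hcl
# Hypothetical sketch: two equivalent ways to describe the same
# replication group (one primary plus one replica).

# Form 1: explicit node count.
resource "aws_elasticache_replication_group" "example_a" {
  replication_group_id          = "example-a"
  replication_group_description = "primary plus one replica"
  node_type                     = "cache.t3.micro"
  number_cache_clusters         = 2 # = 1 * (1 + 1)
}

# Form 2: cluster_mode block; the provider derives
# number_cache_clusters = num_node_groups * (1 + replicas_per_node_group) = 2.
resource "aws_elasticache_replication_group" "example_b" {
  replication_group_id          = "example-b"
  replication_group_description = "primary plus one replica"
  node_type                     = "cache.t3.micro"

  cluster_mode {
    num_node_groups         = 1
    replicas_per_node_group = 1
  }
}
```

With this equivalence, the validation error can reference a number_cache_clusters value the user never wrote, which is the confusion reported above.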

For your existing replication group, can you add a screenshot of the current parameters? In the list of replication groups, click on the triangle to show this. I'm wondering if there's an unexpected configuration, especially since AWS changed how some settings are interpreted.

@gdavison gdavison added the waiting-response Maintainers are waiting on response from community or contributor. label Apr 1, 2021
@choikijin-techcross

choikijin-techcross commented Apr 2, 2021

@gdavison I didn't configure number_cache_clusters and the error message is Error: if automatic_failover_enabled is true, number_cache_clusters must be greater than 1.

My settings are the same as @DoctorPolski's, so

cluster_mode.num_node_groups = 1,
cluster_mode.replicas_per_node_group = 0,
number_cache_clusters == 1 * (1 + 0) = 1

and that's not greater than 1.
So I think the calculation is working, but Terraform doesn't allow me to use the smallest possible cluster.

I didn't add number_cache_clusters in my code, but terraform plan shows number_cache_clusters in the result. Maybe Terraform is computing number_cache_clusters internally.

The terraform plan result below doesn't show a change to replicas_per_node_group because I already removed the replica in the AWS console.

[screenshot]

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.redis_cluster.aws_elasticache_replication_group.redis_cluster will be updated in-place
  ~ resource "aws_elasticache_replication_group" "redis_cluster" {
        arn                            = "arn:aws:elasticache:ap-northeast-1:xxxxxx:replicationgroup:xxxxxx-redis-cluster"
        at_rest_encryption_enabled     = false
        auto_minor_version_upgrade     = true
      ~ automatic_failover_enabled     = true -> false
        cluster_enabled                = true
        configuration_endpoint_address = "xxxxxx-redis-cluster.gfqrwu.clustercfg.apne1.cache.amazonaws.com"
        engine                         = "redis"
        engine_version                 = "4.0.10"
        id                             = "xxxxxx-redis-cluster"
        maintenance_window             = "tue:05:00-tue:06:00"
        member_clusters                = [
            "xxxxxx-redis-cluster-0001-001",
        ]
        multi_az_enabled               = false
        node_type                      = "cache.t2.small"
        number_cache_clusters          = 1
        parameter_group_name           = "xxxxxx-redis-cluster-params"
        port                           = 6379
        replication_group_description  = "xxxxxx-redis-cluster"
        replication_group_id           = "xxxxxx-redis-cluster"
        security_group_ids             = [
            "sg-981aa5e0",
        ]
        security_group_names           = []
        snapshot_retention_limit       = 5
        snapshot_window                = "14:00-15:00"
        subnet_group_name              = "xxxxxx-redis-subnet-group"
        tags                           = {
            "AppRole" = "redis-cluster"
        }
        transit_encryption_enabled     = false

        cluster_mode {
            num_node_groups         = 1
            replicas_per_node_group = 0
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

------------------------------------------------------------------------
$ grep -rn number_cache_clusters modules/elasticache/redis_cluster/
(no output: number_cache_clusters is not set anywhere in the module)

PS. The terraform plan shown above is from my last attempt to resolve this problem, and as you might expect, applying it failed because I changed automatic_failover_enabled to false.

@ghost ghost removed waiting-response Maintainers are waiting on response from community or contributor. labels Apr 2, 2021
@DoctorPolski
Author

Thanks for engaging on this Graham.

The error message is a little confusing because it only directly references the number_cache_clusters method of provisioning clusters. Honestly, good luck thinking up a concise error message that references both...

These are the params for the cluster referenced in my OP:
[screenshot: small_env_params]

I believe we may share the same suspicion: since AWS made the changes that brought about the need for your v3.26.0 update, my config may effectively no longer be possible to create. Could it therefore be that I and others have been left with an orphaned legacy cluster config as a result of those changes?

The only way to test this theory directly would be to attempt to recreate a similar cluster config from scratch using the AWS console or CLI. Even if that turned out to be possible, the mystery of being able to set automatic_failover_enabled to TRUE for a single node, with nothing to fail over to, would still remain.

In either case I think a support request to AWS is required here in parallel to our investigations.

@gdavison
Contributor

gdavison commented Apr 5, 2021

I'll take a look at improving the error messaging. Part of the complication is that we allow setting either number_cache_clusters or cluster_mode and calculate the other. When the parameter validations are done, it picks up the calculated number_cache_clusters.

I'll do some more investigation around the configuration and try to reproduce it. It theoretically should not be possible, at least now, unless there's also some interaction with "Cluster-Mode enabled"/sharding.

Right now, if the validation is turned off, the AWS API returns an error, but I haven't tried it with a cluster.on parameter group.

@DoctorPolski
Author

I have tested the creation of a "single node cluster" directly in the console, and your suspicions are correct, @gdavison: cluster.on does indeed make all the difference.

This is my creation template:

[screenshot: small_create]

Selecting the Cluster Mode enabled Redis sub-option dynamically updates the Parameter group setting to include the cluster.on suffix.

Setting Number of Shards to 1 and Replicas per Shard to 0 forcibly unsets the Multi-AZ option and produces the blue info ribbon. We know the provider currently accounts for this as per previous discussion in this issue and related threads.

Resulting cluster:

[screenshot: small_cluster]

In respect of clustering, this matches the config of my "small environments".
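For reference, a hypothetical Terraform sketch of the cluster-mode-enabled, single-shard group created in the console above (identifiers and the parameter group name are illustrative, and other required arguments such as subnet and security groups are omitted):

```hcl
# Hypothetical sketch of the console-created "single node cluster":
# cluster mode enabled via a cluster.on parameter group, one shard, no replicas.
resource "aws_elasticache_replication_group" "small_env" {
  replication_group_id          = "small-env-redis"
  replication_group_description = "single-shard cluster-mode-enabled group"
  node_type                     = "cache.t2.small"
  engine_version                = "4.0.10"
  parameter_group_name          = "default.redis4.0.cluster.on" # the cluster.on suffix is the key difference
  automatic_failover_enabled    = true  # permitted by AWS for cluster-mode-enabled groups
  multi_az_enabled              = false # forcibly unset by the console for 1 shard / 0 replicas

  cluster_mode {
    num_node_groups         = 1
    replicas_per_node_group = 0
  }
}
```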

@gdavison
Contributor

gdavison commented Apr 6, 2021

Thanks for testing that, @DoctorPolski. That's not the result I expected… 😀

@ghost

ghost commented Apr 9, 2021

This has been released in version 3.36.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost

ghost commented May 9, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators May 9, 2021