
aws_elasticache_replication_group logic does not permit automatic_failover_enabled when num_of_clusters = 1 #18522

Closed
DoctorPolski opened this issue Apr 1, 2021 · 9 comments · Fixed by #18635
Labels
service/elasticache Issues and PRs that pertain to the elasticache service.

@DoctorPolski

DoctorPolski commented Apr 1, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform v0.14.8

  • provider registry.terraform.io/hashicorp/aws v3.34.0

Affected Resource(s)

  • aws_elasticache_replication_group

Terraform Configuration Files

resource "aws_elasticache_replication_group" "elasticache_redis" {
  replication_group_id          = "${var.environment}-redis"
  replication_group_description = var.redis_replication_group_description
  node_type                     = var.redis_instance_type
  automatic_failover_enabled    = var.redis_automatic_failover_enabled
  multi_az_enabled              = var.redis_multi_az_enabled
  auto_minor_version_upgrade    = var.redis_auto_minor_version_upgrade
  availability_zones            = data.aws_subnet.redis[*].availability_zone
  at_rest_encryption_enabled    = var.redis_at_rest_encryption_enabled
  transit_encryption_enabled    = var.redis_transit_encryption_enabled
  engine_version                = var.redis_engine_version
  parameter_group_name          = var.redis_parameter_group_name
  port                          = var.redis_port
  subnet_group_name             = aws_elasticache_subnet_group.mds_elasticache_redis_subnet_group.name
  maintenance_window            = var.redis_maintenance_window
  snapshot_window               = var.redis_snapshot_window
  snapshot_retention_limit      = var.redis_snapshot_retention_limit

  security_group_ids = [
    aws_security_group.elasticache_redis_sg.id,
  ]

  cluster_mode {
    replicas_per_node_group = var.redis_replicas_per_node_group
    num_node_groups         = var.redis_num_node_groups
  }
}

Expected Behavior

automatic_failover_enabled = true
multi_az_enabled           = false

cluster_mode {
  replicas_per_node_group = 0
  num_node_groups         = 1
}

With the above config, terraform plan should succeed.

Actual Behavior

Error: if automatic_failover_enabled is true, number_cache_clusters must be greater than 1

Steps to Reproduce

  1. terraform apply

Important Factoids

  1. This working Redis config is currently in place on AWS and has been for well over a year. It was configured via Terraform v0.12 with a much earlier version of the provider.
  2. Ever since the release of provider v3.26.0, Terraform will no longer plan, despite this config not having changed.
  3. The config suggestion made by @gdavison in "aws_elasticache_replication_group setting multi-az to false when automatic_failover_enabled is true" #17376 (comment) is already in place, as illustrated above, so it is of no help here.
  4. As mentioned in #17376 (comment) and again in #17376 (comment), the issue remains that Terraform is insisting on this:

If automatic_failover_enabled is enabled, number_cache_clusters must be greater than 1

Source: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/elasticache_replication_group#automatic_failover_enabled

Which does not agree with the working config I currently have on AWS:
[screenshot: small_env]

So I have to question whether the assertion made by Terraform is correct. My belief is that the number of clusters DOES NOT need to be greater than 1 for automatic_failover_enabled to be TRUE.

I agree with @gdavison here #17605 (comment) that it makes no logical sense to set automatic_failover_enabled to true for a single node, since indeed, where would it fail over to? But AWS permits this, and therefore it is incorrect for Terraform to enforce otherwise.

I have Serverless code dependencies on cluster mode being set, so I cannot specify my single-node environments without cluster mode enabled.

References

@ghost ghost added the service/elasticache Issues and PRs that pertain to the elasticache service. label Apr 1, 2021
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Apr 1, 2021
@choikijin-techcross

I have the same problem.

#17605 (comment)
#17376 (comment)

I saw these two comments in related issues, so it seems the developer of this code expects the following Terraform config to work:

automatic_failover_enabled = true
multi_az_enabled           = false

cluster_mode {
  replicas_per_node_group = 0
  num_node_groups         = 1
}

Unfortunately, I can't create the smallest Redis cluster setup that @gdavison mentioned. I hope it'll be fixed soon.

@anGie44 anGie44 added thinking and removed needs-triage Waiting for first response or review from a maintainer. labels Apr 1, 2021
@gdavison gdavison self-assigned this Apr 1, 2021
@gdavison
Contributor

gdavison commented Apr 1, 2021

For clarification, is part of the issue that you haven't directly configured number_cache_clusters, yet the error message references it? If so, I can try to improve the error message. We currently allow configuring non-cluster mode replication groups using either number_cache_clusters or cluster_mode to make development and production configurations easier. number_cache_clusters == cluster_mode.num_node_groups * (1 + cluster_mode.replicas_per_node_group).
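To illustrate that equivalence, here is a hypothetical minimal sketch (resource names and values are invented, and required arguments such as subnet and security groups are omitted) of two configurations that describe the same one-shard, two-node replication group:

```hcl
# Hypothetical sketch: two equivalent ways to describe the same
# replication group (one primary plus one replica).

# Form 1: explicit node count.
resource "aws_elasticache_replication_group" "example_a" {
  replication_group_id          = "example-a"
  replication_group_description = "primary plus one replica"
  node_type                     = "cache.t3.micro"
  number_cache_clusters         = 2 # = 1 * (1 + 1)
}

# Form 2: cluster_mode block; the provider derives
# number_cache_clusters = num_node_groups * (1 + replicas_per_node_group) = 2.
resource "aws_elasticache_replication_group" "example_b" {
  replication_group_id          = "example-b"
  replication_group_description = "primary plus one replica"
  node_type                     = "cache.t3.micro"

  cluster_mode {
    num_node_groups         = 1
    replicas_per_node_group = 1
  }
}
```

With this equivalence, the validation error can reference a number_cache_clusters value the user never wrote, which is the confusion reported above.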

For your existing replication group, can you add a screenshot of the current parameters? In the list of replication groups, click on the triangle to show this. I'm wondering if there's an unexpected configuration, especially since AWS changed how some settings are interpreted.

@gdavison gdavison added the waiting-response Maintainers are waiting on response from community or contributor. label Apr 1, 2021
@choikijin-techcross

choikijin-techcross commented Apr 2, 2021

@gdavison I didn't configure number_cache_clusters and the error message is Error: if automatic_failover_enabled is true, number_cache_clusters must be greater than 1.

My settings are the same as @DoctorPolski's, so

cluster_mode.num_node_groups = 1,
cluster_mode.replicas_per_node_group = 0,
number_cache_clusters == 1 * (1 + 0) = 1

and that's not greater than 1.
So I think the calculation is working, but Terraform doesn't allow me to use the smallest possible cluster.

I didn't add number_cache_clusters in my code, but terraform plan shows number_cache_clusters in the result. Maybe Terraform is computing number_cache_clusters internally.

The terraform plan result below doesn't show a change to replicas_per_node_group because I already removed the replica in the AWS console.

[screenshot]

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.redis_cluster.aws_elasticache_replication_group.redis_cluster will be updated in-place
  ~ resource "aws_elasticache_replication_group" "redis_cluster" {
        arn                            = "arn:aws:elasticache:ap-northeast-1:xxxxxx:replicationgroup:xxxxxx-redis-cluster"
        at_rest_encryption_enabled     = false
        auto_minor_version_upgrade     = true
      ~ automatic_failover_enabled     = true -> false
        cluster_enabled                = true
        configuration_endpoint_address = "xxxxxx-redis-cluster.gfqrwu.clustercfg.apne1.cache.amazonaws.com"
        engine                         = "redis"
        engine_version                 = "4.0.10"
        id                             = "xxxxxx-redis-cluster"
        maintenance_window             = "tue:05:00-tue:06:00"
        member_clusters                = [
            "xxxxxx-redis-cluster-0001-001",
        ]
        multi_az_enabled               = false
        node_type                      = "cache.t2.small"
        number_cache_clusters          = 1
        parameter_group_name           = "xxxxxx-redis-cluster-params"
        port                           = 6379
        replication_group_description  = "xxxxxx-redis-cluster"
        replication_group_id           = "xxxxxx-redis-cluster"
        security_group_ids             = [
            "sg-981aa5e0",
        ]
        security_group_names           = []
        snapshot_retention_limit       = 5
        snapshot_window                = "14:00-15:00"
        subnet_group_name              = "xxxxxx-redis-subnet-group"
        tags                           = {
            "AppRole" = "redis-cluster"
        }
        transit_encryption_enabled     = false

        cluster_mode {
            num_node_groups         = 1
            replicas_per_node_group = 0
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

------------------------------------------------------------------------
$ grep -rn number_cache_clusters modules/elasticache/redis_cluster/
(no output: number_cache_clusters is not set anywhere in the module)

PS. The terraform plan shown above is from my last attempt to resolve this problem, and as you might expect, applying it failed because I changed automatic_failover_enabled to false.

@ghost ghost removed waiting-response Maintainers are waiting on response from community or contributor. labels Apr 2, 2021
@DoctorPolski
Author

Thanks for engaging on this Graham.

The error message is a little confusing because it only directly references the number_cache_clusters method of provisioning clusters. Honestly, good luck thinking up a concise error message that references both...

These are the params for the cluster referenced in my OP:
[screenshot: small_env_params]

I believe we may share the same suspicion: since AWS made the changes that brought about the need for your v3.26.0 update, my config may effectively no longer be possible to create. Could it therefore be that I and others have been left with an orphaned legacy cluster config as a result of those changes?

The only way to test this theory directly would be to attempt to recreate a similar cluster config from scratch using the AWS console or CLI. Even if that turned out to be possible, the mystery of being able to set automatic_failover_enabled to TRUE for a single node, with nothing to fail over to, would still remain.

In either case I think a support request to AWS is required here in parallel to our investigations.

@gdavison
Contributor

gdavison commented Apr 5, 2021

I'll take a look at improving the error messaging. Part of the complication is that we allow setting either number_cache_clusters or cluster_mode and calculate the other. When the parameter validations are done, it picks up the calculated number_cache_clusters.

I'll do some more investigation around the configuration and try to reproduce it. It theoretically should not be possible, at least now, unless there's also some interaction with "Cluster-Mode enabled"/sharding.

Right now, if the validation is turned off, the AWS API returns an error, but I haven't tried it with a cluster.on parameter group.

@DoctorPolski
Author

I have tested the creation of a "single node cluster" directly in the console, and your suspicions are correct, @gdavison: cluster.on does indeed make all the difference.

This is my creation template:

[screenshot: small_create]

Selecting the Cluster Mode enabled Redis sub-option dynamically updates the Parameter group setting to include the cluster.on suffix.

Setting Number of Shards to 1 and Replicas per Shard to 0 forcibly unsets the Multi-AZ option and produces the blue info ribbon. We know the provider currently accounts for this as per previous discussion in this issue and related threads.

Resulting cluster:

[screenshot: small_cluster]

In respect of clustering, this matches the config of my "small environments".
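For reference, a hypothetical Terraform sketch of the cluster-mode-enabled, single-shard group created in the console above (identifiers and the parameter group name are illustrative, and other required arguments such as subnet and security groups are omitted):

```hcl
# Hypothetical sketch of the console-created "single node cluster":
# cluster mode enabled via a cluster.on parameter group, one shard, no replicas.
resource "aws_elasticache_replication_group" "small_env" {
  replication_group_id          = "small-env-redis"
  replication_group_description = "single-shard cluster-mode-enabled group"
  node_type                     = "cache.t2.small"
  engine_version                = "4.0.10"
  parameter_group_name          = "default.redis4.0.cluster.on" # the cluster.on suffix is the key difference
  automatic_failover_enabled    = true  # permitted by AWS for cluster-mode-enabled groups
  multi_az_enabled              = false # forcibly unset by the console for 1 shard / 0 replicas

  cluster_mode {
    num_node_groups         = 1
    replicas_per_node_group = 0
  }
}
```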

@gdavison
Contributor

gdavison commented Apr 6, 2021

Thanks for testing that, @DoctorPolski. That's not the result I expected… 😀

@ghost

ghost commented Apr 9, 2021

This has been released in version 3.36.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost

ghost commented May 9, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators May 9, 2021