EMR spot/on demand instance group creation fails after 10 minutes #14093

pasalkarsachin1 · 2020-07-08T13:46:22Z

My Terraform run is failing because aws_emr_instance_group cannot get my spot/on-demand instances into the RUNNING state within 10 minutes. Every time I re-run the job it creates a new instance group with the same name, which fails again. This leaves unnecessary groups behind: the old group came up after some time, but because the Terraform run had already failed, it was never recorded in the state file. The wait time is hardcoded to 10 minutes.

Terraform Version

0.12.20
hashicorp/aws 2.69.0

Affected Resource(s)

aws_emr_instance_group

Terraform Configuration Files

resource "aws_emr_instance_group" "spot_task" {
  count              = local.count
  cluster_id         = aws_emr_cluster.main_emr_cluster[0].id
  name               = "${local.cluster_name}_spot"
  instance_type      = var.spot_task_instance_type
  instance_count     = var.spot_task_instance_initial_count
  bid_price          = var.spot_bid_price
  autoscaling_policy = data.template_file.spot_autoscaling_policy.rendered

  ebs_config {
    size                 = var.spot_ebs_size
    type                 = "gp2"
    volumes_per_instance = 1
  }
}

resource "aws_emr_instance_group" "demand_task" {
  count              = local.count
  cluster_id         = aws_emr_cluster.main_emr_cluster[0].id
  name               = "${local.cluster_name}_on-demand"
  instance_type      = var.on_demand_task_instance_type
  instance_count     = var.on_demand_task_instance_initial_count
  autoscaling_policy = data.template_file.on_demand_autoscaling_policy.rendered

  ebs_config {
    size                 = var.on_demand_ebs_size
    type                 = "gp2"
    volumes_per_instance = 1
  }
}

Debug Output

Panic Output

Expected Behavior

At a minimum, the instance creation wait time should be configurable so that my Terraform run does not fail.

Actual Behavior

error waiting for EMR Instance Group (ig-XXXXXX) creation: timeout while waiting for state to become 'RUNNING' (last state: 'RESIZING', timeout: 10m0s)

Steps to Reproduce

terraform apply

Important Factoids

References

#0000

admssa · 2020-07-21T18:12:14Z

I had a similar issue when I used NAT instances instead of NAT gateways in my legacy VPCs.

The point is that EMR nodes are mutable: all the necessary components are installed on them after launch.
This consumes a huge amount of traffic in a short period of time. From the logs it was clear that at some point the scripts started running very slowly. My investigation led me to the NAT instances (t2.micro): they had used up all their CPU credits and couldn't pass traffic to the EMR nodes quickly enough.
It was solved by moving the EMR clusters to VPCs with NAT gateways and S3 endpoints.

Unfortunately, I can't say for sure this is the cause of your issue. But perhaps this will help someone.

pasalkarsachin1 · 2020-07-22T07:58:21Z

Thanks @admssa. I do have a VPC with a NAT gateway and an S3 endpoint set up as well, but it's still failing. In any case, the timeout is too low compared with the EMR cluster timeout, which is set to 75 minutes. That's why I raised a pull request with a 30-minute timeout (in the worst case I have seen some of my instances take 20 minutes to come up).

admssa · 2020-07-22T08:21:08Z

Yeah, I saw it and added a +1. Unfortunately, this won't solve the issue with autoscaling, which is the only advantage of instance groups (an instance group may get stuck in the RESIZING state because of a timeout on the AWS side).

bflad · 2020-08-27T18:41:23Z

The timeout increases have been merged and will release with version 3.4.0 of the Terraform AWS Provider, likely later today. Thanks to @pasalkarsachin1 for the implementation. 👍

ghost · 2020-08-27T22:41:23Z

This has been released in version 3.4.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!
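
For anyone picking up the fix, here is a minimal sketch of a version constraint that stops Terraform from selecting a provider release older than 3.4.0. Only the 3.4.0 version number comes from this thread; the rest is an illustrative assumption. It uses the string form of required_providers, which Terraform 0.12 accepts. On Terraform 0.13 and later the object form with a source attribute is the more usual spelling.

```hcl
# Assumption: Terraform 0.12.x, as reported in this issue.
# Require a provider release that includes the longer instance group timeout.
terraform {
  required_providers {
    aws = ">= 3.4.0"
  }
}
```

After changing the constraint, running terraform init -upgrade should fetch a matching provider release. Note that 2.69.0 to 3.x is a major version jump, so the provider's 3.0 upgrade guide is worth checking before applying.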

ghost · 2020-09-27T17:11:03Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

ghost added the service/emr Issues and PRs that pertain to the emr service. label Jul 8, 2020

github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jul 8, 2020

pasalkarsachin1 mentioned this issue Jul 9, 2020

Issue 14093: increased the EMR instance creation/updation timeout #14106

Merged

breathingdust added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Jul 10, 2020

bflad added this to the v3.4.0 milestone Aug 27, 2020

bflad closed this as completed in #14106 Aug 27, 2020

ghost locked and limited conversation to collaborators Sep 27, 2020
