EMR spot/on demand instance group creation fails after 10 minutes #14093

Closed
pasalkarsachin1 opened this issue Jul 8, 2020 · 6 comments · Fixed by #14106
Labels
bug Addresses a defect in current functionality. service/emr Issues and PRs that pertain to the emr service.

Contributor

pasalkarsachin1 commented Jul 8, 2020

My Terraform run is failing because aws_emr_instance_group cannot get my spot/on-demand instances into the RUNNING state within 10 minutes. Every time I re-run the job, it creates a new instance group with the same name and fails again. This leaves unnecessary groups behind: the old group does come up after some time, but because Terraform had already failed, the group was never recorded in the state file. The wait time is hardcoded to 10 minutes.

Terraform Version

0.12.20
hashicorp/aws 2.69.0

Affected Resource(s)

  • aws_emr_instance_group

Terraform Configuration Files

resource "aws_emr_instance_group" "spot_task" {
  count              = local.count
  cluster_id         = aws_emr_cluster.main_emr_cluster[0].id
  name               = "${local.cluster_name}_spot"
  instance_type      = var.spot_task_instance_type
  instance_count     = var.spot_task_instance_initial_count
  bid_price          = var.spot_bid_price
  autoscaling_policy = data.template_file.spot_autoscaling_policy.rendered
  ebs_config {
    size                 = var.spot_ebs_size
    type                 = "gp2"
    volumes_per_instance = 1
  }
}

resource "aws_emr_instance_group" "demand_task" {
  count              = local.count
  cluster_id         = aws_emr_cluster.main_emr_cluster[0].id
  name               = "${local.cluster_name}_on-demand"
  instance_type      = var.on_demand_task_instance_type
  instance_count     = var.on_demand_task_instance_initial_count
  autoscaling_policy = data.template_file.on_demand_autoscaling_policy.rendered

  ebs_config {
    size                 = var.on_demand_ebs_size
    type                 = "gp2"
    volumes_per_instance = 1
  }
}

Expected Behavior

At the least, the instance creation wait time should be configurable so that my Terraform run does not fail.

Actual Behavior

error waiting for EMR Instance Group (ig-XXXXXX) creation: timeout while waiting for state to become 'RUNNING' (last state: 'RESIZING', timeout: 10m0s)

Steps to Reproduce

  1. terraform apply

ghost added the service/emr label on Jul 8, 2020
github-actions bot added the needs-triage label on Jul 8, 2020
breathingdust added the bug label and removed the needs-triage label on Jul 10, 2020

admssa commented Jul 21, 2020

I had a similar issue when I used NAT instances instead of NAT gateways in my legacy VPCs.

The point is that EMR nodes are mutable: all necessary components are installed on them after launch.
This consumes a huge amount of traffic in a short period of time. From the logs, it was clear that at some point the scripts started running very slowly. My investigation led me to the NAT instances (t2.micro): they had used up all their CPU credits and couldn't pass traffic to the EMR nodes quickly enough.
It was solved by moving the EMR clusters to VPCs with NAT gateways and S3 endpoints.

Unfortunately, I can't say for sure this is the cause of your issue. But perhaps this will help someone.
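
For readers who want to try the S3 endpoint part of that setup, a minimal sketch might look like the following. It is only an illustration, not the configuration used in this thread: the names aws_vpc.emr and aws_route_table.emr_private are hypothetical placeholders for the VPC and the private route table that the EMR subnets use.

```hcl
# Hypothetical names: aws_vpc.emr and aws_route_table.emr_private stand in
# for the VPC and private route table that serve the EMR cluster.
data "aws_region" "current" {}

# Gateway endpoint so that S3 traffic from the EMR nodes bypasses the NAT path.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.emr.id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.emr_private.id]
}
```

With a gateway endpoint attached to the route table, downloads from S3 during node bootstrap should no longer pass through the NAT instance or NAT gateway, which is the bottleneck described above.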

@pasalkarsachin1
Contributor Author

Thanks @admssa. I do have a VPC with a NAT gateway and an S3 endpoint set up as well. However, it's still failing. In any case, the timeout is too low compared with the EMR cluster timeout, which is set to 75 minutes. That's why I raised a pull request with a 30-minute timeout (in the worst case, I have seen some of my instances take 20 minutes to come up).


admssa commented Jul 22, 2020

Yeah, saw it and added a +1. Unfortunately, this won't solve the issue with autoscaling, which is the only advantage of instance groups (an instance group may get stuck in the RESIZING state because of a timeout on the AWS side).

bflad added this to the v3.4.0 milestone on Aug 27, 2020
Contributor

bflad commented Aug 27, 2020

The timeout increases have been merged and will release with version 3.4.0 of the Terraform AWS Provider, likely later today. Thanks to @pasalkarsachin1 for the implementation. 👍


ghost commented Aug 27, 2020

This has been released in version 3.4.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!
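
To pick up the longer timeout, the provider has to be at version 3.4.0 or later. A minimal sketch of such a constraint, assuming the Terraform 0.12-style syntax that matches the version reported in this issue (later Terraform versions use an object form with a source attribute instead):

```hcl
# Assumption: Terraform 0.12-style constraint, matching the Terraform
# version reported in this issue.
terraform {
  required_providers {
    aws = ">= 3.4.0"
  }
}
```

After changing the constraint, running terraform init -upgrade should fetch the newer provider. Moving from 2.69.0 to 3.x is a major version upgrade, so it is worth checking the provider's version 3 upgrade guide for breaking changes first.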


ghost commented Sep 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

ghost locked and limited conversation to collaborators on Sep 27, 2020