ecs: "The provided launch template does not expose its user data" when trying to add a second capacity provider #30742

Open
rantoniuk opened this issue Jul 3, 2024 · 3 comments
Labels
@aws-cdk/aws-ecs (Related to Amazon Elastic Container), bug (This issue is a bug.), p3

Comments

@rantoniuk

rantoniuk commented Jul 3, 2024

Describe the bug

The code below works fine up to the line marked ---------------- inf1, that is, with only one capacity provider, gpuCapacityProvider.
When I add a second capacity provider, inf1CP, backed by a new LaunchTemplate that does not set userData at all, cdk diff fails with:

Error: The provided launch template does not expose its user data.
    at AutoScalingGroup.get userData [as userData] (infra/cdk/node_modules/aws-cdk-lib/aws-autoscaling/lib/auto-scaling-group.js:1:24056)
    at AutoScalingGroup.addUserData (infra/cdk/node_modules/aws-cdk-lib/aws-autoscaling/lib/auto-scaling-group.js:1:22335)
    at Cluster.configureAutoScalingGroup (infra/cdk/node_modules/aws-cdk-lib/aws-ecs/lib/cluster.js:1:11190)
    at Cluster.addAsgCapacityProvider (infra/cdk/node_modules/aws-cdk-lib/aws-ecs/lib/cluster.js:1:9915)
    at new EcsStack (infra/cdk/lib/ecs-stack.ts:130:18)
    at Object.<anonymous> (infra/cdk/bin/cdk.ts:35:13)
    at Module._compile (node:internal/modules/cjs/loader:1358:14)
    at Module.m._compile (infra/cdk/node_modules/ts-node/src/index.ts:1618:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1416:10)
    at Object.require.extensions.<computed> [as .ts] (infra/cdk/node_modules/ts-node/src/index.ts:1621:12)

Subprocess exited with error 1

which is specifically caused by this line:

    this.cluster.addAsgCapacityProvider(inf1CP);

import { Stack, StackProps } from 'aws-cdk-lib';
import { AutoScalingGroup, IAutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { AsgCapacityProvider, Cluster } from 'aws-cdk-lib/aws-ecs';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
import { IEnvironmentConfig } from './helpers/environment-config';

interface EcsStackProps extends StackProps {
  envv: IEnvironmentConfig;
  vpc: ec2.Vpc;
}

export class EcsStack extends Stack {
  readonly cluster: Cluster;
  readonly execRole: iam.IRole;
  readonly gpuAutoScalingGroup: IAutoScalingGroup;

  constructor(scope: Construct, id: string, props: EcsStackProps) {
    super(scope, id, props);

    this.cluster = new Cluster(this, 'EcsCluster', {
      clusterName: 'EcsCluster',
      vpc: props.vpc,
    });

    // Ec2 Security Group
    const gpuinstanceSecurityGroup = new ec2.SecurityGroup(this, 'EcsGpuInstanceSg', {
      securityGroupName: 'EcsGpuInstanceSg',
      description: ' security group for gpu instances for ecs tasks',
      vpc: props.vpc,
    });

    // EC2 Execution Role with access to ECS actions
    const ltRole = new iam.Role(this, 'EcsClusterRole', {
      roleName: 'ecs-cluster-role',
      assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchAgentServerPolicy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2ContainerServiceforEC2Role'),
      ],
    });

    const rootVolume: ec2.BlockDevice = {
      deviceName: '/dev/xvda',
      volume: ec2.BlockDeviceVolume.ebs(100),
    };

    // set GPU as the default for Docker
    const userData = ec2.UserData.forLinux();
    userData.addCommands(
      'sudo rm /etc/sysconfig/docker',
      'echo DAEMON_MAXFILES=1048576 | sudo tee -a /etc/sysconfig/docker',
      'echo OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia" | sudo tee -a /etc/sysconfig/docker',
      'echo DAEMON_PIDFILE_TIMEOUT=10 | sudo tee -a /etc/sysconfig/docker',
      'sudo systemctl restart docker',
    );

    // GPU EC2 Launch Template
    const launchTemplate = new ec2.LaunchTemplate(this, 'EcsClusterLt', {
      launchTemplateName: 'ecs-gpu-lt',
      machineImage: ec2.MachineImage.genericLinux({
        // ecs optimised image with gpu support
        'us-west-2': 'ami-027492973b111510a',
      }),
      instanceType: new ec2.InstanceType('g4dn.xlarge'),
      role: ltRole,
      userData: userData,
      securityGroup: gpuinstanceSecurityGroup,
      blockDevices: [rootVolume],
      requireImdsv2: true,
    });

    // Add GPU autoscaling capacity provider to the cluster
    const gpuAutoScalingGroup = new AutoScalingGroup(this, 'EcsGpuASG', {
      autoScalingGroupName: 'EcsGpuASG',
      vpc: props.vpc,
      launchTemplate,
      minCapacity: 0,
      maxCapacity: 1,
    });

    //Add the capacity to the cluster
    const gpuCapacityProvider = new AsgCapacityProvider(this, 'EcsGpuCapacityProvider', {
      autoScalingGroup: gpuAutoScalingGroup,
      capacityProviderName: 'gpuCapacityProvider',
    });

    this.cluster.addAsgCapacityProvider(gpuCapacityProvider);

    this.cluster.addDefaultCloudMapNamespace({
      name: 'local',
      useForServiceConnect: true,
    });

    // ---------------- inf1

    // Inf1 EC2 Launch Template
    const launchTemplateInf1 = new ec2.LaunchTemplate(this, 'EcsClusterInf1', {
      machineImage: ec2.MachineImage.genericLinux({
        // aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended
        'us-west-2': 'ami-00a3a4671e9889e76',
      }),
      instanceType: new ec2.InstanceType('inf1.2xlarge'),
      role: ltRole,
      securityGroup: gpuinstanceSecurityGroup,
      // blockDevices: [rootVolume],
      requireImdsv2: true,
    });

    const inf1ASG = new AutoScalingGroup(this, 'EcsInf1ASG', {
      autoScalingGroupName: 'EcsInf1ASG',
      vpc: props.vpc,
      launchTemplate: launchTemplateInf1,
      minCapacity: 0,
      maxCapacity: 1,
    });

    //Add the capacity to the cluster
    const inf1CP = new AsgCapacityProvider(this, 'EcsInf1CapacityProvider', {
      autoScalingGroup: inf1ASG,
      capacityProviderName: 'Inf1AsgCapacityProvider',
    });

    this.cluster.addAsgCapacityProvider(inf1CP);

    this.cluster.addDefaultCapacityProviderStrategy([
      { capacityProvider: gpuCapacityProvider.capacityProviderName, weight: 1 },
      { capacityProvider: inf1CP.capacityProviderName, weight: 0 },

    ]);
  }
}

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.146.0 (build b368c78)

Framework Version

No response

Node.js Version

v20.13.1

OS

MacOS

Language

TypeScript

Language Version

"typescript": "~5.2.0"

Other information

No response

@rantoniuk added the bug and needs-triage labels on Jul 3, 2024
@github-actions (bot) added the @aws-cdk/aws-ecs label on Jul 3, 2024
@ashishdhingra
Contributor

ashishdhingra commented Jul 3, 2024

@rantoniuk Good afternoon, and thanks for opening the issue. The error is perhaps thrown here. Please refer to the Clusters section of the Amazon ECS Construct Library README, which states: "To use LaunchTemplate with AsgCapacityProvider, make sure to specify the userData in the LaunchTemplate." Does the error go away once you explicitly specify userData in the 2nd LaunchTemplate (as you did in the 1st LaunchTemplate)?

We also have an open issue, #26035 (comment), to improve the error message when user data is missing from the launch template, but we don't have an ETA as of now.

Thanks,
Ashish
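
As a rough illustration of the README requirement, application code can default the user data before the launch template is built, so that a template never reaches the cluster without it. This is only a sketch: UserDataLike, LaunchTemplatePropsLike, emptyLinuxUserData and ensureUserData are hypothetical stand-ins, not aws-cdk-lib APIs. In a real stack the default would be ec2.UserData.forLinux(), as used elsewhere in this thread.

```typescript
// Hypothetical stand-ins for ec2.UserData and ec2.LaunchTemplateProps.
interface UserDataLike {
  readonly commands: string[];
}

interface LaunchTemplatePropsLike {
  readonly instanceType: string;
  readonly userData?: UserDataLike;
}

// Assumption: in a real CDK stack this would be ec2.UserData.forLinux().
function emptyLinuxUserData(): UserDataLike {
  return { commands: [] };
}

// Returns props that always carry userData, keeping any value the caller set.
function ensureUserData(
  props: LaunchTemplatePropsLike,
): LaunchTemplatePropsLike & { readonly userData: UserDataLike } {
  return { ...props, userData: props.userData ?? emptyLinuxUserData() };
}

// No userData supplied: an empty one is filled in.
const inf1Props = ensureUserData({ instanceType: 'inf1.2xlarge' });

// userData supplied: it is passed through unchanged.
const gpuProps = ensureUserData({
  instanceType: 'g4dn.xlarge',
  userData: { commands: ['sudo systemctl restart docker'] },
});
```

The helper keeps the decision explicit in the application instead of relying on a construct default.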

@ashishdhingra added the response-requested label (Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.) and removed the needs-triage label on Jul 3, 2024
@pahud
Contributor

pahud commented Jul 3, 2024

Yes.

If you look at the stack trace, it fails at this method:

AutoScalingGroup.addUserData

message: The provided launch template does not expose its user data.

And if you check here:

public get userData(): ec2.UserData {
  if (this._userData) {
    return this._userData;
  }
  if (this.launchTemplate?.userData) {
    return this.launchTemplate.userData;
  }
  throw new Error('The provided launch template does not expose its user data.');
}

If a launchTemplate is provided, it must have the userData attribute set.

Your launchTemplateInf1 is missing userData:
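
The behavior of that getter can be reproduced without CDK. The following is a minimal sketch using hypothetical stand-in classes (FakeUserData, FakeLaunchTemplate and FakeAutoScalingGroup are not aws-cdk-lib types, and the command passed to addUserData is only illustrative). It shows that addUserData succeeds when the template carries user data and throws the same message when it does not.

```typescript
// Hypothetical stand-ins; these are NOT the aws-cdk-lib classes.
class FakeUserData {
  readonly commands: string[] = [];

  addCommands(...cmds: string[]): void {
    this.commands.push(...cmds);
  }
}

interface FakeLaunchTemplate {
  readonly userData?: FakeUserData;
}

class FakeAutoScalingGroup {
  constructor(private readonly launchTemplate?: FakeLaunchTemplate) {}

  // Same shape as the getter quoted above, minus the _userData branch.
  get userData(): FakeUserData {
    if (this.launchTemplate?.userData) {
      return this.launchTemplate.userData;
    }
    throw new Error('The provided launch template does not expose its user data.');
  }

  // Reads the getter first, so it throws before any command is added.
  addUserData(...cmds: string[]): void {
    this.userData.addCommands(...cmds);
  }
}

// Returns the error message, or undefined when the call succeeds.
function tryAddUserData(asg: FakeAutoScalingGroup): string | undefined {
  try {
    asg.addUserData('echo illustrative-command');
    return undefined;
  } catch (e) {
    return (e as Error).message;
  }
}

const asgWithUserData = new FakeAutoScalingGroup({ userData: new FakeUserData() });
const asgWithoutUserData = new FakeAutoScalingGroup({});

const okResult = tryAddUserData(asgWithUserData);
const failResult = tryAddUserData(asgWithoutUserData);
```

Here okResult stays undefined, while failResult holds the error message from the stack trace.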

const launchTemplateInf1 = new ec2.LaunchTemplate(this, 'EcsClusterInf1', {
  machineImage: ec2.MachineImage.genericLinux({
    // aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended
    'us-west-2': 'ami-00a3a4671e9889e76',
  }),
  instanceType: new ec2.InstanceType('inf1.2xlarge'),
  role: ltRole,
  securityGroup: gpuinstanceSecurityGroup,
  // blockDevices: [rootVolume],
  requireImdsv2: true,
});

@pahud added the p3 label on Jul 3, 2024
@github-actions (bot) removed the response-requested label on Jul 4, 2024
@rantoniuk
Author

rantoniuk commented Jul 4, 2024

Yes, I confirm that fixes the issue:

    const userDataInf1 = ec2.UserData.forLinux();

    // Inf1 EC2 Launch Template
    const launchTemplateInf1 = new ec2.LaunchTemplate(this, 'EcsClusterInf1', {
      machineImage:
        ec2.MachineImage.fromSsmParameter(
          '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended/image_id',
        ),
      instanceType: new ec2.InstanceType('inf1.2xlarge'),
      role: ltRole,
      userData: userDataInf1,
      securityGroup: gpuinstanceSecurityGroup,
      // blockDevices: [rootVolume],
      requireImdsv2: true,
    });

However, let me ask two follow-up questions:

  1. Is this a CloudFormation requirement or a CDK requirement? If the latter, then rather than only documenting it in the README, CDK should automatically add ec2.UserData.forLinux() unless userData is defined otherwise.

  2. Unrelated to the initial issue, but when I tried to use:

machineImage:
  ec2.MachineImage.fromSsmParameter(
    '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended',
  ),

then CloudFormation complained that it couldn't find imageId. I had to append an undocumented suffix, giving '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended/image_id'. Maybe this is something to add to the documentation directly.
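
On the second question: as far as I understand, the .../recommended parameter holds a JSON document describing the AMI, while .../recommended/image_id holds only the AMI ID, which is the value fromSsmParameter needs. A small sketch of a helper that normalizes the path follows. imageIdParameter is a hypothetical name, not a CDK API.

```typescript
// Hypothetical helper, not part of aws-cdk-lib.
// Appends '/image_id' to an ECS-optimized AMI "recommended" parameter path
// unless the path already ends with it. Trailing slashes are dropped first.
function imageIdParameter(recommendedPath: string): string {
  const trimmed = recommendedPath.replace(/\/+$/, '');
  return trimmed.endsWith('/image_id') ? trimmed : `${trimmed}/image_id`;
}

const neuronAmiParameter = imageIdParameter(
  '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended',
);

// Calling it again on an already-suffixed path leaves the path unchanged.
const neuronAmiParameterAgain = imageIdParameter(neuronAmiParameter);
```

The resulting string can then be passed to ec2.MachineImage.fromSsmParameter(...), as in the working launch template above.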
