feat(eks): add neuron device plugin for Inferentia managed node groups #27427

@freschri freschri commented Oct 6, 2023

In an EKS cluster with Inferentia instances, the Neuron device plugin is currently installed only when capacity is added through an Auto Scaling group.
This change extends that logic to managed node groups.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the beginning-contributor and p2 labels Oct 6, 2023
@aws-cdk-automation aws-cdk-automation requested a review from a team October 6, 2023 07:33

@aws-cdk-automation aws-cdk-automation left a comment

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@aws-cdk-automation

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 7d11e7b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@kaizencc kaizencc left a comment

I'm not sure what is going on here re: addNeuronDevicePlugin, and there's not really any additional information for me to figure it out. Is there a linked issue that can provide more background? Can you state more clearly what the intention is of this PR in the PR description? Lastly, what we definitely will need is at least 1 unit test

freschri commented Oct 7, 2023

> I'm not sure what is going on here re: addNeuronDevicePlugin, and there's not really any additional information for me to figure it out. Is there a linked issue that can provide more background? Can you state more clearly what the intention is of this PR in the PR description? Lastly, what we definitely will need is at least 1 unit test

The Neuron device plugin is described here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html (expand "Deploy Neuron..."):
"Neuron device plugin exposes Neuron cores & devices to kubernetes as a resource. aws.amazon.com/neuroncore, aws.amazon.com/neurondevice, aws.amazon.com/neuron are the resources that the neuron device plugin registers with the kubernetes. aws.amazon.com/neuroncore is used for allocating neuron cores to the container. aws.amazon.com/neurondevice is used for allocating neuron devices to the container. When neurondevice is used all the cores belonging to the device will be allocated to container."
The device plugin is therefore required not only when instances are launched in an Auto Scaling group but also when they are launched in a managed node group.
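
As a rough illustration of why this matters to workloads, a container consumes these resources by listing one of them under its resource limits, and that only works once the device plugin has registered them on the node. The manifest below is a minimal sketch written as a TypeScript object; the pod name, container name, and image are placeholders and are not taken from the Neuron docs or from this PR.

```typescript
// Illustrative only: names and image are placeholders (assumptions).
// The resource key is one of those the quoted docs say the plugin registers.
const container = {
  name: 'app',
  image: 'example.com/my-neuron-app:latest',
  resources: {
    // Extended resources are requested as whole units under `limits`.
    limits: { 'aws.amazon.com/neuroncore': 1 },
  },
};

const neuronPod = {
  apiVersion: 'v1',
  kind: 'Pod',
  metadata: { name: 'neuron-example' },
  spec: { containers: [container] },
};
```

Without the plugin on the node, a pod like this would stay unschedulable because no node advertises the resource.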

The change replicates the logic already present in addAutoScalingGroupCapacity:

if (nodeTypeForInstanceType(options.instanceType) === NodeType.INFERENTIA) {

Note that eksctl, for example, already handles this; see https://docs.aws.amazon.com/eks/latest/userguide/inferentia-support.html:
"The eksctl utility detects that you are launching a node group with an Inf1 instance type and will start your nodes using one of the Amazon EKS optimized accelerated Amazon Linux AMIs. "

The CDK needs to implement the same behaviour, otherwise managed node groups with Inferentia and Trainium devices will not be fully supported.
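
To make the intent concrete, here is a minimal, self-contained sketch of the decision being proposed. It is not the CDK's code: `SketchNodeType`, `sketchNodeTypeFor`, `needsNeuronDevicePlugin`, and the family list are hypothetical stand-ins, and the list of families is an assumption for illustration. The one difference from the Auto Scaling group case that the sketch tries to show is that a managed node group can declare several instance types, so the check has to look at all of them.

```typescript
// Hypothetical stand-in for the node type classification (not the CDK enum).
enum SketchNodeType {
  STANDARD = 'Standard',
  INFERENTIA = 'Inferentia',
}

// Assumed list of Neuron-capable instance families, for illustration only.
const NEURON_FAMILIES: string[] = ['inf1', 'inf2', 'trn1'];

// Classify an instance type string such as "inf1.xlarge" by its family prefix.
function sketchNodeTypeFor(instanceType: string): SketchNodeType {
  const [family = ''] = instanceType.toLowerCase().split('.');
  return NEURON_FAMILIES.indexOf(family) >= 0
    ? SketchNodeType.INFERENTIA
    : SketchNodeType.STANDARD;
}

// A managed node group may list several instance types; install the plugin
// if any of them is Neuron-capable.
function needsNeuronDevicePlugin(instanceTypes: string[]): boolean {
  return instanceTypes.some(
    (t) => sketchNodeTypeFor(t) === SketchNodeType.INFERENTIA,
  );
}
```

A unit test for the real change could assert along the same lines: a node group with an Inferentia instance type produces the device plugin manifest, and one with only standard types does not.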

@kaizencc kaizencc added the pr-linter/exempt-readme and pr-linter/exempt-integ-test labels Oct 9, 2023
kaizencc commented Oct 9, 2023

@freschri I put the exempt-integ-test and exempt-readme labels on your PR, but we will definitely need at least some kind of unit test before approving this PR.

@aws-cdk-automation

This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@aws-cdk-automation

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness label Nov 4, 2023
@aws-cdk-automation

The pull request linter fails with the following errors:

❌ Features must contain a change to a test file.

PRs must pass status checks before we can provide a meaningful review.

If you would like to request an exemption from the status checks or clarification on feedback, please leave a comment on this PR containing Exemption Request and/or Clarification Request.

Labels

  • beginning-contributor: [Pilot] contributed between 0-2 PRs to the CDK
  • closed-for-staleness: This issue was automatically closed because it hadn't received any attention in a while.
  • p2
  • pr-linter/exempt-integ-test: The PR linter will not require integ test changes
  • pr-linter/exempt-readme: The PR linter will not require README changes