
azure: Add check to verify image id #1846

Conversation

kartikjoshi21 (Contributor)

Add check to verify if podvm image is present

Fixes: #1842

@kartikjoshi21 (Contributor, Author)

1. Steps to test this change:
export AZURE_RESOURCE_GROUP="TestPR-1846"
export AZURE_REGION="eastus"
az group create --name "${AZURE_RESOURCE_GROUP}" \
    --location "${AZURE_REGION}"

export AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export USER_ASSIGNED_IDENTITY_NAME="caa-${AZURE_RESOURCE_GROUP}"
az identity create \
    --name "${USER_ASSIGNED_IDENTITY_NAME}" \
    --resource-group "${AZURE_RESOURCE_GROUP}" \
    --location "${AZURE_REGION}" \
    --subscription "${AZURE_SUBSCRIPTION_ID}"

export PRINCIPAL_ID="$(az identity show \
    --name "${USER_ASSIGNED_IDENTITY_NAME}" \
    --resource-group "${AZURE_RESOURCE_GROUP}" \
    --subscription "${AZURE_SUBSCRIPTION_ID}" --query principalId -otsv)"

sleep 30 # give the new identity time to propagate before the role assignment
az role assignment create \
    --role Contributor \
    --assignee-object-id "${PRINCIPAL_ID}" \
    --scope "/subscriptions/${AZURE_SUBSCRIPTION_ID}"

export AZURE_CLIENT_ID="$(az identity show \
    --resource-group "${AZURE_RESOURCE_GROUP}" \
    --name "${USER_ASSIGNED_IDENTITY_NAME}" --query 'clientId' -otsv)"

export CLUSTER_NAME="e2e"
# Intentionally invalid version, to exercise the new image check
export AZURE_IMAGE_ID="/CommunityGalleries/cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85/Images/podvm_image0/Versions/WrongImageVersion"

# Docker image for KBS
# https://github.com/confidential-containers/kbs/pkgs/container/staged-images%2Fkbs

cat <<EOF >/tmp/provision_azure.properties
AZURE_CLIENT_ID="${AZURE_CLIENT_ID}"
AZURE_SUBSCRIPTION_ID="${AZURE_SUBSCRIPTION_ID}"
RESOURCE_GROUP_NAME="${AZURE_RESOURCE_GROUP}"
CLUSTER_NAME="${CLUSTER_NAME}"
LOCATION="${AZURE_REGION}"
SSH_KEY_ID="id_rsa.pub"
AZURE_IMAGE_ID="${AZURE_IMAGE_ID}"

AZURE_CLI_AUTH="true"
MANAGED_IDENTITY_NAME="${USER_ASSIGNED_IDENTITY_NAME}"

# Deploy the same one that is merged on the CAA main
KBS_IMAGE="ghcr.io/confidential-containers/staged-images/kbs"
KBS_IMAGE_TAG="dc01f454264fb4350e5f69eba05683a9a1882c41"

# Get the tag from: https://quay.io/repository/confidential-containers/cloud-api-adaptor?tab=tags&tag=latest
CAA_IMAGE="<image built with this change>" # e.g. quay.io/karikjoshi21/cloud-api-adaptor:latest (prebuilt image with this change)
EOF

pushd src/cloud-api-adaptor/
ssh-keygen -t rsa -b 4096 -f install/overlays/azure/id_rsa -N "" -C [email protected]

pushd test
git clone git@github.com:confidential-containers/trustee.git

pushd trustee
git checkout dc01f454264fb4350e5f69eba05683a9a1882c41
popd
popd

# Now open a new terminal

export TEST_PROVISION_FILE=/tmp/provision_azure.properties
export CLOUD_PROVIDER=azure
export BUILTIN_CLOUD_PROVIDERS=azure
export DEPLOY_KBS=true
export TEST_PROVISION=true

pushd test/tools
make caa-provisioner-cli
./caa-provisioner-cli -action=provision

popd
2. Check the CAA pod logs:
2s (x7 over 90s)    Warning   FailedCreatePodSandBox        Pod/nginx-7b9964d4b9-8nbpq            Failed to create pod sandbox: rpc error: code = NotFound desc = failed to create containerd task: failed to create shim task: remote hypervisor call failed: rpc error: code = Unknown desc = creating an instance : GET https://management.azure.com/subscriptions/d1aa957b-94f5-49ef-b29a-0178c58a7132/providers/Microsoft.Compute/locations/eastus/communityGalleries/cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85/images/podvm_image0/versions/WrongImageVersion
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: NotFound
--------------------------------------------------------------------------------
{
  "error": {
    "code": "NotFound",
    "message": "Resource inside community gallery'cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85' is not found."
  }
}
--------------------------------------------------------------------------------
: not found
3. There shouldn't be any network interfaces created in the Azure portal for the resource group.

@mkulke (Contributor) left a comment

hmm, like we discussed offline, I am not convinced this is a good approach. I think this very check is already performed by the createVM API call, and I'm not sure what the additional value of doing it manually is.

If we were to start checking preconditions before VM start, why stop at the image? We could also check for the subnet ID, IAM rights, instance size availability in the region, etc.

@kartikjoshi21 (Contributor, Author)

> hmm, like we discussed offline, I am not convinced this is a good approach. I think this very check is already performed by the createVM API call, and I'm not sure what the additional value of doing it manually is.
>
> If we were to start checking preconditions before VM start, why stop at the image? We could also check for the subnet ID, IAM rights, instance size availability in the region, etc.

So I checked: without this image pre-check, multiple network interfaces keep being created even though the image is not available, until the range of available addresses is exhausted. So it looks like the createVM API call doesn't actually prevent this, because the NIC is created separately by createNetworkInterface in the CreateInstance call.

Also, we started with a check for image existence because that was our major concern: the image may have been deleted or may not be present in that particular region. We can extend this later with more pre-checks.
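
A minimal sketch of what such a pre-check could look like, assuming the podvm image is a community gallery image and using the Azure SDK for Go (the helper name and ID parsing here are illustrative, not this PR's actual diff):

package provider

import (
	"context"
	"fmt"
	"strings"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
)

// verifyCommunityGalleryImage resolves a community gallery image version
// before any VM resources are created, so a bad AZURE_IMAGE_ID fails fast.
func verifyCommunityGalleryImage(ctx context.Context, cred azcore.TokenCredential, subscriptionID, region, imageID string) error {
	// Expected: /CommunityGalleries/<gallery>/Images/<image>/Versions/<version>
	parts := strings.Split(strings.Trim(imageID, "/"), "/")
	if len(parts) != 6 || !strings.EqualFold(parts[0], "CommunityGalleries") {
		return fmt.Errorf("unexpected community gallery image id %q", imageID)
	}
	client, err := armcompute.NewCommunityGalleryImageVersionsClient(subscriptionID, cred, nil)
	if err != nil {
		return fmt.Errorf("creating community gallery image client: %w", err)
	}
	// A missing image surfaces here as a 404, before any NIC or disk exists.
	if _, err := client.Get(ctx, region, parts[1], parts[3], parts[5], nil); err != nil {
		return fmt.Errorf("podvm image %q not found in region %q: %w", imageID, region, err)
	}
	return nil
}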

@mkulke (Contributor) commented May 27, 2024

> So I checked: without this image pre-check, multiple network interfaces keep being created even though the image is not available, until the range of available addresses is exhausted. So it looks like the createVM API call doesn't actually prevent this, because the NIC is created separately by createNetworkInterface in the CreateInstance call.
>
> Also, we started with a check for image existence because that was our major concern: the image may have been deleted or may not be present in that particular region. We can extend this later with more pre-checks.

This will plug one particular hole (the image doesn't exist in this region), but there are multiple other ways VM creation can fail even if the image is present, and I don't think we can reasonably cover all of those error cases. For example, if we're not allowed to spawn a VM in the resource group because of a typo, we will run into the same problem.

2 questions:

	result, err := p.create(ctx, vmParameters)
	if err != nil {
		if err := p.deleteDisk(context.Background(), diskName); err != nil {
			logger.Printf("deleting disk (%s): %s", diskName, err)
		}
		if err := p.deleteNetworkInterfaceAsync(context.Background(), nicName); err != nil {
			logger.Printf("deleting nic async (%s): %s", nicName, err)
		}
		return nil, fmt.Errorf("Creating instance (%v): %s", result, err)
	}

After a failed VM creation we're supposed to delete the network interface, so why do the NICs still leak on failed VM creation? Maybe there is a bug in this logic.

The other question: do we have to create the network interface in a separate API call?

Looking at the ARM template documentation, it should be possible to specify a subnet ID in the VM creation call directly:

      "properties": {
        "ipConfigurations": [
          {
            "name": "ipconfig1",
            "properties": {
              "privateIPAllocationMethod": "Dynamic",
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses', parameters('publicIpName'))]"
              },
              "subnet": {
                "id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', variables('virtualNetworkName'), variables('subnetName'))]"
              }
            }
          }
        ]
      },

Add check to verify if podvm image is present

Fixes: confidential-containers#1842
Signed-off-by: Kartik Joshi <[email protected]>
@kartikjoshi21 force-pushed the kartikjoshi21/pre-podvmimage-check branch from afbda37 to 913b0b9 on May 27, 2024 09:37
@mkulke (Contributor) commented May 28, 2024

@kartikjoshi21 I think the problem with the NIC cleanup not working after a failed VM creation is that the deletion operation is async. So it will attempt to:

  • create a NIC
  • create a VM
  • fail to create the VM
  • trigger an async delete of the NIC
  • repeat

This way you'll end up creating a lot of NICs in a busy loop. I think we should delete the NIC synchronously. That should address the problem; then we don't have to probe for image existence.
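
For illustration, a minimal sketch of a synchronous deletion, assuming an armnetwork.InterfacesClient is available (function and variable names are illustrative, not the provider's actual code):

package provider

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
)

// deleteNICSync deletes a NIC and blocks until Azure confirms the deletion,
// so a retried VM creation cannot outpace cleanup and leak NICs.
func deleteNICSync(ctx context.Context, client *armnetwork.InterfacesClient, resourceGroup, nicName string) error {
	poller, err := client.BeginDelete(ctx, resourceGroup, nicName, nil)
	if err != nil {
		return fmt.Errorf("starting delete of nic %s: %w", nicName, err)
	}
	// Unlike the fire-and-forget async variant, PollUntilDone waits for the
	// long-running delete operation to actually finish.
	if _, err := poller.PollUntilDone(ctx, nil); err != nil {
		return fmt.Errorf("waiting for delete of nic %s: %w", nicName, err)
	}
	return nil
}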

@kartikjoshi21 (Contributor, Author) commented May 29, 2024

> @kartikjoshi21 I think the problem with the NIC cleanup not working after a failed VM creation is that the deletion operation is async. So it will attempt to:
>
> • create a NIC
> • create a VM
> • fail to create the VM
> • trigger an async delete of the NIC
> • repeat
>
> This way you'll end up creating a lot of NICs in a busy loop. I think we should delete the NIC synchronously. That should address the problem; then we don't have to probe for image existence.

Yes, I checked that, and this looks like the problem. I will modify this change. Thanks @mkulke

@mkulke (Contributor) commented Sep 26, 2024

This should be superseded by #2056: a missing image will not cause resource leakage if the resources are passively managed by Azure.
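
For reference, "passively managed" means letting the VM own its NIC instead of creating it in a separate call. A rough sketch of the idea, assuming armcompute's inline NIC configuration support (field values are illustrative, not the actual #2056 diff):

package provider

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
)

// inlineNICProfile builds a NetworkProfile where Azure creates the NIC as
// part of the VM itself: if VM creation fails, no orphaned NIC is left
// behind, and DeleteOptionsDelete ties the NIC's lifetime to the VM's.
func inlineNICProfile(subnetID string) *armcompute.NetworkProfile {
	return &armcompute.NetworkProfile{
		NetworkAPIVersion: to.Ptr(armcompute.NetworkAPIVersionTwoThousandTwenty1101),
		NetworkInterfaceConfigurations: []*armcompute.VirtualMachineNetworkInterfaceConfiguration{{
			Name: to.Ptr("podvm-nic"),
			Properties: &armcompute.VirtualMachineNetworkInterfaceConfigurationProperties{
				Primary:      to.Ptr(true),
				DeleteOption: to.Ptr(armcompute.DeleteOptionsDelete),
				IPConfigurations: []*armcompute.VirtualMachineNetworkInterfaceIPConfiguration{{
					Name: to.Ptr("ipconfig1"),
					Properties: &armcompute.VirtualMachineNetworkInterfaceIPConfigurationProperties{
						Subnet: &armcompute.SubResource{ID: to.Ptr(subnetID)},
					},
				}},
			},
		}},
	}
}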

@mkulke closed this on Sep 29, 2024