
azure: Add check to verify image id #1846

Conversation

kartikjoshi21 (Contributor)

Add check to verify if podvm image is present

Fixes: #1842

@kartikjoshi21 (Contributor, Author)

1. Steps to test this change:
export AZURE_RESOURCE_GROUP="TestPR-1846"
export AZURE_REGION="eastus"
az group create --name "${AZURE_RESOURCE_GROUP}" \
    --location "${AZURE_REGION}"

export AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export USER_ASSIGNED_IDENTITY_NAME="caa-${AZURE_RESOURCE_GROUP}"
az identity create \
    --name "${USER_ASSIGNED_IDENTITY_NAME}" \
    --resource-group "${AZURE_RESOURCE_GROUP}" \
    --location "${AZURE_REGION}" \
    --subscription "${AZURE_SUBSCRIPTION_ID}"

export PRINCIPAL_ID="$(az identity show \
    --name "${USER_ASSIGNED_IDENTITY_NAME}" \
    --resource-group "${AZURE_RESOURCE_GROUP}" \
    --subscription "${AZURE_SUBSCRIPTION_ID}" --query principalId -otsv)"

sleep 30 # give the new identity time to propagate before the role assignment
az role assignment create \
    --role Contributor \
    --assignee-object-id "${PRINCIPAL_ID}" \
    --scope "/subscriptions/${AZURE_SUBSCRIPTION_ID}"

export AZURE_CLIENT_ID="$(az identity show \
    --resource-group "${AZURE_RESOURCE_GROUP}" \
    --name "${USER_ASSIGNED_IDENTITY_NAME}" --query 'clientId' -otsv)"

export CLUSTER_NAME="e2e"
# Intentionally invalid version, to exercise the new image check
export AZURE_IMAGE_ID="/CommunityGalleries/cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85/Images/podvm_image0/Versions/WrongImageVersion"

# Docker image for KBS
# https://github.com/confidential-containers/kbs/pkgs/container/staged-images%2Fkbs

cat <<EOF >/tmp/provision_azure.properties
AZURE_CLIENT_ID="${AZURE_CLIENT_ID}"
AZURE_SUBSCRIPTION_ID="${AZURE_SUBSCRIPTION_ID}"
RESOURCE_GROUP_NAME="${AZURE_RESOURCE_GROUP}"
CLUSTER_NAME="${CLUSTER_NAME}"
LOCATION="${AZURE_REGION}"
SSH_KEY_ID="id_rsa.pub"
AZURE_IMAGE_ID="${AZURE_IMAGE_ID}"

AZURE_CLI_AUTH="true"
MANAGED_IDENTITY_NAME="${USER_ASSIGNED_IDENTITY_NAME}"

# Deploy the same one that is merged on the CAA main
KBS_IMAGE="ghcr.io/confidential-containers/staged-images/kbs"
KBS_IMAGE_TAG="dc01f454264fb4350e5f69eba05683a9a1882c41"

# Get the tag from: https://quay.io/repository/confidential-containers/cloud-api-adaptor?tab=tags&tag=latest
CAA_IMAGE="<image built with this change>" # e.g. quay.io/karikjoshi21/cloud-api-adaptor:latest (prebuilt image with this change)
EOF

pushd src/cloud-api-adaptor/
ssh-keygen -t rsa -b 4096 -f install/overlays/azure/id_rsa -N "" -C [email protected]

pushd test
git clone git@github.com:confidential-containers/trustee.git

pushd trustee
git checkout dc01f454264fb4350e5f69eba05683a9a1882c41
popd
popd

# Now open a new terminal

export TEST_PROVISION_FILE=/tmp/provision_azure.properties
export CLOUD_PROVIDER=azure
export BUILTIN_CLOUD_PROVIDERS=azure
export DEPLOY_KBS=true
export TEST_PROVISION=true

pushd test/tools
make caa-provisioner-cli
./caa-provisioner-cli -action=provision

popd
2. Check the CAA pod logs:
2s (x7 over 90s)    Warning   FailedCreatePodSandBox        Pod/nginx-7b9964d4b9-8nbpq            Failed to create pod sandbox: rpc error: code = NotFound desc = failed to create containerd task: failed to create shim task: remote hypervisor call failed: rpc error: code = Unknown desc = creating an instance : GET https://management.azure.com/subscriptions/d1aa957b-94f5-49ef-b29a-0178c58a7132/providers/Microsoft.Compute/locations/eastus/communityGalleries/cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85/images/podvm_image0/versions/WrongImageVersion
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: NotFound
--------------------------------------------------------------------------------
{
  "error": {
    "code": "NotFound",
    "message": "Resource inside community gallery'cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85' is not found."
  }
}
--------------------------------------------------------------------------------
: not found
3. There shouldn't be any network interfaces created in the Azure portal for the resource group.

@mkulke (Contributor) left a comment

hmm, like we discussed offline, I am not convinced this is a good approach. I think this very check is already performed by the createVM API call, and I'm not sure what the additional value of doing it manually is.

If we were to start checking preconditions before VM start, why stop at the image? We could also check for the subnet ID, IAM rights, instance size availability in the region, etc.

@kartikjoshi21 (Contributor, Author)

> hmm, like we discussed offline, I am not convinced this is a good approach. I think this very check is already performed by the createVM API call, and I'm not sure what the additional value of doing it manually is.
>
> If we were to start checking preconditions before VM start, why stop at the image? We could also check for the subnet ID, IAM rights, instance size availability in the region, etc.

So I checked: without this image pre-check, multiple network interfaces keep being created even though the image is not available, until the range of available addresses is exhausted. So it looks like the createVM API call doesn't actually prevent this, because the NIC is created separately by createNetworkInterface in the CreateInstance call.

Also, we started with a check for image existence because that was our major concern: the image may have been deleted or may not be present in that particular region. We can extend this later with more pre-checks.
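
A minimal sketch of what such a pre-check could look like, assuming the podvm image is a community gallery image and using the Azure SDK for Go (the helper name and ID parsing here are illustrative, not this PR's actual diff):

package provider

import (
	"context"
	"fmt"
	"strings"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
)

// verifyCommunityGalleryImage resolves a community gallery image version
// before any VM resources are created, so a bad AZURE_IMAGE_ID fails fast.
func verifyCommunityGalleryImage(ctx context.Context, cred azcore.TokenCredential, subscriptionID, region, imageID string) error {
	// Expected: /CommunityGalleries/<gallery>/Images/<image>/Versions/<version>
	parts := strings.Split(strings.Trim(imageID, "/"), "/")
	if len(parts) != 6 || !strings.EqualFold(parts[0], "CommunityGalleries") {
		return fmt.Errorf("unexpected community gallery image id %q", imageID)
	}
	client, err := armcompute.NewCommunityGalleryImageVersionsClient(subscriptionID, cred, nil)
	if err != nil {
		return fmt.Errorf("creating community gallery image client: %w", err)
	}
	// A missing image surfaces here as a 404, before any NIC or disk exists.
	if _, err := client.Get(ctx, region, parts[1], parts[3], parts[5], nil); err != nil {
		return fmt.Errorf("podvm image %q not found in region %q: %w", imageID, region, err)
	}
	return nil
}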

@mkulke (Contributor) commented May 27, 2024

> So I checked: without this image pre-check, multiple network interfaces keep being created even though the image is not available, until the range of available addresses is exhausted. So it looks like the createVM API call doesn't actually prevent this, because the NIC is created separately by createNetworkInterface in the CreateInstance call.
>
> Also, we started with a check for image existence because that was our major concern: the image may have been deleted or may not be present in that particular region. We can extend this later with more pre-checks.

This will plug one particular hole (the image doesn't exist in this region), but there are multiple other ways VM creation can fail even if the image is present, and I don't think we can reasonably cover all of those error cases. For example, if we're not allowed to spawn a VM in the resource group because of a typo, we will run into the same problem.

2 questions:

	result, err := p.create(ctx, vmParameters)
	if err != nil {
		if err := p.deleteDisk(context.Background(), diskName); err != nil {
			logger.Printf("deleting disk (%s): %s", diskName, err)
		}
		if err := p.deleteNetworkInterfaceAsync(context.Background(), nicName); err != nil {
			logger.Printf("deleting nic async (%s): %s", nicName, err)
		}
		return nil, fmt.Errorf("Creating instance (%v): %s", result, err)
	}

After a failed VM creation we're supposed to delete the network interface, so why do the NICs still leak on failed VM creation? Maybe there is a bug in this logic.

The other question: do we have to create the network interface in a separate API call?

Looking at the ARM template documentation, it should be possible to specify a subnet ID in the VM creation call directly:

      "properties": {
        "ipConfigurations": [
          {
            "name": "ipconfig1",
            "properties": {
              "privateIPAllocationMethod": "Dynamic",
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses', parameters('publicIpName'))]"
              },
              "subnet": {
                "id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', variables('virtualNetworkName'), variables('subnetName'))]"
              }
            }
          }
        ]
      },

Add check to verify if podvm image is present

Fixes: confidential-containers#1842
Signed-off-by: Kartik Joshi <[email protected]>
@kartikjoshi21 force-pushed the kartikjoshi21/pre-podvmimage-check branch from afbda37 to 913b0b9 on May 27, 2024 09:37
@mkulke (Contributor) commented May 28, 2024

@kartikjoshi21 I think the problem with the NIC cleanup not working after a failed VM creation is that the deletion operation is async. So it will attempt to:

  • create a NIC
  • create a VM
  • fail to create the VM
  • trigger an async delete of the NIC
  • repeat

This way you'll end up creating a lot of NICs in a busy loop. I think we should delete the NIC synchronously. That should address the problem; then we don't have to probe for image existence.
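
For illustration, a minimal sketch of a synchronous deletion, assuming an armnetwork.InterfacesClient is available (function and variable names are illustrative, not the provider's actual code):

package provider

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
)

// deleteNICSync deletes a NIC and blocks until Azure confirms the deletion,
// so a retried VM creation cannot outpace cleanup and leak NICs.
func deleteNICSync(ctx context.Context, client *armnetwork.InterfacesClient, resourceGroup, nicName string) error {
	poller, err := client.BeginDelete(ctx, resourceGroup, nicName, nil)
	if err != nil {
		return fmt.Errorf("starting delete of nic %s: %w", nicName, err)
	}
	// Unlike the fire-and-forget async variant, PollUntilDone waits for the
	// long-running delete operation to actually finish.
	if _, err := poller.PollUntilDone(ctx, nil); err != nil {
		return fmt.Errorf("waiting for delete of nic %s: %w", nicName, err)
	}
	return nil
}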

@kartikjoshi21 (Contributor, Author) commented May 29, 2024

> @kartikjoshi21 I think the problem with the NIC cleanup not working after a failed VM creation is that the deletion operation is async. So it will attempt to:
>
> • create a NIC
> • create a VM
> • fail to create the VM
> • trigger an async delete of the NIC
> • repeat
>
> This way you'll end up creating a lot of NICs in a busy loop. I think we should delete the NIC synchronously. That should address the problem; then we don't have to probe for image existence.

Yes, I checked that, and this looks like the problem. I will modify this change. Thanks @mkulke

@mkulke (Contributor) commented Sep 26, 2024

This should be superseded by #2056: a missing image will not cause resource leakage if the resources are passively managed by Azure.
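
For reference, "passively managed" means letting the VM own its NIC instead of creating it in a separate call. A rough sketch of the idea, assuming armcompute's inline NIC configuration support (field values are illustrative, not the actual #2056 diff):

package provider

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
)

// inlineNICProfile builds a NetworkProfile where Azure creates the NIC as
// part of the VM itself: if VM creation fails, no orphaned NIC is left
// behind, and DeleteOptionsDelete ties the NIC's lifetime to the VM's.
func inlineNICProfile(subnetID string) *armcompute.NetworkProfile {
	return &armcompute.NetworkProfile{
		NetworkAPIVersion: to.Ptr(armcompute.NetworkAPIVersionTwoThousandTwenty1101),
		NetworkInterfaceConfigurations: []*armcompute.VirtualMachineNetworkInterfaceConfiguration{{
			Name: to.Ptr("podvm-nic"),
			Properties: &armcompute.VirtualMachineNetworkInterfaceConfigurationProperties{
				Primary:      to.Ptr(true),
				DeleteOption: to.Ptr(armcompute.DeleteOptionsDelete),
				IPConfigurations: []*armcompute.VirtualMachineNetworkInterfaceIPConfiguration{{
					Name: to.Ptr("ipconfig1"),
					Properties: &armcompute.VirtualMachineNetworkInterfaceIPConfigurationProperties{
						Subnet: &armcompute.SubResource{ID: to.Ptr(subnetID)},
					},
				}},
			},
		}},
	}
}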

@mkulke closed this on Sep 29, 2024