When using skills with user assigned managed identity, the MSI endpoint fails with a 500 #6778

Open
jamesemann opened this issue Apr 24, 2024 · 49 comments
Labels
bug Indicates an unexpected problem or an unintended behavior. needs-triage The issue has just been created and it has not been reviewed by the team.

Comments

@jamesemann

Version

Latest

Describe the bug

When using a user-assigned managed identity in Azure App Service, invoking a skill fails when requesting a token from the MSI endpoint (http://127.0.0.1:41120/msi/token/?api-version=2019-08-01&resource=<redacted>&client_id=<redacted>), which returns an HTTP 500 (with no detail) followed by the error below:

Failed to acquire token for client credentials. ([Managed Identity] Error Message: An unexpected error occurred while fetching the AAD Token. Managed Identity Correlation ID: 3b30b8c8-c05d-47d4-9e4b-c74af665aadc Use this Correlation ID for further investigation.)
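
For reference, the request in the description has the shape of an App Service managed identity token request. The Python sketch below rebuilds such a request so it can be replayed from inside the web app, independently of the SDK. It is an illustration under assumptions, not the SDK's code path: `IDENTITY_ENDPOINT` and `IDENTITY_HEADER` are the environment variables App Service provides to the app, while `build_msi_request`, `fetch_token`, and the argument values are hypothetical names chosen here.

```python
import os
import urllib.parse
import urllib.request


def build_msi_request(resource, client_id, env=None):
    """Return (url, headers) for an App Service managed identity token request."""
    env = os.environ if env is None else env
    # IDENTITY_ENDPOINT looks like http://127.0.0.1:41120/msi/token/ on App Service.
    endpoint = env["IDENTITY_ENDPOINT"]
    query = urllib.parse.urlencode({
        "api-version": "2019-08-01",
        "resource": resource,      # becomes the audience of the issued token
        "client_id": client_id,    # client ID of the user-assigned identity
    })
    headers = {"X-IDENTITY-HEADER": env["IDENTITY_HEADER"]}
    return f"{endpoint}?{query}", headers


def fetch_token(resource, client_id):
    """Send the request; urllib raises HTTPError when the endpoint answers 500."""
    url, headers = build_msi_request(resource, client_id)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_token` with the redacted `resource` and `client_id` values from the failing request should reproduce the HTTP 500 outside the bot code, which helps separate an SDK problem from a platform one.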

To Reproduce

This is happening in our production system (where previously provisioned bots have been working). I have also reproduced it with a minimal repro on the bot builder skills sample:

  1. Clone this repo and open the samples/csharp_dotnetcore/80.skills-simple-bot-to-bot

  2. You will need to set up the following in Azure (using the standard User assigned managed identity stuff)

  • An azure bot configured with User Assigned Managed Identity for the root
  • An azure bot configured with User Assigned Managed Identity for the child
  • A webapp for the root (parent), set to use the User assigned managed identity of the root bot. Also set the AZURE_CLIENT_ID application setting to the root bot's managed identity client id.
  • A webapp for the child, set to use the User assigned managed identity of the child bot. Also set the AZURE_CLIENT_ID application setting to the child bot's managed identity client id.
  • Set the messaging endpoints in the root (parent) and child bots according to the sample
  • Now edit the appsettings in the EchoSkillBot (which will be deployed to the child webapp) to
{
  "MicrosoftAppType": "UserAssignedMSI",
  "MicrosoftAppTenantId": "<tenant id>",
  "MicrosoftAppId": "<child managed identity client id>",
  "AllowedCallers": [ "*" ]
}

  • Now edit the appsettings in the SimpleRootBot (which will be deployed to the root webapp) to
{

  "MicrosoftAppType": "UserAssignedMSI",
  "MicrosoftAppTenantId": "<tenant id>",
  "MicrosoftAppId": "<parent managed identity client id>",

  "SkillHostEndpoint": "https://<root webapp domain>/api/skills/",
  "BotFrameworkSkills": [
    {
      "Id": "EchoSkillBot",
      "AppId":  "<child managed identity client id>",
      "SkillEndpoint": "https://<child webapp domain>/api/messages"
    }
  ]
}

  • Publish both projects to their corresponding webapps
  • Test the parent bot in the Azure bot resource, type skill, you will see the following

image

ℹ️ If you look in the parent appinsights, you will see the error
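
The two appsettings fragments in the repro steps have to agree with each other: the child's `MicrosoftAppId` must appear as a skill `AppId` in the root, and the child's `AllowedCallers` must admit the root. A minimal Python sketch of that cross-check follows. `check_skill_config` is a hypothetical helper, not part of the SDK, and it only inspects the keys shown in the steps above.

```python
def check_skill_config(root, child):
    """Return a list of mismatches between root and child appsettings dicts."""
    problems = []
    for name, cfg in (("root", root), ("child", child)):
        if cfg.get("MicrosoftAppType") != "UserAssignedMSI":
            problems.append(f"{name}: MicrosoftAppType is not UserAssignedMSI")
        if not cfg.get("MicrosoftAppId"):
            problems.append(f"{name}: MicrosoftAppId is empty")

    # The root must list the child's managed identity client ID as a skill AppId.
    skill_app_ids = {s.get("AppId") for s in root.get("BotFrameworkSkills", [])}
    if child.get("MicrosoftAppId") not in skill_app_ids:
        problems.append("root: no BotFrameworkSkills entry uses the child's MicrosoftAppId")

    # The child must accept the root, either explicitly or through the wildcard.
    callers = child.get("AllowedCallers", [])
    if "*" not in callers and root.get("MicrosoftAppId") not in callers:
        problems.append("child: AllowedCallers does not admit the root bot")
    return problems
```

An empty list only rules out a configuration mismatch. The repro above passes this kind of check and still hits the HTTP 500.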

Expected behavior

It should authenticate and communicate with the skill as expected.

Additional context

Please reach out if you need any further info, screen shares etc.

This is impacting a few of our customers, so is high priority for us. Thank you!

@jamesemann jamesemann added bug Indicates an unexpected problem or an unintended behavior. needs-triage The issue has just been created and it has not been reviewed by the team. labels Apr 24, 2024
@jamesemann
Author

Hi @ceciliaavila @tracyboehrer would it be possible to triage this item? This is very high priority for our customers

@tracyboehrer
Member

@jamesemann Was this previously working in production with MSI?

@jamesemann
Author

> @jamesemann Was this previously working in production with MSI?

Yes it was @tracyboehrer

@tracyboehrer
Member

@jamesemann When did it stop? Any updates to the SDK version?

@jamesemann
Author

> @jamesemann When did it stop? Any updates to the SDK version?

@tracyboehrer I first spotted it yesterday, as it is the first time I’ve provisioned a skill for a few weeks. No recent updates to the SDK version.

@jamesemann
Author

jamesemann commented Apr 26, 2024

@tracyboehrer apologies I was away from the computer yesterday so couldn't confirm the exact version. I've checked and:

  • our production code is using 4.21.1 (since 24th Nov 2023)
  • the repro using your EchoSkillBot is using 4.22.3

(I see the same behaviour in both)

@tracyboehrer
Member

@jamesemann Ya. I wouldn't expect any changes between those two versions. Was trying to isolate if a major jump in SDK version had been made. Like from 4.18 to 4.22 or something larger. We will get set up to repro. Worth noting, we haven't made any explicit changes to this. But dependency changes can be wicked on occasion.

@tracyboehrer
Member

@jamesemann How did you deploy the bots? One of us encountered a failure that matches your screenshot when using AZ commands. I used the ARM Templates (and associated doc) and this sample appears to work normally.

image

@jamesemann
Author

@tracyboehrer thank you - interesting info. I created them through the Azure portal - in our product we use the ARM templates (through a template spec) though, and the properties look the same.

Let me deploy a new set of resources for the sample using the ARM templates in the doc, and re-test. I'll report back as soon as possible.

@tracyboehrer
Member

@jamesemann We should be able to compare here too.

@jamesemann
Author

@tracyboehrer unfortunately I'm getting the same error (HTTP 500 when requesting the managed identity token) after provisioning the resources using the templates in the bot builder repo.

It seems to be a global problem for me. One thing I haven't tried is deploying to a different Azure subscription, so I'll try that next. I'll report back with the result of that.

@jamesemann
Author

Update - same result using the ARM templates on a new subscription

@jamesemann
Author

jamesemann commented Apr 29, 2024

@tracyboehrer I've found the underlying error when we see the HTTP 500. It is visible in the managed identity sign-in logs in Azure AD/Entra

AppId: '{appId}' can not use Managed Service Identity (MSI) as audience in token as it is unsupported. MSI should not be set as audience as it does not accept tokens.

(I can share the activity details privately, if necessary)
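
The sign-in log message concerns the audience of the requested token, which is the `aud` claim once a token is issued. On an identity where the request still succeeds, the claim can be inspected by decoding the token payload. The Python sketch below does that without verifying the signature, so it is suitable for diagnosis only. `jwt_claims` and `audience_is_identity` are hypothetical helper names, and the comparison assumes `aud` is the bare client ID rather than a URI form.

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def audience_is_identity(token, msi_client_id):
    """True when the token's audience is the given managed identity client ID."""
    return jwt_claims(token).get("aud") == msi_client_id
```

If `audience_is_identity` returns true for the child's managed identity client ID, the token has the audience shape that the error message describes as unsupported.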

@tracyboehrer
Member

@jamesemann Still conferring with some about this.

@jamesemann
Author

@tracyboehrer any luck on this?

One question I did have was that the sample worked for you - did you use an existing managed identity or create a new one? One thing I have noticed is that this seems to be a problem only with recently created managed identities. We have a lot of existing managed identities for other customers and are not seeing the same behaviour.

@tracyboehrer
Member

@jamesemann I created new ones, which is what the ARM templates do. I have confirmed that our support folks can't get the Skill to work. It remains a mystery why mine do. I have confirmed both Root and Skill are MSI.

@jamesemann
Author

Hi @tracyboehrer , is there any additional context (or anything!) I can provide to help move this forward? It is unfortunately still impacting our tenant and our customers

@tracyboehrer
Member

Status at the moment is that for some it works fine. For example, I don't have an issue, and multiple internal MS groups haven't had issues switching to MSI with Skills.

@jamesemann
Author

Thank you @tracyboehrer, can you explain how that impacts the status of this ticket? For example, will it continue to be investigated?

We have several large existing customers this impacts and it will impact new customers too. This problem unfortunately doesn’t seem to be going away, so we need a plan.

@tracyboehrer
Member

@jamesemann It doesn't change the status at all. Still actively being worked on.

@tracyboehrer
Member

@jamesemann These would be customers on their own tenant, correct? Microsoft in general is required to switch from secret based to UserAssignedMSI or certificate. Though some customers are moving to SingleTenant. Still secret based though.

@jamesemann
Author

Yes that is correct. We have a few app reg backed bots, mostly on our saas platform but >95% of our marketplace customers (who host an instance of our platform on their own azure tenant) have exclusively user assigned managed identity bots.

@shusson

shusson commented May 20, 2024

I don't have a lot of context here, but Microsoft recently started blocking MSI as Audience in a token. Tenants that were using this flow before/around April, should be on an allow list.

@jamesemann
Author

@shusson do you have a source for this? This is our suspicion too, although new MSIs on our tenant (created within the last few weeks) are failing. Existing MSIs seem ok.

@tracyboehrer
Member

@jamesemann Can you confirm that the MSI client ID is being used as the MicrosoftAppId, and the version of the SDK?

@jamesemann
Author

@tracyboehrer yes it is, and the version is 4.22.3.

It is also happening for version 4.21.1 which we have been using in production since 24th November.

@tracyboehrer
Member

@jamesemann Thanks. I wouldn't expect there to be a difference in the SDK versions, rather checking out the MSAL dependencies in use.

@jamesemann
Author

jamesemann commented May 28, 2024

Hi @tracyboehrer

We have had an ongoing ticket with Azure support for this. We've received the following update which basically confirms our suspicion regarding MSI now being invalid as audience.

Product group was able to confirm tenant "<tenant>.onmicrosoft.com" is not in the allowed list. You will need to use a different audience (service principal) for token, as MSI should not be used as audience in token.

Sharing in case of

  1. It's useful info
  2. Do you know how we can be added to the allow list to preserve the old behaviour?

@tracyboehrer
Member

@jamesemann I rather doubt there is a way to get on that list, and I've had this suspicion it's working for some now on borrowed time. The alternative would be certificate auth. This has been confirmed to work in JS. Fix merged in DotNet, expecting a patch release this week.

@jamesemann
Author

Ok thanks @tracyboehrer. Will there also be a fix for managed identity? I am assuming (maybe incorrectly - please correct me if so) that certificate auth uses an Azure AD/Entra ID app - we have some restrictions around creating Azure AD apps in our customers' tenants (we can't create them).

@tracyboehrer
Member

@jamesemann There is no known "fix" for MSI at the moment. At least in code. All of this goes through the MS auth packages. There could be other ways to configure this though.

@tracyboehrer
Member

@jamesemann Have you ever captured the response from MSAL when this happens? I may have found the correct group to talk to. I believe they are going to require the response since it contains the trace and correlation ids. You can send this to me out-of-band.

@jamesemann
Author

@tracyboehrer thank you - I can definitely provide this. Can you let me know how to capture it and I will prioritise getting it to you.

@jamesemann
Author

@tracyboehrer what info do I need to provide? Is the MSAL response different from the response I provided in the description (copied here for clarity)?

Failed to acquire token for client credentials. ([Managed Identity] Error Message: An unexpected error occurred while fetching the AAD Token. Managed Identity Correlation ID: 3b30b8c8-c05d-47d4-9e4b-c74af665aadc Use this Correlation ID for further investigation.)
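
Since the investigation hinges on the correlation ID embedded in this message, it can help to pull it out of logged exceptions automatically. The Python sketch below does so with a regular expression built from the message format quoted above. `correlation_id` is a hypothetical helper, and the pattern assumes the wording of the message stays as shown.

```python
import re

# Matches the GUID that follows "Managed Identity Correlation ID:" in the message.
_CORRELATION = re.compile(
    r"Managed Identity Correlation ID:\s*"
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def correlation_id(message):
    """Return the correlation ID found in an error message, or None."""
    match = _CORRELATION.search(message)
    return match.group(1) if match else None
```

Running it over exported Application Insights exception messages would give a list of IDs to hand to support alongside timestamps.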

@jamesemann
Author

Hi @tracyboehrer I know you are busy with various things but wanted to follow up on this as it is still causing our customers issues

@jamesemann
Author

Hi @tracyboehrer, can you help?

@MattB-msft
Member

Hi @jamesemann, sorry for the delay in responding here.
@tracyboehrer had suggested that you switch over to the certificate flows to unblock yourself, due to the requirement change by Entra ID with regard to managed identities.

Did you do that, and did it resolve the problem for you?

@jamesemann
Author

jamesemann commented Aug 19, 2024

@MattB-msft thank you.

I did try that - however this workaround requires an app registration, which we are not permitted to create in our customers' tenants due to security policies. So unfortunately it did not resolve the problem.

@MattB-msft
Member

Ok, thanks for that information, We will chat about this internally.

@jamesemann
Author

Hi @MattB-msft - have you been able to progress this on your side? If so would you be able to share an update - thank you!

@jamesemann
Author

Hi @MattB-msft @tracyboehrer @sw-joelmut @ceciliaavila . I'm not sure who best to direct this to - hopefully one of you. Would it be possible to get an update as this is still impacting our customers. Thank you

@linkcd

linkcd commented Sep 20, 2024

I also encountered this error "can not use Managed Service Identity (MSI) as audience in token as it is unsupported. MSI should not be set as audience as it does not accept tokens" while the code worked perfectly a few months back.

Sorry for not paying attention to any announcement of this change in Entra ID, but what is the recommended alternative that is close to the managed identity approach?

thanks

@jamesemann
Author

Hi @MattB-msft, @tracyboehrer, @sw-joelmut, and @ceciliaavila,

I’m following up again on this issue, as it continues to impact our customers without any resolution in sight. We’ve been waiting for an update for several months now, and the lack of progress is becoming increasingly problematic for us.

As mentioned previously, we’re unable to use the certificate-based authentication workaround due to restrictions on creating app registrations in our customers’ tenants. Given the extended timeline and the ongoing issues this is causing, could you please provide an update or clarify what the next steps are? It’s critical that we find a resolution as soon as possible.

If there’s anything further you need from us to help move this forward, please let me know.

@jamesemann
Author

Hi @MattB-msft and @tracyboehrer, would one of you be able to provide an update? Thanks!

@linkcd

linkcd commented Oct 7, 2024

I went for a different approach to get my project up and running again. You can check the details here: https://feng.lu/2024/09/18/How-to-secretless-access-Azure-and-AWS-resources-with-Azure-managed-identity-and-AWS-IAM/

It does require creating an app in Entra ID, so I guess it does not help you, @jamesemann?

@jamesemann
Author

Thank you for the information @linkcd . I appreciate you taking the time to write this up.

Unfortunately because it requires editing an app reg, we are not able to use this workaround.

@jamesemann
Author

@MattB-msft and @tracyboehrer - could you please provide an update - many thanks!

@jamesemann
Author

@MattB-msft and @tracyboehrer - would you be able to provide an update? thank you!

@jamesemann
Author

Hi @MattB-msft, @tracyboehrer, and team,

I'm following up again as this issue continues to impact our customers with no clear resolution in sight. Unfortunately, due to restrictions in our customers' environments, we are unable to implement the app registration-based workaround suggested by @linkcd.

Could you please provide an update on the current status of this issue, or any alternative solutions that may be applicable? We need to find a resolution as this is significantly affecting our customers' environments.

Thank you for your attention to this, and please let us know if you require any further details from us to help move this forward.

6 participants