
[BUG] too many (unnecessary) requests to the metadata URL #1842

Closed
pc-jedi opened this issue May 17, 2022 · 10 comments
Labels
bug (Something isn't working) · help wanted (Community contributions are especially encouraged for these issues)

Comments

@pc-jedi

pc-jedi commented May 17, 2022

What is the bug?
Whenever there is a write request (POST, PUT, DELETE) towards the /_plugins/_security/ API, each OpenSearch node requests a fresh copy of the SAML metadata from the IdP.
This happens no matter whether it is a role, tenant, user, etc. that gets created/updated/deleted.

Since we have a lot of OpenSearch clusters and we manage the internal users, roles, tenants, etc. for them, there are thousands of requests towards the IdP's metadata URL.

E.g. if we rotate the passwords for 100 technical users in a cluster with 15 nodes, 1,500 requests get fired towards the same metadata URL.

How can one reproduce the bug?

  1. Use the SAML example setup from here
  2. Bump the OpenSearch version to 1.3.1 and run docker-compose up -d.
  3. Run docker logs -f SAML-IdP to see only the logs of the IdP.
  4. Execute a request against the /_plugins/_security/ API.
    E.g. create a new internal user via /_plugins/_security/api/internalusers/foobar (see the sketch after this list).
  5. Check the IdP logs and you will see something like this:
    localhost:80 172.19.0.5 - - [17/May/2022:14:26:50 +0000] "GET /simplesaml/saml2/idp/metadata.php HTTP/1.1" 200 1737 "-" "Apache-HttpClient/4.5.13 (Java/11.0.14.1)"
    localhost:80 172.19.0.4 - - [17/May/2022:14:26:50 +0000] "GET /simplesaml/saml2/idp/metadata.php HTTP/1.1" 200 1737 "-" "Apache-HttpClient/4.5.13 (Java/11.0.14.1)"

  6. A single request gets fired for each OpenSearch node.
  7. Try it with other requests against the /_plugins/_security/ API.
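
A minimal way to trigger step 4 from the shell, assuming the example setup's self-signed certificates and the default admin:admin demo credentials (both assumptions, not spelled out above):

    # Create a new internal user "foobar"; a single PUT triggers one
    # metadata fetch from every node in the cluster.
    curl -k -u admin:admin \
      -X PUT "https://localhost:9200/_plugins/_security/api/internalusers/foobar" \
      -H 'Content-Type: application/json' \
      -d '{"password": "correct-horse-battery-staple"}'

    # In a second terminal, watch the IdP receive the metadata requests.
    docker logs -f SAML-IdP 2>&1 | grep metadata.php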

What is the expected behavior?
A single request towards the metadata URL, cached for 60 minutes (?!). The cache should only be invalidated by an update of the metadata URL or of the IdP config (entity_id, etc.) (?!)

Some thoughts
I'm not sure there has to be a request towards the metadata URL at all when a role or role mapping gets updated. Is this really necessary?
Could the metadata be stored in the cluster somehow, so that not every node has to fetch it on its own?
Can we have a proper cache?

Suspected root cause
I suspect that the "clear cache" method is called, which might also invalidate the metadata cache and therefore cause the requests.

What is your host/environment?

  • OS: k8s with OpenSearch images
  • Version: 1.3.1
  • Plugins: The default ones that are in the docker images.
pc-jedi added the bug and untriaged labels May 17, 2022
@pc-jedi
Author

pc-jedi commented May 18, 2022

@peternied I'm very sorry to just ping you here, but this bug is bothering us quite a lot.

@peternied
Member

Thanks for reaching out. @opensearch-project/security, could we have someone look into this?

@DarshitChanpura
Member

[Triage] Thank you for filing this. We'll get to this as soon as we can. In the meantime, any additional information you can provide would be very helpful.

DarshitChanpura removed the untriaged label May 20, 2022
@pc-jedi
Author

pc-jedi commented May 23, 2022

In the meantime, any additional information you can provide would be very helpful.

Is there any particular information you would like to see?

Meanwhile, I will test the issue with the 2.0 RC.

@pc-jedi
Author

pc-jedi commented May 24, 2022

After some intensive testing with the 2.0 RC I can confirm it shows the same behaviour.

Whenever you issue one of the following requests, the metadata is requested again (see the example below):

PUT    /_plugins/_security/api/internalusers/<user>
DELETE /_plugins/_security/api/cache
PUT    /_plugins/_security/api/roles/<role>
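
For example, the role update can be reproduced from the shell like this (the admin:admin demo credentials and the example role body are assumptions, for illustration only):

    # Update/create a role; each such write makes every node fetch
    # the SAML metadata from the IdP again.
    curl -k -u admin:admin \
      -X PUT "https://localhost:9200/_plugins/_security/api/roles/test_role" \
      -H 'Content-Type: application/json' \
      -d '{"cluster_permissions": ["cluster_composite_ops_ro"]}'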

@peternied
Member

@pc-jedi Thank you for your patience while we looked into this. I have confirmed this behavior: the SAML metadata endpoint is queried again on any configuration change. The easiest way to confirm it is a cache flush, docs.

This is the design of the security configuration index: upon any change, all nodes refresh and rebuild the configuration objects associated with the security configuration. While a more nuanced refresh pattern would be possible, the current one ensures the configuration is always verified. During the SAML initialization process a new HTTP client is created and the metadata is queried; there is no persistent cache, so the request is issued every time.

I briefly looked into how caching could be added, but I am concerned about our lack of testing in this space, where over-caching would be a considerable risk. While 1,500 requests is no small number, could you help me understand how this is impacting the stability of the cluster or of the SAML services?
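
The cache-flush confirmation looks like this from the shell (demo credentials and container name follow the reproduction steps above; the --since flag is just one way to narrow the log window):

    # Flush the security cache, then count fresh metadata fetches;
    # expect one GET per OpenSearch node.
    curl -k -u admin:admin -X DELETE "https://localhost:9200/_plugins/_security/api/cache"
    docker logs --since 1m SAML-IdP 2>&1 | grep -c "GET /simplesaml/saml2/idp/metadata.php"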

peternied added the help wanted label May 27, 2022
@pc-jedi
Author

pc-jedi commented Jun 3, 2022

I saw that the SamlHTTPMetadataResolver uses the cache-setting functions of the HTTPMetadataResolver it extends.

Nevertheless, we have 500-5000 clusters in our company, most of them using the internal IdP.
Since IdPs usually have something like intrusion detection in place, all alarms go off when there are around 100,000 requests per second towards the IdP.

Wouldn't it be possible to only invalidate/re-request the metadata when the metadata URL changes?
I see absolutely no reason to request the metadata again on all nodes just because a new role was introduced.

Besides that, we provision new internal tech users on those clusters quite a lot. That's why the number of changes to the .opendistro-security index is high.

A different question: is a high frequency of updates to the .opendistro-security index considered a "normal" use case? Will it impact performance a lot? And how do the "old versions" of that document get cleaned up? I think you always UPDATE the document, which will leave old versions in the file structure, right?
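
As a side note on that last question: superseded document versions show up as deleted docs in the index stats until Lucene's background segment merges reclaim them. One way to observe this (demo credentials assumed, for illustration):

    # docs.deleted counts the old versions still held in segments;
    # segment merges clean them up over time.
    curl -k -u admin:admin "https://localhost:9200/.opendistro-security/_stats/docs?pretty"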

@peternied
Member

Wouldn't it be possible to only invalidate/request the metadata if the metadata URL changes?

Unfortunately, with the architecture of the security configuration, any update triggers a full recreation of all configuration-derived objects. This is an issue because the internal IdP shares the same configuration as the rest of the security config. Believe me, I would very much like to fix this problem, but it's a large body of work.

A different question: is a high frequency of updates to the .opendistro-security index considered a "normal" use case? Will it impact performance a lot? And how do the "old versions" of that document get cleaned up? I think you always UPDATE the document, which will leave old versions in the file structure, right?

@cliu123 do you have context around the scale capabilities for the internal idP?

@pc-jedi
Author

pc-jedi commented Jun 29, 2022

Any news here?

We mitigated the issue by running a SAML metadata cache as a separate component: we rewrote the SAML metadata URL so it points to our cache, which then responds to the OpenSearch nodes.
That is far from perfect; a minimal sketch of the idea follows.
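
For illustration only, here is a sketch of such a caching proxy using nginx; the hostnames, port, and 60-minute cache duration are hypothetical, not our actual setup:

    # Hypothetical nginx config: cache the IdP metadata so OpenSearch
    # nodes hit the proxy instead of the IdP directly.
    cat > nginx.conf <<'EOF'
    events {}
    http {
      proxy_cache_path /var/cache/nginx keys_zone=saml_md:1m;
      server {
        listen 8080;
        location /simplesaml/saml2/idp/metadata.php {
          proxy_pass http://idp.internal.example:80;
          proxy_cache saml_md;
          proxy_cache_valid 200 60m;
        }
      }
    }
    EOF
    docker run -d -p 8080:8080 \
      -v "$PWD/nginx.conf:/etc/nginx/nginx.conf:ro" nginx
    # Then point idp.metadata_url in the SAML auth config at this proxy.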

Besides that, after changing the SAML metadata URL we still see some OpenSearch nodes trying to connect to the previous URL.
We could not find a clear indication of when this happens.
Are only clusters affected that were migrated/upgraded from OpenDistro, or does it only happen when a cluster had lots of changes on the .opendistro-security index?

@peternied
Member

No news on this front. At this point I am inclined to close this issue as by-design. I know this isn't the desired outcome, but it is a reality of the current implementation. I am open to examining proposals/pull requests that resolve these fundamental issues.
