
Race condition between secret reconciler and object reference index #5175

Closed
1 task done
backjo opened this issue Nov 16, 2023 · 6 comments · Fixed by #5238
Assignees: randmonkey
Labels: bug (Something isn't working), pending author feedback

Comments

@backjo (Contributor) commented Nov 16, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We recently noticed errors similar to #4672, where our secrets were getting "Not Found" errors despite existing and being referenced by KongConsumer credentials. After checking our CRDs as mentioned in that issue, we added some custom logging to understand why our secrets were not getting populated into the cache. The logging revealed that the Secrets controller was evaluating reference checks before the KongConsumer controller had reconciled the consumers and populated the references, resulting in 0 secrets being reconciled.
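To make the ordering problem concrete, here is a conceptual sketch (not the actual controller code; the index and names are invented) of a reference check that runs before the index has been populated:

```go
package main

import "fmt"

// refIndex maps a Secret's "namespace/name" to the KongConsumers that
// reference it via credentials. In the real controller an index like this is
// populated while KongConsumers are reconciled.
type refIndex map[string][]string

// shouldCacheSecret mimics the reference check: a Secret is only kept when
// something references it.
func shouldCacheSecret(idx refIndex, secretKey string) bool {
	return len(idx[secretKey]) > 0
}

func main() {
	idx := refIndex{}

	// The Secret reconciler runs first: the index is still empty, so the
	// Secret backing a KongConsumer credential looks unreferenced and is
	// skipped, i.e. 0 secrets end up in the cache.
	fmt.Println(shouldCacheSecret(idx, "default/consumer-cred")) // false

	// Only afterwards does the KongConsumer reconciler record the reference.
	idx["default/consumer-cred"] = append(idx["default/consumer-cred"], "default/my-consumer")
	fmt.Println(shouldCacheSecret(idx, "default/consumer-cred")) // true
}
```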

Expected Behavior

All initial resources should be loaded before calculating references.

Steps To Reproduce

No response

Kong Ingress Controller version

3.0.0, but appears to be happening in 2.11.* as well

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.9", GitCommit:"a1a87a0a2bcd605820920c6b0e618a8ab7d117d4", GitTreeState:"clean", BuildDate:"2023-04-12T12:16:51Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.10-eks-4f4795d", GitCommit:"164dfb62db432c0b28a1fced3956256af68533b6", GitTreeState:"clean", BuildDate:"2023-10-20T23:21:27Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Anything else?

Similar to #4672

backjo added the bug (Something isn't working) label Nov 16, 2023
randmonkey self-assigned this Nov 17, 2023
@randmonkey (Contributor) commented Nov 17, 2023

I checked the code. When a KongConsumer gets reconciled, the controller retrieves the secrets referred to by its credentials, fetches them from the cluster, and then adds them to the cache.
This satisfies eventual consistency: the KongConsumer and the Secrets used in its credentials are all in the cache after the KongConsumer has been reconciled.
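A rough sketch of that flow with controller-runtime (not the actual KIC implementation; the helper signature and the store callback are invented for illustration):

```go
package consumerref

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileConsumerSecrets sketches the behavior described above: for each
// credential Secret name listed on a KongConsumer, fetch the Secret from the
// cluster and hand it to store (a stand-in for adding it to the cache). If a
// Secret is not visible yet, request a requeue so the reconciler converges.
func reconcileConsumerSecrets(
	ctx context.Context,
	cl client.Client,
	namespace string,
	credentialSecretNames []string,
	store func(*corev1.Secret),
) (ctrl.Result, error) {
	for _, name := range credentialSecretNames {
		var secret corev1.Secret
		key := types.NamespacedName{Namespace: namespace, Name: name}
		if err := cl.Get(ctx, key, &secret); err != nil {
			if apierrors.IsNotFound(err) {
				// Secret not in the cluster (or not visible yet): try again later.
				return ctrl.Result{Requeue: true}, nil
			}
			return ctrl.Result{}, err
		}
		store(&secret)
	}
	return ctrl.Result{}, nil
}
```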
What unexpected behaviors did you find other than the "Not Found" logs?

@backjo (Contributor, Author) commented Nov 18, 2023

Hey @randmonkey, the main unexpected behavior we observe is consistent with the credentials not being loaded, namely that our ACL plugin blocks requests for ~5-10 minutes after startup because the presented credential is not found.

@backjo (Contributor, Author) commented Nov 18, 2023

Going to close this until I can provide more debug info

backjo closed this as completed Nov 18, 2023
@backjo (Contributor, Author) commented Nov 27, 2023

Hi @randmonkey, I did some further digging here, and the behavior I can generally reproduce is:

  1. Controller starts up and reconciles KongConsumer objects. Secrets have not yet been loaded, so it requeues the reconcile operation.
  2. Controller writes initial configuration to the Kong Admin API - without information from Secret objects. For us, this means that it doesn't write the relevant "groups" information from credentials for the ACL plugin to function.
  3. KongConsumer objects are reconciled on the requeue, which loads the credentials successfully.
  4. Controller writes a second configuration to the Kong Admin API with updated groups.

From the logs we put in place, the second update to the Admin API happens a few seconds after the first, but for the few seconds between the first and second writes we have an effectively 'broken' configuration.
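A toy illustration of why that window is broken (not KIC code; the cache shape and group name are invented):

```go
package main

import "fmt"

func main() {
	// consumer -> ACL groups, normally filled in by the KongConsumer
	// reconciler from credential Secrets.
	groupsByConsumer := map[string][]string{}

	push := func(n int) {
		fmt.Printf("config push #%d: %d consumers carry ACL groups\n", n, len(groupsByConsumer))
	}

	// Step 2 above: the first push to the Admin API fires before the requeued
	// reconcile has loaded the credential Secrets, so no groups are present
	// and the ACL plugin rejects otherwise-valid requests.
	push(1)

	// Steps 3-4: the requeued reconcile loads the Secret a few seconds later
	// and the next push repairs the configuration.
	groupsByConsumer["default/my-consumer"] = []string{"admins"}
	push(2)
}
```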

backjo reopened this Nov 27, 2023
@backjo (Contributor, Author) commented Nov 27, 2023

This seems fixable by #2249; maybe the best intermediate solution here is to make InitCacheSyncDuration configurable instead of hard-coding it to 5s.
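For example, the wait could be exposed as a duration flag rather than a constant; the flag name and wiring below are hypothetical, just to show the shape of the change:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Config carries the synchronizer settings; only the relevant field is shown.
type Config struct {
	// InitCacheSyncDuration is how long the synchronizer waits after startup
	// before pushing the first configuration to the Kong Admin API.
	InitCacheSyncDuration time.Duration
}

func main() {
	var cfg Config
	// Hypothetical flag: defaults to the previously hard-coded 5 seconds but
	// lets operators widen the window on slow-to-sync clusters.
	flag.DurationVar(&cfg.InitCacheSyncDuration, "init-cache-sync-duration",
		5*time.Second, "time to wait for caches to sync before the first config push")
	flag.Parse()

	fmt.Println("initial cache sync wait:", cfg.InitCacheSyncDuration)
}
```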

@backjo (Contributor, Author) commented Nov 27, 2023

Ah, I think there was a regression of the #2249 fix here. In #4101, InitCacheSyncDuration started being passed in when the synchronizer is created, but InitCacheSyncDuration is never initialized to anything, whereas previously DefaultCacheSyncWaitDuration in synchronizer.go was initialized to 5 seconds.
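A sketch of that regression pattern (names follow this discussion, the wiring is simplified): once the wait comes from a config field instead of the in-package default, leaving the field unset means the Go zero value, i.e. no wait at all.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultCacheSyncWaitDuration is the old in-package default added by #2249.
const DefaultCacheSyncWaitDuration = 5 * time.Second

// Config mirrors the post-#4101 shape: the wait is passed in from outside.
type Config struct {
	InitCacheSyncDuration time.Duration // never set anywhere -> zero value
}

type Synchronizer struct {
	initWait time.Duration
}

// NewSynchronizer takes the wait from Config. If the caller never sets
// InitCacheSyncDuration, initWait is 0 and the first config push happens
// before the cache has had a chance to sync.
func NewSynchronizer(cfg Config) *Synchronizer {
	return &Synchronizer{initWait: cfg.InitCacheSyncDuration}
}

func main() {
	var cfg Config // InitCacheSyncDuration left unset
	s := NewSynchronizer(cfg)
	fmt.Println("wait before first push:", s.initWait) // 0s instead of the old 5s default
}
```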
