
feat: use DynamicCRDController with Kong controllers #4619

Closed · 5 commits (not merged)

Conversation

czeslavo (Contributor) commented Sep 6, 2023

What this PR does / why we need it:

Use DynamicCRDController instead of a single ShouldEnableCRDController call on startup to decide whether the Kong CRDs' controllers should be run. This should fix the issue of these controllers not being started when there's a temporary API server connection problem (e.g. while waiting for an Istio sidecar to start up).
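For illustration only, a minimal sketch of the idea (the actual implementation lives in internal/controllers/crds/dynamic_controller.go; the type and function names below are simplified and not the real code): instead of a one-shot decision, a wrapper runnable keeps polling discovery and only registers the wrapped controller once its CRD is served.

```go
package crds

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	ctrl "sigs.k8s.io/controller-runtime"
)

// crdWaitingController delays registering an inner controller until its CRD is served
// by the API server, instead of deciding once at startup.
type crdWaitingController struct {
	gvr       schema.GroupVersionResource
	discovery discovery.DiscoveryInterface
	mgr       ctrl.Manager
	setup     func(ctrl.Manager) error // registers the real controller with the manager
}

// Start implements a manager runnable: it polls discovery and, once the CRD shows up,
// registers the wrapped controller so it can start reconciling.
func (c *crdWaitingController) Start(ctx context.Context) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if c.crdPresent() {
			return c.setup(c.mgr)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// crdPresent checks whether the CRD's group/version/resource is currently discoverable.
func (c *crdWaitingController) crdPresent() bool {
	resources, err := c.discovery.ServerResourcesForGroupVersion(c.gvr.GroupVersion().String())
	if err != nil {
		return false
	}
	for _, r := range resources.APIResources {
		if r.Name == c.gvr.Resource {
			return true
		}
	}
	return false
}
```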

Which issue this PR fixes:

Should fix #4618.

Special notes for your reviewer:

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

czeslavo self-assigned this Sep 6, 2023
czeslavo added the area/feature (New feature or request) label and removed the size/L label Sep 6, 2023
codecov bot commented Sep 6, 2023

Codecov Report

Patch coverage: 85.4% and project coverage change: +0.1% 🎉

Comparison is base (0e04e18) 68.0% compared to head (21f069b) 68.1%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4619     +/-   ##
=======================================
+ Coverage   68.0%   68.1%   +0.1%     
=======================================
  Files        163     163             
  Lines      19091   19067     -24     
=======================================
+ Hits       12992   12996      +4     
+ Misses      5331    5298     -33     
- Partials     768     773      +5     
Files Changed Coverage Δ
internal/cmd/rootcmd/run.go 19.0% <0.0%> (-2.1%) ⬇️
internal/manager/conditions.go 89.6% <ø> (-2.1%) ⬇️
internal/controllers/crds/dynamic_controller.go 73.8% <33.3%> (+56.3%) ⬆️
internal/manager/run.go 54.2% <69.4%> (+0.8%) ⬆️
internal/manager/controllerdef.go 98.7% <100.0%> (-0.2%) ⬇️
internal/manager/setup.go 75.5% <100.0%> (ø)

... and 3 files with indirect coverage changes


rainest (Contributor) commented Sep 7, 2023

This will still let routes go missing during the problem period.

Disabling the controllers doesn't disable the accompanying section of the parser, so in DB mode the instance will come up, disable some resource controller, see 0 such resources listed when the parser runs, and delete any previously-added configuration associated with it.

DB-less doesn't have any way to persist configuration across container starts, but this will still result in inconsistent configuration across the cluster: newer replicas will enter service without the associated configuration until the problem condition fixes itself.

Legacy DB-less is the only variant that should start without API server access, which is what's happening in the Istio case. For a somewhat contrived demonstration, you can follow the API server kill instructions with a legacy DB-less instance running, use crictl ps to get the controller container ID, and crictl stop <ID> to force a restart without a scheduler running. Although the controller can technically start those controllers if the CRDs become available, the parser will happily run (and send empty configuration) regardless.

I haven't tried to replicate the situation with the API server running but unreachable (which could presumably allow Pods to enter into service), but there's enough general weirdness and problem behavior from the imperfect parser+store/controller relationship that I think we need to just fail early.

pmalek (Member) commented Sep 8, 2023

I have not tested every possible scenario for this, but rather than abandoning this approach, my gut tells me we should pay close attention to the errors returned from the API server when e.g. looking up CRDs.

E.g. https://github.com/Kong/gateway-operator/pull/1059 tries to do just that.

An explicitly missing CRD (reported via a 404 status code with a discovery.ErrGroupDiscoveryFailed error) is one thing: report the missing CRD and disable the controller, or introduce a retry mechanism (like https://github.com/Kong/kubernetes-ingress-controller/blob/ccafa7ca9da7fb52ba959c2ebbc0974e22497b5b/internal/controllers/crds/dynamic_controller.go).

A network error is another thing (tested by blocking traffic through iptables), which can be reported as a *url.Error.
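As a rough sketch of that distinction (the helper below is hypothetical; discovery.IsGroupDiscoveryFailedError and *url.Error are the real client-go and net/url types):

```go
package crds

import (
	"errors"
	"fmt"
	"net/url"

	"k8s.io/client-go/discovery"
)

// classifyCRDLookupError is a hypothetical helper separating "CRD genuinely missing"
// from "API server unreachable", so the caller can react differently to each.
func classifyCRDLookupError(err error) error {
	if err == nil {
		return nil
	}
	// Group discovery failed (e.g. 404 for the CRD's group/version): the CRD is not
	// installed, so disabling the controller or retrying via DynamicCRDController fits.
	if discovery.IsGroupDiscoveryFailedError(err) {
		return fmt.Errorf("CRD not installed: %w", err)
	}
	// A network-level failure (e.g. traffic dropped by iptables) surfaces as *url.Error:
	// the CRD may well be installed, we just cannot reach the API server right now.
	var urlErr *url.Error
	if errors.As(err, &urlErr) {
		return fmt.Errorf("API server unreachable, CRD presence unknown: %w", err)
	}
	return err
}
```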

We could combine that with a retry/backoff mechanism that would os.Exit(1) KIC when the required CRDs (KIC's own) are not present. This way the end user gets a proper error (by means of a CrashLoopBackOff on KIC's Pod) and can inspect the problem.
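A rough sketch of what that could look like (all names illustrative; crdsPresent stands in for whatever lookup we'd use):

```go
package crds

import (
	"context"
	"os"
	"time"

	"github.com/go-logr/logr"
)

// exitIfRequiredCRDsMissing retries the CRD lookup with a capped backoff and terminates
// the process if KIC's own CRDs never show up, so the Pod ends up in CrashLoopBackOff
// with a clear error in the logs. crdsPresent is a hypothetical check.
func exitIfRequiredCRDsMissing(ctx context.Context, crdsPresent func(context.Context) bool, logger logr.Logger) {
	delay := time.Second
	deadline := time.Now().Add(5 * time.Minute)
	for time.Now().Before(deadline) {
		if crdsPresent(ctx) {
			return
		}
		logger.Info("required Kong CRDs not found yet, retrying", "delay", delay)
		select {
		case <-ctx.Done():
			os.Exit(1)
		case <-time.After(delay):
		}
		if delay *= 2; delay > 30*time.Second {
			delay = 30 * time.Second
		}
	}
	logger.Error(nil, "required Kong CRDs are not installed, exiting")
	os.Exit(1)
}
```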


I realized I can do this in kind's node container to block API server traffic:

docker exec -it colima-control-plane iptables -A INPUT -p tcp --destination-port 6443 -j DROP

then you can remove it via

docker exec -it colima-control-plane iptables -D INPUT -p tcp --destination-port 6443 -j DROP

This works when the API server runs on the default port (6443).

czeslavo (Contributor, Author) commented Sep 8, 2023

To simulate the network issues in code, I have added an envtest test, TestManagerDoesntStartUntilKubernetesAPIReachable, in which the Kubernetes API server is proxied and can be interrupted.
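Roughly, the approach looks like this (a simplified illustration, not the actual test code): the manager is pointed at a local TCP proxy in front of the envtest API server, and the proxy can be paused to simulate a temporary connectivity problem.

```go
package manager_test

import (
	"io"
	"net"
	"sync"
)

// interruptibleProxy forwards TCP traffic to the real API server and can be paused
// to make the API server appear unreachable to the manager under test.
type interruptibleProxy struct {
	listener net.Listener
	target   string // address of the envtest API server
	mu       sync.Mutex
	paused   bool
}

// SetPaused toggles whether new connections are dropped.
func (p *interruptibleProxy) SetPaused(paused bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.paused = paused
}

// Serve accepts connections and either drops them (paused) or forwards them.
func (p *interruptibleProxy) Serve() {
	for {
		conn, err := p.listener.Accept()
		if err != nil {
			return // listener closed, stop serving
		}
		p.mu.Lock()
		paused := p.paused
		p.mu.Unlock()
		if paused {
			conn.Close() // simulate the API server being unreachable
			continue
		}
		go p.forward(conn)
	}
}

// forward copies bytes between the client and the upstream API server.
func (p *interruptibleProxy) forward(client net.Conn) {
	upstream, err := net.Dial("tcp", p.target)
	if err != nil {
		client.Close()
		return
	}
	go func() { io.Copy(upstream, client); upstream.Close() }()
	io.Copy(client, upstream)
	client.Close()
}
```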

After thinking more about it, based on Travis' points, I think the most robust way to prevent issues like #4603 and #4207 would indeed be to verify Kubernetes API connectivity on startup, before running any components (controllers, parser, etc.).
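A minimal sketch of such a startup check, assuming a client-go setup (waitForKubernetesAPI is a hypothetical helper, not the final implementation):

```go
package manager

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// waitForKubernetesAPI blocks until the API server answers a version request or the
// context is cancelled, so that no component starts against an unreachable cluster.
func waitForKubernetesAPI(ctx context.Context, cfg *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		if _, err := dc.ServerVersion(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("kubernetes API never became reachable: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}
```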

Having that + DynamicCRDController should cover all potential problems I can think of:

  1. Leader election on, controller runs, API server connection gets lost (like in Ingress controller deletes resources during control plane outage #4207) - KIC restarts and waits for API server connectivity, restarting until it succeeds.
  2. Leader election off, API server not reachable, controller starts up (like in Race condition with Istio sidecar prevents KIC to startup correctly #4603) - KIC restarts until it can connect to the API server.
  3. Controller starts up with API server connectivity, no CRDs installed at startup, but installed during runtime - DynamicCRDController starts the controllers on CRD installation.

We could combine that with a retry backoff mechanism which would os.Exit(1) KIC when required CRDs (KIC's own) are not present. This way the end user gets a proper error (through means of CrashLoopBackOff on KIC's Pod) and can inspect the problem.

Yeah, if we agree that all KIC CRDs are required, then instead of using DynamicCRDController for them we could just explicitly error out if they're not present.

czeslavo (Contributor, Author)
Closing this in favor of #4641.

czeslavo closed this Sep 11, 2023
Labels: area/feature (New feature or request), size/XL
Linked issue this PR may close: Support retry when looking up installed Kong CRDs on startup or implement DynamicCRDController for those
3 participants