
feat: use DynamicCRDController with Kong controllers #4619

Closed · 5 commits (not merged)

Conversation

czeslavo (Contributor) commented Sep 6, 2023

What this PR does / why we need it:

Use DynamicCRDController instead of a single ShouldEnableCRDController call on startup to decide whether the Kong CRDs' controllers should be run. This should fix the issue of these controllers not being started when there's a temporary API server connection problem (e.g. while waiting for an Istio sidecar to start up).
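For illustration only, a minimal sketch of the idea (the actual implementation lives in internal/controllers/crds/dynamic_controller.go; the type and function names below are simplified and not the real code): instead of a one-shot decision, a wrapper runnable keeps polling discovery and only registers the wrapped controller once its CRD is served.

```go
package crds

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	ctrl "sigs.k8s.io/controller-runtime"
)

// crdWaitingController delays registering an inner controller until its CRD is served
// by the API server, instead of deciding once at startup.
type crdWaitingController struct {
	gvr       schema.GroupVersionResource
	discovery discovery.DiscoveryInterface
	mgr       ctrl.Manager
	setup     func(ctrl.Manager) error // registers the real controller with the manager
}

// Start implements a manager runnable: it polls discovery and, once the CRD shows up,
// registers the wrapped controller so it can start reconciling.
func (c *crdWaitingController) Start(ctx context.Context) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if c.crdPresent() {
			return c.setup(c.mgr)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// crdPresent checks whether the CRD's group/version/resource is currently discoverable.
func (c *crdWaitingController) crdPresent() bool {
	resources, err := c.discovery.ServerResourcesForGroupVersion(c.gvr.GroupVersion().String())
	if err != nil {
		return false
	}
	for _, r := range resources.APIResources {
		if r.Name == c.gvr.Resource {
			return true
		}
	}
	return false
}
```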

Which issue this PR fixes:

Should fix #4618.

Special notes for your reviewer:

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

czeslavo self-assigned this Sep 6, 2023
czeslavo added the area/feature (New feature or request) label and removed the size/L label Sep 6, 2023
codecov bot commented Sep 6, 2023

Codecov Report

Patch coverage: 85.4% and project coverage change: +0.1% 🎉

Comparison is base (0e04e18) 68.0% compared to head (21f069b) 68.1%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4619     +/-   ##
=======================================
+ Coverage   68.0%   68.1%   +0.1%     
=======================================
  Files        163     163             
  Lines      19091   19067     -24     
=======================================
+ Hits       12992   12996      +4     
+ Misses      5331    5298     -33     
- Partials     768     773      +5     
Files Changed Coverage Δ
internal/cmd/rootcmd/run.go 19.0% <0.0%> (-2.1%) ⬇️
internal/manager/conditions.go 89.6% <ø> (-2.1%) ⬇️
internal/controllers/crds/dynamic_controller.go 73.8% <33.3%> (+56.3%) ⬆️
internal/manager/run.go 54.2% <69.4%> (+0.8%) ⬆️
internal/manager/controllerdef.go 98.7% <100.0%> (-0.2%) ⬇️
internal/manager/setup.go 75.5% <100.0%> (ø)

... and 3 files with indirect coverage changes


rainest (Contributor) commented Sep 7, 2023

This will still let routes go missing during the problem period.

Disabling the controllers doesn't disable the accompanying section of the parser, so in DB mode the instance will come up, disable some resource controller, see 0 such resources listed when the parser runs, and delete any previously-added configuration associated with it.

DB-less doesn't have any way to persist configuration across container starts, but this will still result in inconsistent configuration across the cluster: newer replicas will enter service without the associated configuration until the problem condition fixes itself.

Legacy DB-less is the only variant that should start without API server access, which is what's happening in the Istio case. For a somewhat contrived demonstration, you can follow the API server kill instructions with a legacy DB-less instance running, use crictl ps to get the controller container ID, and crictl stop <ID> to force a restart without a scheduler running. Although the controller can technically start those controllers if the CRDs become available, the parser will happily run (and send empty configuration) regardless.

I haven't tried to replicate the situation with the API server running but unreachable (which could presumably allow Pods to enter into service), but there's enough general weirdness and problem behavior from the imperfect parser+store/controller relationship that I think we need to just fail early.

pmalek (Member) commented Sep 8, 2023

I have not tested every possible scenario for this, but rather than abandoning this approach, my gut tells me we should pay close attention to the errors returned from the API server when e.g. looking up CRDs.

E.g. https://github.com/Kong/gateway-operator/pull/1059 tries to do just that.

An explicitly missing CRD (reported via a 404 status code with a discovery.ErrGroupDiscoveryFailed error) is one thing: report the missing CRD and disable the controller, or introduce a retry mechanism (like https://github.com/Kong/kubernetes-ingress-controller/blob/ccafa7ca9da7fb52ba959c2ebbc0974e22497b5b/internal/controllers/crds/dynamic_controller.go).

A network error is another thing (tested by blocking traffic through iptables), which can be reported as a *url.Error.
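As a rough sketch of that distinction (the helper below is hypothetical; discovery.IsGroupDiscoveryFailedError and *url.Error are the real client-go and net/url types):

```go
package crds

import (
	"errors"
	"fmt"
	"net/url"

	"k8s.io/client-go/discovery"
)

// classifyCRDLookupError is a hypothetical helper separating "CRD genuinely missing"
// from "API server unreachable", so the caller can react differently to each.
func classifyCRDLookupError(err error) error {
	if err == nil {
		return nil
	}
	// Group discovery failed (e.g. 404 for the CRD's group/version): the CRD is not
	// installed, so disabling the controller or retrying via DynamicCRDController fits.
	if discovery.IsGroupDiscoveryFailedError(err) {
		return fmt.Errorf("CRD not installed: %w", err)
	}
	// A network-level failure (e.g. traffic dropped by iptables) surfaces as *url.Error:
	// the CRD may well be installed, we just cannot reach the API server right now.
	var urlErr *url.Error
	if errors.As(err, &urlErr) {
		return fmt.Errorf("API server unreachable, CRD presence unknown: %w", err)
	}
	return err
}
```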

We could combine that with a retry/backoff mechanism that would os.Exit(1) KIC when the required CRDs (KIC's own) are not present. This way the end user gets a proper error (by means of a CrashLoopBackOff on KIC's Pod) and can inspect the problem.
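A rough sketch of what that could look like (all names illustrative; crdsPresent stands in for whatever lookup we'd use):

```go
package crds

import (
	"context"
	"os"
	"time"

	"github.com/go-logr/logr"
)

// exitIfRequiredCRDsMissing retries the CRD lookup with a capped backoff and terminates
// the process if KIC's own CRDs never show up, so the Pod ends up in CrashLoopBackOff
// with a clear error in the logs. crdsPresent is a hypothetical check.
func exitIfRequiredCRDsMissing(ctx context.Context, crdsPresent func(context.Context) bool, logger logr.Logger) {
	delay := time.Second
	deadline := time.Now().Add(5 * time.Minute)
	for time.Now().Before(deadline) {
		if crdsPresent(ctx) {
			return
		}
		logger.Info("required Kong CRDs not found yet, retrying", "delay", delay)
		select {
		case <-ctx.Done():
			os.Exit(1)
		case <-time.After(delay):
		}
		if delay *= 2; delay > 30*time.Second {
			delay = 30 * time.Second
		}
	}
	logger.Error(nil, "required Kong CRDs are not installed, exiting")
	os.Exit(1)
}
```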


I realized I can do this in kind's node container to block API server traffic:

docker exec -it colima-control-plane iptables -A INPUT -p tcp --destination-port 6443 -j DROP

then you can remove it via

docker exec -it colima-control-plane iptables -D INPUT -p tcp --destination-port 6443 -j DROP

This works when the API server runs on the default port (6443).

czeslavo (Contributor, Author) commented Sep 8, 2023

To simulate the network issues in code, I have added an envtest test, TestManagerDoesntStartUntilKubernetesAPIReachable, in which the Kubernetes API server is proxied and can be interrupted.
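Roughly, the approach looks like this (a simplified illustration, not the actual test code): the manager is pointed at a local TCP proxy in front of the envtest API server, and the proxy can be paused to simulate a temporary connectivity problem.

```go
package manager_test

import (
	"io"
	"net"
	"sync"
)

// interruptibleProxy forwards TCP traffic to the real API server and can be paused
// to make the API server appear unreachable to the manager under test.
type interruptibleProxy struct {
	listener net.Listener
	target   string // address of the envtest API server
	mu       sync.Mutex
	paused   bool
}

// SetPaused toggles whether new connections are dropped.
func (p *interruptibleProxy) SetPaused(paused bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.paused = paused
}

// Serve accepts connections and either drops them (paused) or forwards them.
func (p *interruptibleProxy) Serve() {
	for {
		conn, err := p.listener.Accept()
		if err != nil {
			return // listener closed, stop serving
		}
		p.mu.Lock()
		paused := p.paused
		p.mu.Unlock()
		if paused {
			conn.Close() // simulate the API server being unreachable
			continue
		}
		go p.forward(conn)
	}
}

// forward copies bytes between the client and the upstream API server.
func (p *interruptibleProxy) forward(client net.Conn) {
	upstream, err := net.Dial("tcp", p.target)
	if err != nil {
		client.Close()
		return
	}
	go func() { io.Copy(upstream, client); upstream.Close() }()
	io.Copy(client, upstream)
	client.Close()
}
```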

After thinking more about it, based on Travis' points, I think the most robust way to prevent issues like #4603 and #4207 would indeed be to verify Kubernetes API connectivity on startup, before running any components (controllers, parser, etc.).
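A minimal sketch of such a startup check, assuming a client-go setup (waitForKubernetesAPI is a hypothetical helper, not the final implementation):

```go
package manager

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// waitForKubernetesAPI blocks until the API server answers a version request or the
// context is cancelled, so that no component starts against an unreachable cluster.
func waitForKubernetesAPI(ctx context.Context, cfg *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		if _, err := dc.ServerVersion(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("kubernetes API never became reachable: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}
```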

Having that + DynamicCRDController should cover all potential problems I can think of:

  1. Leader election on, controller runs, API server connection gets lost (like in Ingress controller deletes resources during control plane outage #4207) - KIC restarts and waits for API server connectivity, restarting until it succeeds.
  2. Leader election off, API server not reachable, controller starts up (like in Race condition with Istio sidecar prevents KIC to startup correctly #4603) - KIC restarts until it can connect to the API server.
  3. Controller starts up with API server connectivity, no CRDs installed at startup, but installed during runtime - DynamicCRDController starts the controllers on CRD installation.

We could combine that with a retry backoff mechanism which would os.Exit(1) KIC when required CRDs (KIC's own) are not present. This way the end user gets a proper error (through means of CrashLoopBackOff on KIC's Pod) and can inspect the problem.

Yeah, if we agree that all KIC CRDs are required, then instead of using DynamicCRDController for them we could just explicitly error out if they're not present.

czeslavo (Contributor, Author)
Closing this in favor of #4641.

czeslavo closed this Sep 11, 2023
Labels: area/feature (New feature or request), size/XL
Linked issue this PR may close: Support retry when looking up installed Kong CRDs on startup or implement DynamicCRDController for those
3 participants