
KIC 2.0 final controller manager flag checkup #1580

Closed
4 tasks done
shaneutt opened this issue Jul 22, 2021 · 12 comments
Assignees
Labels
area/maintenance Cleanup, refactoring, and other maintenance improvements that don't change functionality. priority/high

Comments

@shaneutt
Contributor

shaneutt commented Jul 22, 2021

Is there an existing issue for this?

  • I have searched the existing issues

Problem Statement

Before we release the KIC 2.0 beta, we need to do one last pass over all the controller-related options for the manager, in particular the enablement settings available for each, and either change the behavior or ensure consistency of the behavior under the three-setting approach.

Additional information

No response

Acceptance Criteria

  • as a contributor clear and present documentation exists which defines the configuration options for enabling and disabling specific controllers
  • as an end-user the flags for enabling/disabling controllers are documented and easy to understand
  • as an end-user the flags for enabling/disabling controllers are consistent across API types (or otherwise strongly documented where exceptions were made)
@shaneutt shaneutt added priority/high area/maintenance Cleanup, refactoring, and other maintenance improvements that don't change functionality. labels Jul 22, 2021
@mflendrich
Contributor

What keeps me unhappy about

make controller enablement flags boolean

is that this way we'll lose the distinction between "fail if desired but not working" and "ignore silently if not working".

@shaneutt can you think of a solution that satisfies the reason behind your intention to switch from enabled/auto/disabled into booleans and my concern?

@shaneutt
Contributor Author

What keeps me unhappy about

make controller enablement flags boolean

is that this way we'll lose the distinction between "fail if desired but not working" and "ignore silently if not working".

@shaneutt can you think of a solution that satisfies the reason behind your intention to switch from enabled/auto/disabled into booleans and my concern?

Would you mind describing the problem to solve in terms of acceptance criteria (end result, UX, etc.) so that I can try to make adjustments to satisfy it?

@mflendrich
Contributor

Acceptance criteria that come to my mind are:

  • allow the user to control (or, at the very least, to clearly understand) whether they're enabling a controller unconditionally (so that it will be treated as a runtime error if the controller fails to start up) or conditionally (so that external factors, such as the absence of a CRD, will dynamically and silently suppress enablement)
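For illustration only (none of these names come from KIC's actual code), the ternary enablement being discussed could be sketched in Go as:

```go
package main

import "fmt"

// EnablementState is a hypothetical three-state controller flag:
// "enabled" fails hard if the controller's prerequisite API is missing,
// "auto" silently suppresses enablement when it is missing, and
// "disabled" never starts the controller.
type EnablementState string

const (
	StateEnabled  EnablementState = "enabled"
	StateAuto     EnablementState = "auto"
	StateDisabled EnablementState = "disabled"
)

// ShouldStart decides whether a controller should start given whether
// its prerequisite API (e.g. a CRD) is present in the cluster.
func (s EnablementState) ShouldStart(apiPresent bool) (bool, error) {
	switch s {
	case StateEnabled:
		if !apiPresent {
			return false, fmt.Errorf("controller enabled but required API is missing")
		}
		return true, nil
	case StateAuto:
		// External factors (such as a missing CRD) silently suppress enablement.
		return apiPresent, nil
	default: // StateDisabled
		return false, nil
	}
}

func main() {
	start, _ := StateAuto.ShouldStart(false)
	fmt.Println(start) // false: auto silently skips when the API is absent
	_, err := StateEnabled.ShouldStart(false)
	fmt.Println(err != nil) // true: enabled with a missing API is a hard error
}
```

This makes the distinction concrete: only "enabled" can surface a runtime error, and only "auto" can silently no-op.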

@shaneutt
Contributor Author

I would argue that making these booleans, and then failing the program entirely when true but the API is missing would satisfy:

  • as an end-user I can control whether or not to enable a controller
  • as an end-user I can understand when a controller I have enabled fails due to a missing API

Which to me seems to be along the same lines as you're suggesting. I'm personally not clear on the value of "maybe enable this if" from the end-user perspective, though I'm trying to remain open to suggestion. In the future, though, I can see the value of "eventually" (as opposed to maybe) enabling this, re: #1449

I want the behavior when KIC 2.0 launches to be clear and crisp: you either want a controller/API or you don't. And if you do, but the API isn't available in the cluster, we signal red alert and action needs to be taken.
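As a hypothetical sketch of this "red alert" proposal (the function and map shapes are illustrative, not KIC's real code): each controller has a plain boolean, and wanting a controller whose API is absent aborts startup.

```go
package main

import "fmt"

// checkControllers sketches the boolean proposal: a controller is
// either wanted or not, and wanting a controller whose API is absent
// from the cluster is a fatal startup error.
func checkControllers(enabled, apisPresent map[string]bool) error {
	for name, want := range enabled {
		if want && !apisPresent[name] {
			return fmt.Errorf("controller %q is enabled but its API is not available in the cluster", name)
		}
	}
	return nil
}

func main() {
	enabled := map[string]bool{"ingress": true, "knative": true}
	apis := map[string]bool{"ingress": true, "knative": false}
	if err := checkControllers(enabled, apis); err != nil {
		// In a real manager this would abort startup entirely.
		fmt.Println("startup aborted:", err)
	}
}
```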

@shaneutt
Contributor Author

I figured I could potentially communicate better via code, here's a draft PR to illustrate and use as a conversation basis:

#1582

It was quick and easy to make, so we can iterate on it, close it, etc., whatever seems to fit best.

@shaneutt
Contributor Author

Given a zoom conversation with @mflendrich I've removed the original proposed solution from the description and it can instead be seen in #1582 (which I think we're ultimately going to decline as a solution, but for posterity).

@rainest
Contributor

rainest commented Jul 27, 2021

While I think the technical approach to AutoHandlers in #1585 is sound, I do think the option to change to booleans makes more sense for UX.

On our end, we want to be opportunistic with optional functionality. Not everyone has Knative installed, but for those users that do, we want to make our support readily available. We default its controller on as such, but that introduces an implementation challenge, since loading it when Knative isn't installed results in a startup failure.

Failing to start with a default configuration in the many clusters that don't have Knative is bad UX, so we instead use feature detection and degrade gracefully, disabling it automatically. We can reasonably assume that users who don't have Knative installed are okay with that controller not loading (after all, it wouldn't do anything if it did). We're not ignoring user intent, since attempting to load that controller is a default behavior: the user hasn't explicitly instructed us to load it, we just try to be opportunistic. We furthermore do log what actually happens, so this isn't silent.

There shouldn't be a case where you need to override the auto-loader and force us to load the controller, since it just results in a startup error. If we did provide the option, changing from auto to enabled when your cluster doesn't have the required CRD effectively functions as a --do-not-start flag, which is of little use. You shouldn't need the controller to tell you that you don't meet the requirements, since that should be obvious otherwise--you would not, for example, be confused about the controller not loading Knative Ingresses if you don't have it installed, since you also wouldn't be able to create any Knative CRs in the first place.

Booleans are simpler to understand and allow us to present a more uniform config for all controllers: each only has the single option (disable it) and the behavior is consistent across all controllers (the controller just doesn't load). Auto is harder to understand since it's not available everywhere (most controllers just don't support it) and is handled differently when it is available (the Knative/KCP auto-disable versus the Ingress version negotiation).
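The opportunistic, logged degradation described here could look roughly like the following sketch. In a real manager the lookup would go through the Kubernetes discovery client; a plain map stands in so the example is self-contained, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"log"
)

// maybeEnable sketches opportunistic enablement: the controller
// defaults on, but a missing CRD degrades gracefully and is logged
// rather than treated as a startup failure.
func maybeEnable(installedCRDs map[string]bool, controller, crd string) bool {
	if !installedCRDs[crd] {
		log.Printf("controller %s: CRD %s not found, skipping (not an error)", controller, crd)
		return false
	}
	log.Printf("controller %s: enabled", controller)
	return true
}

func main() {
	installed := map[string]bool{} // a cluster without Knative
	on := maybeEnable(installed, "knative", "ingresses.networking.internal.knative.dev")
	fmt.Println(on) // false, with the reason logged rather than fatal
}
```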

@shaneutt
Contributor Author

I do think the option to change to booleans makes more sense for UX.

I do tend to agree with @rainest on this point. And to add on that we can always add options in the future to add specific "automatic" controller features, but more explicitly and only when we're sure we need them. We're committed to more than I think we need to be in 2.0 with having this flag represent 3 possible states, and one of them reflecting "automatic setup" which I'm just not sure I feel is clearly defined enough yet.

@mflendrich
Contributor

I agree with the general spirit of your messages @rainest @shaneutt (that a boolean would make for optimal UX) but the devil is in the details:

  • If we meld "auto" with "enabled" today (and we don't have the "auto" logic for most of the controllers now), the UX will be such that for some controllers "enabling" will fail softly if the appropriate resource is not present in the cluster; for other controllers (that don't have the logic implemented today - I assume that we'd be using an "always enable" stub there) it will be a failure: either the controller perpetually spamming with errors, or failing to start up at all.
  • "Soft failure" on controller initialization is an unfortunate failure mode, as it can mean data loss in Kong on controller upgrade. Suppose that for some minute reason (e.g. the CRD has a wrong apiVersion, incompatible with KIC) KIC decides not to enable some controller. This will lead to the proxy sync mechanism eventually wiping from Kong the part of config that was put there by a previous version of KIC, causing unintended misconfiguration.

I think that the cleanup happening in #1585 should be uncontroversial (regardless of whether we go for binary or ternary enablement statuses, we need the logic across controllers to be consistent, and 1585 brings that consistency to AutoHandlers). If we opt for switching from ternary to binary, that could be technically implemented by dropping the enabled path and having an AutoHandler (maybe appropriately renamed) for every controller - possibly a degenerate return true AutoHandler.
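A minimal sketch of this per-controller AutoHandler idea, assuming illustrative names rather than KIC's real types: every controller carries a handler, and unconditional controllers use a degenerate always-true one, keeping the decision logic uniform.

```go
package main

import "fmt"

// AutoHandler decides at startup whether a controller should actually run.
type AutoHandler func() bool

type Controller struct {
	Name      string
	Enabled   bool        // the user-facing boolean flag
	ShouldRun AutoHandler // conditional logic; alwaysRun if none
}

// alwaysRun is the degenerate "return true" handler for controllers
// that are always wanted when enabled.
func alwaysRun() bool { return true }

// setupControllers starts every controller whose flag is on and whose
// AutoHandler agrees, so all controllers go through the same path.
func setupControllers(controllers []Controller) []string {
	var started []string
	for _, c := range controllers {
		if c.Enabled && c.ShouldRun() {
			started = append(started, c.Name)
		}
	}
	return started
}

func main() {
	cs := []Controller{
		{Name: "ingress", Enabled: true, ShouldRun: alwaysRun},
		{Name: "knative", Enabled: true, ShouldRun: func() bool { return false }}, // e.g. CRD absent
		{Name: "udpingress", Enabled: false, ShouldRun: alwaysRun},
	}
	fmt.Println(setupControllers(cs)) // [ingress]
}
```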

@rainest
Contributor

rainest commented Jul 29, 2021

some controllers "enabling" will fail softly ... for other controllers ... it will be a failure

IMO this is fine. We have inconsistency somewhere--either we have controllers that behave differently when you set enabled or we have controllers that don't allow you to set auto (or if we add it to all, where setting auto has no useful purpose, and simply lets you hit the problem scenario in the second bullet). Absent a perfectly consistent solution that handles the split between controllers we definitely want and controllers we maybe want, I'd lean towards the more consistent config surface, since that's what most users will see.

can mean data loss in Kong on controller upgrade

Reasonable, though we should have a better means of alerting users to this before they attempt an upgrade and it fails to complete/leaves the old ReplicaSet running. Deprecation warnings can go unheeded, and a global "fail if CRDs missing" flag isn't really going to help if we use an ignore-if-absent AutoHandler throughout (there's no way it plays nice with the CRDs we actually want to ignore when missing).

For controllers where we don't ever expect the CRD to be absent, we get a solution (albeit the rather clunky "upgrade fails" solution) for free if we never add a true AutoHandler. For those where we want to maybe load, I'm not sure what a good solution is.

the cleanup happening in #1585 should be uncontroversial (regardless whether we go for binary or ternary enablement statuses ...

Agreed--that sounds fine; we can make the refactor to make the AutoHandler function a controller property rather than special cases in setupControllers(), and condense the setting to a boolean after, so I'll proceed with reviewing that.

@rainest
Contributor

rainest commented Aug 3, 2021

Rephrasing some of my responses from earlier since hypothetical UX stuff isn't always the easiest to parse and benefits from different explanations, and because I've had more time to think over it:

Enabling can result in a soft failure

I'd argue this shouldn't be framed as a failure, at least from a user perspective. In a strict technical sense we fail to load an enabled controller, but that occurs inside our implementation, and you, the user, don't care.

The net effect of our failing to load the Knative controller is that we cannot generate configuration from any Knative resources. This is fine! There are no Knative resources to load! You can't possibly have created any, because you haven't installed the CRDs necessary to create them.

Change in supported controllers results in unexpected configuration change

While a valid concern that requires a well-explained upgrade path and prominent warnings, this doesn't occur because an enabled controller can't see its CRD and doesn't load. This occurs when we remove support for an API version, e.g. we've created a KongPlugin v2, are removing support for v1, and you haven't yet migrated your resources.

Upgrading to a version that only supports v2 will drop all of your plugin configuration, but that doesn't happen because auto-detection doesn't see the v1 CRD, it happens because we no longer try to load the v1 controller at all.

If we do have an opportunistic v1 controller loader still, and you still have v1 configuration, the opportunistic controller will load: by virtue of having v1 resources, you must have the v1 CRD installed, and will thus satisfy the auto-loader conditions.

The murky area is version negotiation, where we try to load the newest available version of an API and ignore older versions. I'm not 100% clear on what should happen if you have an environment that has both v1 and v2 of a CRD available: at least with the Ingress example we have now, it seems like we try to load the newest enabled controller and will ignore configuration for older API versions.

You end up in a bad place if you have v2 available, don't auto-upgrade v1 resources to v2 storage, and we load the v2 controller only. I don't think that's any different from the all-auto scenario--we still only look at the newest available version, and if you have both v2 CRDs installed and v1 CRs present, we try to load v2 CRs, see none, ignore your v1 CRs, and you lose config. The remedy here is that you explicitly disable the opportunistic v2 CR loader, load v1 as the newest enabled version instead, and continue to load your v1 CRs.
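The "newest available version wins" negotiation described above can be sketched as follows (the version strings and function name are examples, not KIC's implementation); note how explicitly disabling the newest version would shift selection down the preference list, which is the remedy described:

```go
package main

import "fmt"

// newestSupported picks the newest API version to load: given the
// versions the cluster serves and the versions the controller
// supports, listed newest first, it returns the first match and
// ignores everything older.
func newestSupported(served map[string]bool, preferredNewestFirst []string) (string, bool) {
	for _, v := range preferredNewestFirst {
		if served[v] {
			return v, true
		}
	}
	return "", false
}

func main() {
	served := map[string]bool{
		"extensions/v1beta1":        true,
		"networking.k8s.io/v1beta1": true,
		"networking.k8s.io/v1":      true,
	}
	preferred := []string{"networking.k8s.io/v1", "networking.k8s.io/v1beta1", "extensions/v1beta1"}
	v, ok := newestSupported(served, preferred)
	fmt.Println(v, ok) // networking.k8s.io/v1 true
}
```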

@mflendrich
Contributor

The murky area is version negotiation, where we try to load the newest available version of an API and ignore older versions

This problem is going away soon: #1666

I'm not 100% clear on what should happen if you have an environment that has both v1 and v2 of a CRD available: at least with the Ingress example we have now, it seems like we try to load the newest enabled controller and will ignore configuration for older API versions.

For all 3 versions of Ingress we currently support, the apiserver will present the exact same set of Ingress resources.
