sig-api-machinery: Add scale targets to CRDs to GA KEP #1015

Merged 3 commits on Jul 29, 2019
124 changes: 112 additions & 12 deletions keps/sig-api-machinery/20180415-crds-to-ga.md
@@ -25,6 +25,7 @@ see-also:
- "[Umbrella Issue](https://github.com/kubernetes/kubernetes/issues/58682)"
- "[Vanilla OpenAPI Subset Design](https://docs.google.com/document/d/1pcGlbmw-2Y0JJs9hsYnSBXamgG9TfWtHY6eh80zSTd8)"
- "[Pruning for CustomResources KEP](https://github.com/kubernetes/enhancements/pull/709)"
- "[Defaulting for Custom Resources KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190426-crd-defaulting.md)"
---

# Title
@@ -95,21 +96,21 @@ Bug fixes required to graduate CRDs to GA:

* See “Required for GA” issues tracked via the [CRD Project Board](https://github.com/orgs/kubernetes/projects/28).

For additional details on already completed features, see the [Umbrella Issue](https://github.com/kubernetes/kubernetes/issues/58682).
For additional details on already completed features, see the [CRD Project Board](https://github.com/orgs/kubernetes/projects/28).

See [Post-GA tasks](#post-ga-tasks) for decided out-of-scope features.

### Defaulting and pruning for custom resources is implemented

Both defaulting and pruning and also read-only validation are blocked by the
OpenAPI subset definition (next point). An update of the [old Pruning for
CustomResources KEP](https://github.com/kubernetes/enhancements/pull/709) and the implementation
([pruning PR](https://github.com/kubernetes/kubernetes/pull/64558), [defaulting
PR](https://github.com/kubernetes/kubernetes/pull/63604)), are follow-ups as soon as unblocked.
OpenAPI subset definition (next point).

See the [Pruning for CustomResources KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20180731-crd-pruning.md)
and the [Defaulting for Custom Resources KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190426-crd-defaulting.md).

### CRD v1 schemas are restricted to a subset of the OpenAPI specification

See [Vanilla OpenAPI Subset Design](https://docs.google.com/document/d/1pcGlbmw-2Y0JJs9hsYnSBXamgG9TfWtHY6eh80zSTd8)
See [OpenAPI Subset KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190425-structural-openapi.md)

### Generator exists for CRD Validation Schema v3 (Kubebuilder)

@@ -121,8 +122,8 @@ to be integrated into kubebuilder’s controller-tools.

### CustomResourceWebhookConversion API is GA ready

Currently CRD webhook conversion is alpha. We plan to take this to v1beta1 via the
"Graduation Criteria" proposed in [PR #1004](https://github.com/kubernetes/enhancements/pull/1004).
Currently CRD webhook conversion is alpha. We plan to take this to v1beta1 according to the
[CustomResourceDefinition Conversion Webhook's Graduation Criteria](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190425-crd-conversion-webhook.md#graduation-criteria).
We plan to then graduate this to GA as part of the CRD to GA graduation.

### CustomResourceSubresources API is GA ready
@@ -162,10 +163,109 @@ TODO: complete this list

### Scale Targets for GA

* TODO quantify: Read/write latency of CRDs within X% of native Kubernetes types
* TODO quantify: Latency degrades less than X% for up to 100k Custom Resources per CRD kind
* TODO quantify: Webhook conversion QPS of a noop converter is within X% of QPS with no webhook
* Coordinate with sig-scalability
The scale targets for GA of custom resources are defined by the same [API call latency
SLIs/SLOs as the Kubernetes native types](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md#api-call-latency-slisslos-details).

The targets are defined by the suggested maximum limits below, which are organized the same way as the [Kubernetes native type thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md#kubernetes-thresholds), with only one change:

- Since custom resources can be arbitrarily large, we have broken down the limit by custom resource object size.

**Custom Resource Definitions:**

| Suggested Maximum Limit: scope=cluster |
| --- |
| 500 |

_Note: The Custom Resource Definition suggested maximum limit was selected not
because of the above SLIs/SLOs, but because of the latency of OpenAPI publishing,
a background process that runs asynchronously each time a Custom
Resource Definition schema is updated. For 500 Custom Resource Definitions it takes
slightly over 35 seconds for a definition change to be visible via the OpenAPI
spec endpoint._

**Custom Resources, Cluster Wide:**

Cluster-wide limits for custom resources are storage bound, and custom resources
share the storage space with all other objects. While determining the
appropriate storage limit for a cluster is out of scope for this document, once
an etcd storage limit is selected, the suggested maximum limits for custom resources
are:

| etcd storage limit | Suggested Maximum Limit: scope=cluster |
| --- | --- |
| 4GB | 40000 |
| 8GB | 80000 |

These limits aim to keep custom resource storage usage to less than half of the
total cluster storage capacity for custom resources of 50kb or less in size.
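
The storage share implied by these limits can be checked with simple arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes decimal units (1kb = 1000 bytes, 1GB = 10^9 bytes), which the text does not specify. Under that assumption, the worst case where every object sits at the 50kb ceiling uses half of the storage limit, and smaller objects use proportionally less.

```python
# Illustrative arithmetic; names are hypothetical and not part of the KEP.
KB = 1000        # assumption: decimal kilobytes
GB = 1000 ** 3   # assumption: decimal gigabytes

# (etcd storage limit in GB, suggested cluster-wide custom resource limit)
CLUSTER_WIDE_LIMITS = [(4, 40000), (8, 80000)]


def worst_case_storage_fraction(object_limit, object_size_bytes, storage_limit_bytes):
    """Fraction of the storage limit used if every object has the given size."""
    return (object_limit * object_size_bytes) / storage_limit_bytes


if __name__ == "__main__":
    for storage_gb, object_limit in CLUSTER_WIDE_LIMITS:
        fraction = worst_case_storage_fraction(object_limit, 50 * KB, storage_gb * GB)
        print(f"{storage_gb}GB limit, {object_limit} objects of 50kb: {fraction:.0%} of storage")
```

With binary units (KiB and GiB) the worst-case share comes out slightly lower, so the choice of unit does not change the conclusion.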

**Custom Resources per Definition:**

For each custom resource definition, the limit on the number of custom resources
can be found by taking the (median) object size of the custom resource and finding
the matching row in this table:

| Object size | Suggested Maximum Limit: scope=namespace (5s p99 SLO) | Suggested Maximum Limit: scope=cluster (30s p99 SLO) |
| --- | --- | --- |
| <=10kb | 1500 | 10000 |
| (10kb - 25kb] | 600 | 4000 |
| (25kb - 50kb] | 300 | 2000 |

Review thread (on the table header row):

Member: Why does the cluster scope have a longer SLO?

Contributor Author: This is to be consistent with how the SLOs for native types are defined (https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md#api-call-latency-slisslos-details). @wojtek-t, do you know the background?

Member: It's connected with the number of objects: listing objects takes some constant time plus something proportional to the number of objects processed. And within a single namespace (scope=namespace) we officially allow a much smaller number of objects.

The cluster scope indicates the total number of custom resources for that
definition allowed in the entire cluster.

The namespace scope indicates the total number of custom resources for that
definition allowed in any particular namespace. The cumulative count of the
custom resource across all namespaces must not exceed the cluster limit.

Since, in practice, custom resources scale farther without conversion webhooks
within the SLI/SLOs (roughly 2x according to our scale tests), custom resource
definition authors should be careful to adhere to these limits so that in the
future a webhook converter may safely be added as part of a custom resource
version upgrade.

_Note: For custom resources of custom resource definitions using `scope: Namespaced`: the scope=namespace
suggested maximum limit indicates how many custom resource objects may be in each namespace,
and the scope=cluster suggested maximum limit indicates how many custom resource objects may
be in the cluster total. For custom resources of custom resource definitions using `scope: Cluster`: only
the scope=cluster suggested maximum limit applies._
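
As a rough illustration of how the Custom Resources per Definition table is meant to be read, the lookup can be sketched as below. The function and constant names are hypothetical, and returning `None` for objects above 50kb is an assumption made here because the table gives no row for that size.

```python
# Illustrative lookup; names are hypothetical and not part of the KEP.
# Each row: (inclusive upper bound on median object size in kb,
#            scope=namespace limit, scope=cluster limit)
PER_DEFINITION_LIMITS = [
    (10, 1500, 10000),
    (25, 600, 4000),
    (50, 300, 2000),
]


def suggested_max_objects(median_object_kb, scope):
    """Return the suggested maximum object count, or None above 50kb."""
    if scope not in ("namespace", "cluster"):
        raise ValueError(f"unknown scope: {scope!r}")
    for upper_kb, namespace_limit, cluster_limit in PER_DEFINITION_LIMITS:
        if median_object_kb <= upper_kb:
            return namespace_limit if scope == "namespace" else cluster_limit
    return None  # assumption: no suggested limit is given above 50kb


if __name__ == "__main__":
    print(suggested_max_objects(8, "namespace"))
    print(suggested_max_objects(30, "cluster"))
```

For a definition using `scope: Cluster`, only the `"cluster"` lookup is relevant.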

**Conversion Webhooks:**

Conversion webhook SLOs are defined from the perspective of the conversion
webhook. They do not include any api-server serialization/deserialization for
making the request to the webhook, but they do include network latency.

Given that the performance and scalability of conversion webhooks are the
responsibility of their author, custom resource scale targets apply only to
conversion webhooks that stay within the following latencies for the above suggested
maximum limits.

| scope | object count limit | Expected conversion Webhook SLO: p99 latency |
| --- | --- | --- |
| resource | 1 | 50ms |
| namespace | 1500 (<=10kb), 600 (10-25kb) or 300 (25-50kb) | 1 second |
| cluster | 10000 (<=10kb), 4000 (10-25kb) or 2000 (25-50kb) | 6 seconds |

The scope=resource's higher "per-object" latency (50ms vs ~1.5ms for namespace
and cluster scope) accommodates a constant per-request serving cost.
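
The per-object figures can be derived by dividing each SLO by its object count limit. The sketch below is illustrative only; the names are hypothetical, and the rows are copied from the table above. By this arithmetic the namespace and cluster budgets range from 0.6ms to about 3.3ms per object depending on the size bucket, with the 10-25kb rows landing near the ~1.5ms quoted.

```python
# Illustrative arithmetic; names are hypothetical and not part of the KEP.
# Each row: (scope, object size bucket, object count limit, webhook p99 SLO in ms)
WEBHOOK_SLO_ROWS = [
    ("resource", "-", 1, 50),
    ("namespace", "<=10kb", 1500, 1000),
    ("namespace", "10-25kb", 600, 1000),
    ("namespace", "25-50kb", 300, 1000),
    ("cluster", "<=10kb", 10000, 6000),
    ("cluster", "10-25kb", 4000, 6000),
    ("cluster", "25-50kb", 2000, 6000),
]


def per_object_budget_ms(slo_ms, object_count):
    """Average conversion time available per object within the SLO."""
    return slo_ms / object_count


if __name__ == "__main__":
    for scope, size, count, slo_ms in WEBHOOK_SLO_ROWS:
        budget = per_object_budget_ms(slo_ms, count)
        print(f"{scope:9} {size:8} {budget:6.2f} ms per object")
```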

The object sizes and suggested maximum limits in the Custom Resources per
Definition table apply to these conversion webhook SLOs. For example, for a
list request for 1500 custom resource objects that are 10kb in size, the namespace
scope SLO of 1 second for the conversion webhook applies.

Review thread (on the paragraph above):

Member: Is the table trying to say "return a single object in 50ms, up to 1500 10k objects in 1s, 10000 10k objects in 6 seconds"?

If so, that mathematically doesn't make much sense, the latter implies that a single object must be processed in .6ms?

Contributor Author: I've inlined the object count details in the webhook SLOs to make it easier to understand. The 50ms for a single object is higher to account for a request serving constant factor. I've added a note.
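
Read row by row, the conversion webhook table maps a request's scope, object count, and object size to an expected p99 latency. A minimal sketch of that mapping follows. The names are hypothetical, and two details are assumptions made here rather than statements from the KEP: requests outside the suggested limits return `None`, and the resource row is treated as independent of object size.

```python
# Illustrative lookup; names are hypothetical and not part of the KEP.
WEBHOOK_P99_SLO_MS = {"resource": 50, "namespace": 1000, "cluster": 6000}

# Each row: (inclusive upper bound on object size in kb,
#            namespace object count limit, cluster object count limit)
SIZE_BUCKETS = [(10, 1500, 10000), (25, 600, 4000), (50, 300, 2000)]


def applicable_webhook_slo_ms(scope, object_count, object_kb):
    """Return the webhook p99 SLO in ms, or None if outside the suggested limits."""
    if scope not in WEBHOOK_P99_SLO_MS:
        raise ValueError(f"unknown scope: {scope!r}")
    if scope == "resource":
        # Assumption: the single-object row does not depend on object size.
        return WEBHOOK_P99_SLO_MS[scope] if object_count == 1 else None
    for upper_kb, namespace_limit, cluster_limit in SIZE_BUCKETS:
        if object_kb <= upper_kb:
            limit = namespace_limit if scope == "namespace" else cluster_limit
            return WEBHOOK_P99_SLO_MS[scope] if object_count <= limit else None
    return None  # assumption: no SLO is stated for objects above 50kb


if __name__ == "__main__":
    # A list of 1500 objects of 10kb within one namespace.
    print(applicable_webhook_slo_ms("namespace", 1500, 10))
```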

**Scale Target Data**

GA custom resource scale targets were selected based on an [analysis of our current scale limits](https://docs.google.com/document/d/1tEstPQvzGvaRnN-WwGUWx1H9xHPRCy_fFcGlgTkB3f8).

We ran a month-long survey of Custom Resource Definition scale needs across Kubernetes mailing lists, Slack channels and social media.
Of the custom resource definitions surveyed, 96% are currently within these suggested maximum limits and 91% are within these limits for their anticipated future growth. The survey data also provides useful guidance for our post-GA scalability work. See [survey of real-world custom resource usage](https://docs.google.com/document/d/1MTd_gDlpgBaT5sAKM4j6tQVeCFIT9J44RHzt2yWOK_g) for details.

As part of GA, the suggested maximum limits and SLO documentation will be updated to make this clear and to
encourage CRD authors to provide their users with concrete suggested maximum limits and SLIs/SLOs for their custom
resource kinds that account for the per-resource conversion cost
of their conversion webhook and/or the size of their custom resources.

## Graduation Criteria
