
Kepler Operator Requirement

Kaiyi edited this page Sep 27, 2022 · 8 revisions

Requirement

The operator should be able to probe the cluster and nodes to ensure Kepler is running on a supported environment and starts up with the right configuration.

After Kepler is up, the operator should integrate with Prometheus and Grafana to create a ServiceMonitor and Grafana dashboard, in accordance with the CRD spec.

Cluster Probe

The operator will probe the nodes and resolve dependencies, installing the following packages if they are missing (nodes where installation is not possible are excluded):

  • Kernel-devel
  • Cgroup
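The probe logic described above could be sketched roughly as follows. This is only an illustration, not the operator's actual implementation: `NodeFacts`, `ProbeResult`, `probe_node`, and `eligible_nodes` are hypothetical names, and how the operator would really detect or install a dependency on a node is left open here.

```python
from dataclasses import dataclass
from typing import List

# Dependency names taken from the list above; the spelling is illustrative.
REQUIRED_DEPENDENCIES = ("kernel-devel", "cgroup")


@dataclass(frozen=True)
class NodeFacts:
    """Hypothetical view of one node; not a real Kubernetes or Kepler type."""
    name: str
    present: frozenset = frozenset()      # dependencies already on the node
    installable: frozenset = frozenset()  # dependencies the operator could install


@dataclass(frozen=True)
class ProbeResult:
    node: str
    to_install: List[str]
    eligible: bool


def probe_node(node: NodeFacts) -> ProbeResult:
    """Decide what to install on a node, or mark the node as unusable."""
    missing = [d for d in REQUIRED_DEPENDENCIES if d not in node.present]
    blocked = [d for d in missing if d not in node.installable]
    if blocked:
        # A missing dependency cannot be installed: avoid this node.
        return ProbeResult(node.name, [], False)
    return ProbeResult(node.name, missing, True)


def eligible_nodes(nodes: List[NodeFacts]) -> List[ProbeResult]:
    """Keep only the nodes Kepler could be scheduled on."""
    return [r for r in map(probe_node, nodes) if r.eligible]
```

A node that is missing a package but can have it installed stays eligible with a non-empty `to_install` list; a node where installation is impossible is dropped.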

CRD Spec

The CRD specifies the following:

  • Kepler deployment
    • RBAC, deployment configuration (including whether to use /proc (for cgroup v1), the model server endpoint, and whether to use the estimator), metrics Service
  • Kepler integration
    • ServiceMonitor, Grafana instance, datasource, dashboard
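To make the shape of the spec more concrete, here is a minimal sketch of the fields listed above, modeled as plain Python dataclasses. All field names, defaults, and resource labels are assumptions made for illustration; the real CRD schema is not defined on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DeploymentSpec:
    """Hypothetical deployment section of the Kepler CR."""
    use_proc: bool = False                       # use /proc (the cgroup v1 case)
    model_server_endpoint: Optional[str] = None  # unset means no model server
    use_estimator: bool = False


@dataclass
class IntegrationSpec:
    """Hypothetical integration section of the Kepler CR."""
    service_monitor: bool = False
    grafana_instance: bool = False
    grafana_datasource: bool = False
    grafana_dashboard: bool = False


@dataclass
class KeplerSpec:
    deployment: DeploymentSpec = field(default_factory=DeploymentSpec)
    integration: IntegrationSpec = field(default_factory=IntegrationSpec)


def planned_resources(spec: KeplerSpec) -> List[str]:
    """List what the operator would create for a spec (labels are illustrative)."""
    # The deployment part is always created: RBAC, the workload, the metrics Service.
    resources = ["rbac", "kepler-deployment", "metrics-service"]
    # Integration resources are created only when the spec asks for them.
    if spec.integration.service_monitor:
        resources.append("service-monitor")
    if spec.integration.grafana_instance:
        resources.append("grafana-instance")
    if spec.integration.grafana_datasource:
        resources.append("grafana-datasource")
    if spec.integration.grafana_dashboard:
        resources.append("grafana-dashboard")
    return resources
```

In this sketch every integration is opt-in, so an empty spec yields only the deployment resources.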

Minimum scope:

  • Only the Service, Deployment, and similar core resources for Kepler, so that users can set up Kepler on their own cluster by running kubectl apply -f on a sample manifest.
  • Any ports and Kubernetes resources, such as ClusterRole permissions, should be defined in the manifests.
  • At the minimum scope, we should avoid granting any permissions through a ClusterRoleBinding.

The minimum-scope documentation will guide developers in developing and configuring the Kepler deployment (created by the operator) alongside other tools for observability, service mesh, disk, key management, and so on.

Extendable: To allow extension with other tools (taking the Prometheus Operator as an example), we can define specific fields/properties in the CRD. Any ClusterRoleBinding used to integrate with tools outside the Kepler code scope belongs here (for example, the ClusterRoleBinding for service monitoring).

  • monitoring: Prometheus
  • distributed tracing: Jaeger/OTEL
  • logging: ELK?

Optional:

  • cert management operator? for (m)TLS?
  • service mesh?

Any extension should build on the minimum scope (for example, its port settings). Feel free to open a GitHub issue to request integration with a new tool.
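The split between the minimum scope and the extensions could be sketched as below. This is a rough illustration under stated assumptions: `Extension`, `render_manifests`, and the resource names are hypothetical, and the only rule it encodes is the one above, namely that the base set contains no ClusterRoleBinding and that each extension contributes its own.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Manifest = Tuple[str, str]  # (kind, name); both values are illustrative


@dataclass(frozen=True)
class Extension:
    """Hypothetical add-on, e.g. monitoring through the Prometheus Operator."""
    name: str
    cluster_role_binding: Optional[str] = None  # set only if the tool needs one


def render_manifests(extensions: List[Extension]) -> List[Manifest]:
    """Return the minimum-scope manifests plus whatever each extension adds."""
    # Minimum scope: core resources only, with no ClusterRoleBinding.
    manifests: List[Manifest] = [("Deployment", "kepler"), ("Service", "kepler")]
    for ext in extensions:
        # Cluster-wide bindings enter only through an explicitly enabled extension.
        if ext.cluster_role_binding is not None:
            manifests.append(("ClusterRoleBinding", ext.cluster_role_binding))
    return manifests
```

With no extensions, the output stays within the minimum scope; enabling a monitoring extension appends its binding without changing the base resources.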

Summary of Discussion

  1. Operator scope: there will be two broad scopes.

    • kepler-system
      • collector
      • estimator (disabled by default)
      • model-server (disabled by default)
    • add-ons (kepler-defined CRs of external systems)
  2. Current focus systems to support

    • Underlying Node
      • Intel Architecture
      • Bare metal
      • cgroupv1/cgroupv2
      • with/without RAPL
    • Cluster
      • OpenShift
      • Plain Kubernetes
  3. Design to support

    • Support multiple underlying CPU architectures
      • auto-discovered by
        • Node resource info
        • (a discovery DaemonSet if needed, reference)
      • hide the arch-specific tag from the user behind the latest tag? (pull the arch-specific tag to a node-local registry and tag it with a common tag such as latest)
  4. Update Kepler CR

What should the new API look like? Should we define one group for KeplerSystem and one for KeplerAddons, and then define corresponding controllers for Kepler System and Kepler Addons?
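For the multi-architecture item in the discussion summary, auto-discovery from node resource info could look roughly like the sketch below. It relies on the well-known `kubernetes.io/arch` node label; the `image_for_node` helper, the `<tag>-<arch>` naming scheme, the supported-architecture list, and the repository name in the usage note are assumptions made for illustration, not an agreed design.

```python
from typing import Mapping, Optional, Sequence

ARCH_LABEL = "kubernetes.io/arch"  # well-known label carrying the node's CPU architecture


def image_for_node(
    node_labels: Mapping[str, str],
    repository: str,
    tag: str = "latest",
    supported: Sequence[str] = ("amd64",),  # assumed list of supported architectures
) -> Optional[str]:
    """Pick an arch-specific image reference for a node, or None if unsupported."""
    arch = node_labels.get(ARCH_LABEL)
    if arch not in supported:
        # Unknown or unsupported architecture: do not schedule Kepler here.
        return None
    # Hypothetical naming scheme: "<repository>:<tag>-<arch>".
    return f"{repository}:{tag}-{arch}"
```

For example, `image_for_node({"kubernetes.io/arch": "amd64"}, "example.invalid/kepler")` yields `example.invalid/kepler:latest-amd64`, while a node without the label yields `None`. The user only ever specifies the common tag, and the arch suffix is resolved per node.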