liveness: add health status liveness probe sidecar #1665

Closed · wants to merge 8 commits
1 change: 1 addition & 0 deletions .commitlintrc.yml
@@ -37,6 +37,7 @@ rules:
- e2e
- helm
- journal
- liveness
- rbd
- rebase
- revert
6 changes: 4 additions & 2 deletions cmd/cephcsi.go
@@ -66,10 +66,12 @@ func init() {
flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas")

// liveness/grpc metrics related flags
flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests")
flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests")
Member:

I think this is wrong; it will likely miss any integer-formatting checks. Better to keep it an int.

Author:

Technically yes. I'm trying to avoid the conversion from int to string here, since JoinHostPort() from the net package expects the port in string form.

Also see:
https://github.com/kubernetes-csi/livenessprobe/blob/bba64df584a52d98ddf1904b7f8ec20a2828257c/cmd/livenessprobe/main.go#L39
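For reference, a minimal standalone sketch of the two options being weighed here — an int flag converted with strconv.Itoa versus a string flag passed straight to net.JoinHostPort. The flag names mirror this PR; everything else is illustrative, not the PR's final code:

```go
package main

import (
	"flag"
	"fmt"
	"net"
	"strconv"
)

func main() {
	// Option A: keep the flag an int; flag parsing rejects non-numeric input,
	// but net.JoinHostPort needs a strconv.Itoa conversion.
	metricsPort := flag.Int("metricsport", 8080, "TCP port for liveness/grpc metrics requests")

	// Option B: keep the flag a string; no conversion needed, but a value like
	// "80a0" only fails when the listener actually starts.
	healthzPort := flag.String("healthzport", "9808", "TCP port for healthz requests")

	flag.Parse()

	fmt.Println(net.JoinHostPort("0.0.0.0", strconv.Itoa(*metricsPort))) // "0.0.0.0:8080"
	fmt.Println(net.JoinHostPort("0.0.0.0", *healthzPort))               // "0.0.0.0:9808"
}
```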

Collaborator:

Won't it cause an upgrade issue if someone just updates the image (which is the case for minor releases)?

Author:

I don't think it will cause any upgrade issues.

Collaborator:

Can we please check once?

flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available")
flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll")
flag.DurationVar(&conf.PoolTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds")
flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds")
flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests")
Member:

Ports are integers, not strings.

Author:

Yeah, my previous comments should answer this.

Collaborator:

Why do we need one more port? Can't we just add one more endpoint on the metrics port?

Author:

OK, this is something I followed from the kubernetes-csi livenessprobe project; it's there to give users the option. I'm fine if we just want to use a single port for both metrics and health status.

Check: https://github.com/ceph/ceph-csi/pull/1560/files#diff-e3217918d2c8805d4f5446edf4350a9ee17a8616deb3ce92352a83765c730db5R163
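If the single-port option were preferred, a rough sketch of that alternative could register both paths on one server. This is not this PR's implementation; the handler body, address, and port are placeholders:

```go
package main

import (
	"net"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	// Prometheus metrics and the health check share a single listener.
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Placeholder: the real handler would probe the CSI driver first.
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})

	addr := net.JoinHostPort("0.0.0.0", "8680")
	if err := http.ListenAndServe(addr, mux); err != nil {
		panic(err)
	}
}
```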

flag.StringVar(&conf.HealthzPath, "healthzpath", "/healthz", "path of liveness endpoint where health status will be available")

flag.BoolVar(&conf.EnableGRPCMetrics, "enablegrpcmetrics", false, "[DEPRECATED] enable grpc metrics")
flag.StringVar(&conf.HistogramOption, "histogramoption", "0.5,2,6",
5 changes: 4 additions & 1 deletion deploy/cephfs/kubernetes/csi-cephfsplugin-provisioner.yaml
@@ -148,10 +148,13 @@ spec:
image: quay.io/cephcsi/cephcsi:canary
args:
- "--type=liveness"
- "--v=5"
- "--endpoint=$(CSI_ENDPOINT)"
- "--metricsport=8681"
- "--metricspath=/metrics"
- "--polltime=60s"
- "--healthzport=9681"
- "--healthzpath=/healthz"
- "--polltime=15s"
Member:

Just because it "looks long" is not a good reason. Maybe you can look for other projects that do something like this and see what time they default to?

Collaborator:

you can always pool the CSIDriver when you get a request to get live status.

Author:

Maybe you mean probe? Yes, we can always run a probe check on request.

As pointed out, the other referenced projects are using 2s, and I feel 15s is a good interval, but I will leave it up to the maintainers.

- "--timeout=3s"
env:
- name: CSI_ENDPOINT
5 changes: 4 additions & 1 deletion deploy/cephfs/kubernetes/csi-cephfsplugin.yaml
@@ -99,10 +99,13 @@ spec:
image: quay.io/cephcsi/cephcsi:canary
args:
- "--type=liveness"
- "--v=5"
- "--endpoint=$(CSI_ENDPOINT)"
- "--metricsport=8681"
- "--metricspath=/metrics"
- "--polltime=60s"
- "--healthzport=9681"
- "--healthzpath=/healthz"
- "--polltime=15s"
- "--timeout=3s"
env:
- name: CSI_ENDPOINT
5 changes: 4 additions & 1 deletion deploy/rbd/kubernetes/csi-rbdplugin-provisioner.yaml
@@ -152,10 +152,13 @@ spec:
image: quay.io/cephcsi/cephcsi:canary
args:
- "--type=liveness"
- "--v=5"
- "--endpoint=$(CSI_ENDPOINT)"
- "--metricsport=8680"
- "--metricspath=/metrics"
- "--polltime=60s"
- "--healthzport=9680"
- "--healthzpath=/healthz"
- "--polltime=15s"
- "--timeout=3s"
env:
- name: CSI_ENDPOINT
5 changes: 4 additions & 1 deletion deploy/rbd/kubernetes/csi-rbdplugin.yaml
@@ -102,10 +102,13 @@ spec:
image: quay.io/cephcsi/cephcsi:canary
args:
- "--type=liveness"
- "--v=5"
- "--endpoint=$(CSI_ENDPOINT)"
- "--metricsport=8680"
- "--metricspath=/metrics"
- "--polltime=60s"
- "--healthzport=9680"
- "--healthzpath=/healthz"
- "--polltime=15s"
- "--timeout=3s"
env:
- name: CSI_ENDPOINT
1 change: 1 addition & 0 deletions docs/development-guide.md
@@ -146,6 +146,7 @@ The `component` in the subject of the commit message can be one of the following

* `cephfs`: bugs or enhancements related to CephFS
* `rbd`: bugs or enhancements related to RBD
* `liveness`: bugs or enhancements related to Liveness
* `doc`: documentation updates
* `util`: utilities shared between components use `cephfs` or `rbd` if the
change is only relevant for one of the type of storage
101 changes: 82 additions & 19 deletions internal/liveness/liveness.go
@@ -18,6 +18,8 @@ package liveness

import (
"context"
"net"
"net/http"
"time"

"github.com/ceph/ceph-csi/internal/util"
@@ -29,6 +31,11 @@ import (
"google.golang.org/grpc"
)

type probeConn struct {
conn *grpc.ClientConn
config *util.Config
}

var (
liveness = prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: "csi",
@@ -37,57 +44,113 @@ var (
})
)

func getLiveness(timeout time.Duration, csiConn *grpc.ClientConn) {
ctx, cancel := context.WithTimeout(context.Background(), timeout)
func (c *probeConn) checkProbe(w http.ResponseWriter, req *http.Request) {
ctx, cancel := context.WithTimeout(req.Context(), c.config.ProbeTimeout)
defer cancel()

util.TraceLogMsg("Sending probe request to CSI driver")
ready, err := rpc.Probe(ctx, csiConn)
util.TraceLog(ctx, "Healthz req: Sending probe request to CSI driver %s", c.config.DriverName)
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte(err.Error()))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
Collaborator:

Won't the err value be overwritten when you call w.Write()?

Author:

Good catch, will fix it.
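A sketch of the fix being agreed on here, inside checkProbe and using the identifiers from this diff: keep the write error in its own variable so the probe error is still available for logging (not necessarily the final code):

```go
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
	w.WriteHeader(http.StatusInternalServerError)
	// Write into a separate variable so err still holds the probe failure.
	if _, werr := w.Write([]byte(err.Error())); werr != nil {
		util.ErrorLog(ctx, "Healthz req: write failed: %v", werr)
	}
	util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
	return
}
```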

return
}

if !ready {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}

util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
Comment on lines +64 to +70
Collaborator:

Looks like this can go into a helper function.

Author:

Sure.
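A hedged sketch of what such a helper might look like; the name writeResponse and its exact shape are assumptions, not this PR's final code:

```go
// writeResponse sets the status code, writes the body, and logs any failure
// in one place, so checkProbe only decides what to report.
func writeResponse(ctx context.Context, w http.ResponseWriter, statusCode int, msg string) {
	w.WriteHeader(statusCode)
	if _, err := w.Write([]byte(msg)); err != nil {
		util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
	}
	if statusCode != http.StatusOK {
		util.ErrorLog(ctx, "Healthz req: %s", msg)
	}
}

// Usage inside checkProbe:
//   writeResponse(ctx, w, http.StatusInternalServerError, "driver responded but is not ready")
//   writeResponse(ctx, w, http.StatusOK, "ok")
```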

return
}

w.WriteHeader(http.StatusOK)
_, err = w.Write([]byte(`ok`))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ExtendedLog(ctx, "Healthz req: Health check succeeded")
}
Comment on lines +52 to +80
Collaborator:

Suggested change
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte(err.Error()))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
return
}
if !ready {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
return
}
w.WriteHeader(http.StatusOK)
_, err = w.Write([]byte(`ok`))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ExtendedLog(ctx, "Healthz req: Health check succeeded")
}
resp:="ok"
statuscode=http.StatusOK
ready, err := rpc.Probe(ctx, c.conn)
if err!=nil{
resp=err.Error()
statuscode=http.StatusInternalServerError
}else{
if !ready{
resp:="Healthz req: driver responded but is not ready"
statuscode=http.StatusInternalServerError
}
}
w.WriteHeader(statuscode)
_, err = w.Write([]byte(resp))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}


func getLiveness(c *probeConn) {
ctx, cancel := context.WithTimeout(context.Background(), c.config.ProbeTimeout)
defer cancel()

util.TraceLog(ctx, "Metrics req: Sending probe request to CSI driver: %s", c.config.DriverName)
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
liveness.Set(0)
util.ErrorLogMsg("health check failed: %v", err)
util.ErrorLog(ctx, "Metrics req: health check failed: %v", err)
Collaborator:

The ctx-based logging is not very helpful here, as the context doesn't contain any of the information we need.

Author:

OK. Will it hurt if we have ctx?

Collaborator:

It won't hurt, but since we have a separate function for logging without a context, let's use it.
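Both variants already appear elsewhere in this diff; a short contrast of the intended usage (illustrative, not a new API):

```go
// In the periodic poll loop there is no request context, so the Msg variant fits:
util.ErrorLogMsg("health check failed: %v", err)

// In the healthz handler a request-scoped context is available:
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
```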

return
}

if !ready {
liveness.Set(0)
util.ErrorLogMsg("driver responded but is not ready")
util.ErrorLog(ctx, "Metrics req: driver responded but is not ready")
return
}
liveness.Set(1)
util.ExtendedLogMsg("Health check succeeded")
util.ExtendedLog(ctx, "Metrics req: Health check succeeded")
}

func recordLiveness(endpoint, drivername string, pollTime, timeout time.Duration) {
liveMetricsManager := metrics.NewCSIMetricsManager(drivername)
func recordLiveness(c *probeConn) {
// register prometheus metrics
err := prometheus.Register(liveness)
if err != nil {
util.FatalLogMsg(err.Error())
}

csiConn, err := connlib.Connect(endpoint, liveMetricsManager)
if err != nil {
// connlib should retry forever so a returned error should mean
// the grpc client is misconfigured rather than an error on the network
util.FatalLogMsg("failed to establish connection to CSI driver: %v", err)
}

// get liveness periodically
ticker := time.NewTicker(pollTime)
ticker := time.NewTicker(c.config.PollTime)
defer ticker.Stop()
for range ticker.C {
getLiveness(timeout, csiConn)
getLiveness(c)
}
}

// Run starts liveness collection and prometheus endpoint.
func Run(conf *util.Config) {
util.ExtendedLogMsg("Liveness Running")

liveMetricsManager := metrics.NewCSIMetricsManager("")

csiConn, err := connlib.Connect(conf.Endpoint, liveMetricsManager)
if err != nil {
// connlib should retry forever so a returned error should mean
// the grpc client is misconfigured rather than an error on the network
util.FatalLogMsg("failed to establish connection to CSI driver: %v", err)
}

conf.DriverName, err = rpc.GetDriverName(context.Background(), csiConn)
if err != nil {
util.FatalLogMsg("failed to get CSI driver name: %v", err)
}
liveMetricsManager.SetDriverName(conf.DriverName)
util.ExtendedLogMsg("CSI driver: %s, Endpoint: %s", conf.DriverName, conf.Endpoint)

pc := &probeConn{
config: conf,
conn: csiConn,
}

// start liveness collection
go recordLiveness(conf.Endpoint, conf.DriverName, conf.PollTime, conf.PoolTimeout)
go recordLiveness(pc)

// start up prometheus endpoint
util.StartMetricsServer(conf)

address := net.JoinHostPort(conf.MetricsIP, conf.HealthzPort)
http.HandleFunc(conf.HealthzPath, pc.checkProbe)
util.ExtendedLogMsg("Serving Health requests on: http://%s%s", address, conf.HealthzPath)
err = http.ListenAndServe(address, nil)
Comment on lines +149 to +152
Collaborator:

Why are we starting one more server? Is it needed?

Author:

Yeah, this is only needed if we want to listen on separate ports. For a single port with multiple paths like '/metrics' and '/healthz', we can live with a single server.

Collaborator:

I would suggest starting one server, not multiple servers.

if err != nil {
util.FatalLogMsg("failed to listen on address %s: %v", address, err)
}
}
16 changes: 10 additions & 6 deletions internal/util/httpserver.go
@@ -4,7 +4,6 @@ import (
"net"
"net/http"
"net/url"
"strconv"

"github.com/prometheus/client_golang/prometheus/promhttp"
)
@@ -17,10 +16,15 @@ func ValidateURL(c *Config) error {

// StartMetricsServer starts http server.
func StartMetricsServer(c *Config) {
addr := net.JoinHostPort(c.MetricsIP, strconv.Itoa(c.MetricsPort))
addr := net.JoinHostPort(c.MetricsIP, c.MetricsPort)
http.Handle(c.MetricsPath, promhttp.Handler())
err := http.ListenAndServe(addr, nil)
if err != nil {
FatalLogMsg("failed to listen on address %v: %s", addr, err)
}
ExtendedLogMsg("Serving Metrics requests on: http://%s%s", addr, c.MetricsPath)

// Spawn a new go routine to listen on specified endpoint
go func() {
err := http.ListenAndServe(addr, nil)
if err != nil {
FatalLogMsg("failed to listen on address %v: %s", addr, err)
}
}()
}
6 changes: 4 additions & 2 deletions internal/util/util.go
@@ -81,9 +81,11 @@ type Config struct {
HistogramOption string // Histogram option for grpc metrics, should be comma separated value, ex:= "0.5,2,6" where start=0.5 factor=2, count=6
MetricsIP string // TCP port for liveness/ metrics requests
PidLimit int // PID limit to configure through cgroups")
MetricsPort int // TCP port for liveness/grpc metrics requests
MetricsPort string // TCP port for liveness/grpc metrics requests
HealthzPort string // TCP port for liveness/health requests
HealthzPath string // path for liveness/health requests
PollTime time.Duration // time interval in seconds between each poll
PoolTimeout time.Duration // probe timeout in seconds
ProbeTimeout time.Duration // probe timeout in seconds
EnableGRPCMetrics bool // option to enable grpc metrics

IsControllerServer bool // if set to true start provisoner server