liveness: add health status liveness probe sidecar #1665

pkalever · 2020-11-04T20:35:09Z

Describe what this PR does

The health status liveness probe shares, and runs within, the liveness-prometheus container. It listens for and serves requests on a dedicated port and path. By default it listens on the '/healthz' path and port '9680'; both are configurable.

Useful logs:

[0] pkalever 😎 ceph-csi✨ kubectl logs csi-rbdplugin-5dzrn liveness-prometheus                                   
I1104 20:16:06.923915  212212 cephcsi.go:125] Driver version: canary and Git version: 59aff136b51f38bdbca65bb0782215b7b786f89b
I1104 20:16:06.924212  212212 cephcsi.go:171] Starting driver type: liveness with name: liveness.csi.ceph.com     
I1104 20:16:06.924263  212212 liveness.go:122] Liveness Running                  
I1104 20:16:06.924458  212212 httpserver.go:21] Serving Metrics requests on: http://192.168.39.188:8680/metrics   
I1104 20:16:06.924555  212212 liveness.go:133] Serving Health requests on: http://192.168.39.188:9680/healthz
I1104 20:16:06.924739  212212 connection.go:153] Connecting to unix:///csi/csi.sock                               
I1104 20:16:06.927573  212212 liveness.go:110] CSI driver: "rbd.csi.ceph.com", Endpoint: unix:///csi/csi.sock
I1104 20:16:21.928003  212212 liveness.go:73] Metrics req: Sending probe request to CSI driver: "rbd.csi.ceph.com"
I1104 20:16:21.930418  212212 liveness.go:87] Metrics req: Health check succeeded
[...]
I1104 20:17:21.931447  212212 liveness.go:87] Metrics req: Health check succeeded
I1104 20:17:36.928015  212212 liveness.go:73] Metrics req: Sending probe request to CSI driver: "rbd.csi.ceph.com"
I1104 20:17:36.930653  212212 liveness.go:87] Metrics req: Health check succeeded
I1104 20:17:48.331070  212212 liveness.go:48] Healthz req: Sending probe request to CSI driver "rbd.csi.ceph.com" 
I1104 20:17:48.332878  212212 liveness.go:66] Healthz req: Health check succeeded

Useful commands:

# kubectl exec -it  csi-rbdplugin-5dzrn -c csi-rbdplugin -- curl -X GET http://192.168.39.188:8680/metrics
# kubectl exec -it  csi-rbdplugin-5dzrn -c csi-rbdplugin -- curl -X GET http://192.168.39.188:9680/healthz

Also,
# minikube ssh
# docker ps | grep csi-rbdplugin-
# docker stop <container-ID>

Fixes: #1096

Signed-off-by: Prasanna Kumar Kalever [email protected]

The verbose flag is missing for all liveness-prometheus containers; this patch adds verbosity 5 to all liveness containers. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

s/PoolTimeout/ProbeTimeout/ Signed-off-by: Prasanna Kumar Kalever <[email protected]>

MetricIP is consumed as a string, so collecting it as an int doesn't make any sense and would require a type conversion later. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

nixpanic · 2020-11-05T10:57:21Z

cmd/cephcsi.go

@@ -66,7 +66,7 @@ func init() {
 flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas")

 // liveness/grpc metrics related flags
- flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests")
+ flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests")


I think this is wrong, it will likely miss any integer-formatting checks. Better keep it an int.

Technically yes. I'm trying to save the conversion from int to string here, as JoinHostPort() from the net package expects the port in string form.

Also see:
https:/kubernetes-csi/livenessprobe/blob/bba64df584a52d98ddf1904b7f8ec20a2828257c/cmd/livenessprobe/main.go#L39

Won't it cause an upgrade issue if someone just updates the image (which is the case for minor releases)?

I don't think it will cause any upgrade issues.

can we please check once?
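
One way to reconcile the two positions in this thread is to keep the flag as an int, so the flag package still rejects non-numeric input, and convert it only when building the listen address. The following is a minimal sketch under that assumption; the helper name listenAddress and the flag set name are illustrative and are not taken from this PR.

```go
package main

import (
	"flag"
	"fmt"
	"net"
	"os"
	"strconv"
)

// listenAddress is a hypothetical helper: it validates an integer port and
// converts it to the string form that net.JoinHostPort expects.
func listenAddress(ip string, port int) (string, error) {
	if port < 1 || port > 65535 {
		return "", fmt.Errorf("invalid TCP port %d", port)
	}
	return net.JoinHostPort(ip, strconv.Itoa(port)), nil
}

func main() {
	fs := flag.NewFlagSet("liveness-sketch", flag.ExitOnError)

	// IntVar makes the flag package reject values such as --metricsport=abc.
	var port int
	fs.IntVar(&port, "metricsport", 8080, "TCP port for liveness/grpc metrics requests")
	_ = fs.Parse(os.Args[1:])

	addr, err := listenAddress("127.0.0.1", port)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("would listen on", addr)
}
```

The conversion costs one strconv.Itoa call at startup, and JoinHostPort also takes care of bracketing IPv6 addresses.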

nixpanic · 2020-11-05T10:59:38Z

internal/liveness/liveness.go

@@ -35,10 +35,12 @@ var (
 Name: "liveness",
 Help: "Liveness Probe",
 })
+ csiConn *grpc.ClientConn


I am not convinced making this global is a good thing. Maybe it is better to have a config type that can be used by calling methods on it.

OK. I will look into how to make it better.

nixpanic · 2020-11-05T11:01:48Z

cmd/cephcsi.go

@@ -70,6 +70,8 @@ func init() {
 flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available")
 flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll")
 flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds")
+ flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests")


ports are integers, not strings

Yeah, my previous comments should answer this.

Why do we need one more port? Can't we just add one more endpoint to the metrics port?

OK, this follows the kube liveness probe project; it gives users the option of a separate port. I'm fine if we just want to use a single port for both metrics and health status.

Check: https:/ceph/ceph-csi/pull/1560/files#diff-e3217918d2c8805d4f5446edf4350a9ee17a8616deb3ce92352a83765c730db5R163

nixpanic · 2020-11-05T11:08:36Z

deploy/cephfs/kubernetes/csi-cephfsplugin-provisioner.yaml

@@ -154,7 +154,7 @@ spec:
 - "--metricspath=/metrics"
 - "--healthzport=9681"
 - "--healthzpath=/healthz"
- - "--polltime=60s"
+ - "--polltime=15s"


Just because it "looks long" is not a good reason. Maybe you can look for other projects that do something like this and see what time they default to?

Yes, I should have been clearer before. I was inspired by:

https:/kubernetes-csi/livenessprobe/blob/bba64df584a52d98ddf1904b7f8ec20a2828257c/deployment/kubernetes/livenessprobe-sidecar.yaml#L21

https:/kubernetes-csi/livenessprobe#usage
https:/kubernetes-csi/livenessprobe#command-line-options
https:/kubernetes-csi/livenessprobe/blob/master/deployment/kubernetes/livenessprobe-sidecar.yaml#L21

https:/openshift/csi-livenessprobe#usage
https:/openshift/csi-livenessprobe#other-recognized-arguments
https:/openshift/csi-livenessprobe/blob/master/deployment/kubernetes/livenessprobe-sidecar.yaml#L21

nixpanic · 2020-11-05T11:09:25Z

.commitlintrc.yml

@@ -41,3 +41,4 @@ rules:
 - rebase
 - revert
 - util
+ - liveness


keep these alphabetically sorted please

This makes it easier to reuse the config and the grpc conn across multiple routines in the next patches. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

Added a few more details to the log messages. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

The health status liveness probe shares, and runs within, the liveness-prometheus container. It listens for and serves requests on a dedicated port and path. By default it listens on the '/healthz' path and port '9680'; both are configurable. Fixes: ceph#1096 Signed-off-by: Prasanna Kumar Kalever <[email protected]>

60s polltime looks long, reducing it to 15s Signed-off-by: Prasanna Kumar Kalever <[email protected]>

Improvements to type=liveness in cephcsi should be a different component. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

pkalever · 2020-11-06T10:42:25Z

@nixpanic please take a look. Thanks!

pkalever · 2020-11-06T12:00:16Z

/retest ci/centos/mini-e2e-helm/k8s-1.18

pkalever · 2020-11-06T12:00:31Z

/retest ci/centos/mini-e2e-helm/k8s-1.19

pkalever · 2020-11-06T12:02:37Z

The failure is not related to this PR; the log says:

NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
csi-rbdplugin   1         1         0       1            0           <none>          20m
[Timeout] Failed to get daemonset
script returned exit code 1

Madhu-1

I would suggest keeping the parameters minimal and making use of the parameters that are already available.

Madhu-1 · 2020-11-06T15:27:21Z

cmd/cephcsi.go

@@ -66,7 +66,7 @@ func init() {
 flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas")

 // liveness/grpc metrics related flags
- flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests")
+ flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests")


Won't it cause an upgrade issue if someone just updates the image (which is the case for minor releases)?

Madhu-1 · 2020-11-06T15:27:55Z

cmd/cephcsi.go

@@ -70,6 +70,8 @@ func init() {
 flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available")
 flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll")
 flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds")
+ flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests")


Why do we need one more port? Can't we just add one more endpoint to the metrics port?

Madhu-1 · 2020-11-06T15:30:17Z

deploy/cephfs/kubernetes/csi-cephfsplugin-provisioner.yaml

- - "--polltime=60s"
+ - "--healthzport=9681"
+ - "--healthzpath=/healthz"
+ - "--polltime=15s"


you can always pool the CSIDriver when you get a request to get live status.

Maybe you mean probe? yes, we can always probe check.

As pointed out, the other referenced projects use 2 seconds, and I feel 15 seconds is a good interval, but I will leave it up to the maintainers.

Madhu-1 · 2020-11-06T15:32:49Z

internal/liveness/liveness.go

+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+ util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)


Won't the err value be overwritten when you call w.Write()?

good catch, will fix it.

Madhu-1 · 2020-11-06T15:33:15Z

internal/liveness/liveness.go

+ w.WriteHeader(http.StatusInternalServerError)
+ _, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+
+ util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")


Looks like this can go into a helper function.

Madhu-1 · 2020-11-06T15:40:46Z

internal/liveness/liveness.go

+ ready, err := rpc.Probe(ctx, c.conn)
+ if err != nil {
+ w.WriteHeader(http.StatusInternalServerError)
+ _, err = w.Write([]byte(err.Error()))
+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+ util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
+ return
+ }
+
+ if !ready {
+ w.WriteHeader(http.StatusInternalServerError)
+ _, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+
+ util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
+ return
+ }
+
+ w.WriteHeader(http.StatusOK)
+ _, err = w.Write([]byte(`ok`))
+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+ util.ExtendedLog(ctx, "Healthz req: Health check succeeded")
+}


Suggested change

- ready, err := rpc.Probe(ctx, c.conn)
- if err != nil {
- w.WriteHeader(http.StatusInternalServerError)
- _, err = w.Write([]byte(err.Error()))
- if err != nil {
- util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
- }
- util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
- return
- }
-
- if !ready {
- w.WriteHeader(http.StatusInternalServerError)
- _, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
- if err != nil {
- util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
- }
-
- util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
- return
- }
-
- w.WriteHeader(http.StatusOK)
- _, err = w.Write([]byte(`ok`))
- if err != nil {
- util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
- }
- util.ExtendedLog(ctx, "Healthz req: Health check succeeded")
-}
+ resp := "ok"
+ statuscode := http.StatusOK
+ ready, err := rpc.Probe(ctx, c.conn)
+ if err != nil {
+ resp = err.Error()
+ statuscode = http.StatusInternalServerError
+ } else if !ready {
+ resp = "Healthz req: driver responded but is not ready"
+ statuscode = http.StatusInternalServerError
+ }
+ w.WriteHeader(statuscode)
+ _, err = w.Write([]byte(resp))
+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+}

Madhu-1 · 2020-11-06T15:41:46Z

internal/liveness/liveness.go

 if err != nil {
 liveness.Set(0)
- util.ErrorLogMsg("health check failed: %v", err)
+ util.ErrorLog(ctx, "Metrics req: health check failed: %v", err)


ctx logging is not very helpful here, as the context doesn't contain any of the information we need.

OK. Will it hurt if we have ctx?

it won't hurt, but as we have a separate function for logging without context let's use it.

Madhu-1 · 2020-11-06T15:42:51Z

internal/liveness/liveness.go

+ address := net.JoinHostPort(conf.MetricsIP, conf.HealthzPort)
+ http.HandleFunc(conf.HealthzPath, pc.checkProbe)
+ util.ExtendedLogMsg("Serving Health requests on: http://%s%s", address, conf.HealthzPath)
+ err = http.ListenAndServe(address, nil)


Why are we starting one more server? Is it needed?

Yeah, this is only needed if we want to listen on separate ports.
For a single port and multiple paths like: '/metrics' and '/healthz', we can live with a single server

I would suggest starting one server, not multiple servers.

pkalever

@Madhu-1 Thanks for the review; I will await your opinion.

pkalever · 2020-11-09T09:38:41Z

cmd/cephcsi.go

@@ -66,7 +66,7 @@ func init() {
 flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas")

 // liveness/grpc metrics related flags
- flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests")
+ flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests")


I don't think it will cause any upgrade issues.

pkalever · 2020-11-09T09:40:04Z

cmd/cephcsi.go

@@ -70,6 +70,8 @@ func init() {
 flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available")
 flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll")
 flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds")
+ flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests")


OK, this follows the kube liveness probe project; it gives users the option of a separate port. I'm fine if we just want to use a single port for both metrics and health status.

Check: https:/ceph/ceph-csi/pull/1560/files#diff-e3217918d2c8805d4f5446edf4350a9ee17a8616deb3ce92352a83765c730db5R163

pkalever · 2020-11-09T09:41:32Z

deploy/cephfs/kubernetes/csi-cephfsplugin-provisioner.yaml

- - "--polltime=60s"
+ - "--healthzport=9681"
+ - "--healthzpath=/healthz"
+ - "--polltime=15s"


Maybe you mean probe? yes, we can always probe check.

As pointed out, the other referenced projects use 2 seconds, and I feel 15 seconds is a good interval, but I will leave it up to the maintainers.

pkalever · 2020-11-09T09:42:01Z

internal/liveness/liveness.go

+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+ util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)


good catch, will fix it.

pkalever · 2020-11-09T09:44:06Z

internal/liveness/liveness.go

+ w.WriteHeader(http.StatusInternalServerError)
+ _, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
+ if err != nil {
+ util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
+ }
+
+ util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")


pkalever · 2020-11-09T09:45:12Z

internal/liveness/liveness.go

 if err != nil {
 liveness.Set(0)
- util.ErrorLogMsg("health check failed: %v", err)
+ util.ErrorLog(ctx, "Metrics req: health check failed: %v", err)


OK. Will it hurt if we have ctx?

pkalever · 2020-11-09T09:49:12Z

internal/liveness/liveness.go

+ address := net.JoinHostPort(conf.MetricsIP, conf.HealthzPort)
+ http.HandleFunc(conf.HealthzPath, pc.checkProbe)
+ util.ExtendedLogMsg("Serving Health requests on: http://%s%s", address, conf.HealthzPath)
+ err = http.ListenAndServe(address, nil)


Yeah, this is only needed if we want to listen on separate ports.
For a single port and multiple paths like: '/metrics' and '/healthz', we can live with a single server

Madhu-1 · 2020-11-26T07:16:29Z

One general suggestion would be to just add a new URL to the current liveness server rather than having one more server.

mergify · 2021-05-25T10:42:33Z

This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes? 🙏

github-actions · 2021-09-04T21:05:41Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

github-actions · 2021-09-19T21:05:01Z

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

Prasanna Kumar Kalever added 3 commits November 5, 2020 01:52

liveness: add verbose flag with verbosity 5

c5255c9

The verbose flag is missing for all liveness-prometheus containers; this patch adds verbosity 5 to all liveness containers. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

liveness: fix typo in the config option name

22ec386

s/PoolTimeout/ProbeTimeout/ Signed-off-by: Prasanna Kumar Kalever <[email protected]>

liveness: config option MetricIP should be of type string

63b1e80

MetricIP is consumed as a string, so collecting it as an int doesn't make any sense and would require a type conversion later. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

pkalever mentioned this pull request Nov 4, 2020

Add liveness probe sidecar container #1560

Closed

nixpanic reviewed Nov 5, 2020

Prasanna Kumar Kalever added 5 commits November 6, 2020 14:57

liveness: move config and grpc conn into a new struct

3b8f632

This makes it easier to reuse the config and the grpc conn across multiple routines in the next patches. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

liveness: improve logging details

f2fc69f

Added a few more details to the log messages. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

liveness: set polltime to 15s

3f17608

60s polltime looks long, reducing it to 15s Signed-off-by: Prasanna Kumar Kalever <[email protected]>

ci: add liveness to the list of valid components

5f337ac

Improvements to type=liveness in cephcsi should be a different component. Signed-off-by: Prasanna Kumar Kalever <[email protected]>

pkalever force-pushed the liveness-healthz branch from bdf4f38 to 5f337ac Compare November 6, 2020 10:41

pkalever requested a review from nixpanic November 6, 2020 10:42

Madhu-1 requested changes Nov 6, 2020

pkalever commented Nov 9, 2020

pkalever requested a review from Madhu-1 November 19, 2020 08:20

Base automatically changed from master to devel March 1, 2021 05:22

nixpanic added the component/deployment Helm chart, kubernetes templates and configuration Issues/PRs label Aug 5, 2021

github-actions bot added the stale label Sep 4, 2021

github-actions bot closed this Sep 19, 2021
