liveness: add health status liveness probe sidecar #1665

Closed
wants to merge 8 commits into from

@pkalever pkalever commented Nov 4, 2020

Describe what this PR does

The health status liveness probe shares and runs within the liveness-prometheus container. It listens for and serves requests on a dedicated port and path. By default it listens on the '/healthz' path and port '9680', both of which are configurable.

Useful logs:

[0] pkalever 😎 ceph-csi✨ kubectl logs csi-rbdplugin-5dzrn liveness-prometheus                                   
I1104 20:16:06.923915  212212 cephcsi.go:125] Driver version: canary and Git version: 59aff136b51f38bdbca65bb0782215b7b786f89b
I1104 20:16:06.924212  212212 cephcsi.go:171] Starting driver type: liveness with name: liveness.csi.ceph.com     
I1104 20:16:06.924263  212212 liveness.go:122] Liveness Running                  
I1104 20:16:06.924458  212212 httpserver.go:21] Serving Metrics requests on: http://192.168.39.188:8680/metrics   
I1104 20:16:06.924555  212212 liveness.go:133] Serving Health requests on: http://192.168.39.188:9680/healthz
I1104 20:16:06.924739  212212 connection.go:153] Connecting to unix:///csi/csi.sock                               
I1104 20:16:06.927573  212212 liveness.go:110] CSI driver: "rbd.csi.ceph.com", Endpoint: unix:///csi/csi.sock
I1104 20:16:21.928003  212212 liveness.go:73] Metrics req: Sending probe request to CSI driver: "rbd.csi.ceph.com"
I1104 20:16:21.930418  212212 liveness.go:87] Metrics req: Health check succeeded
[...]
I1104 20:17:21.931447  212212 liveness.go:87] Metrics req: Health check succeeded
I1104 20:17:36.928015  212212 liveness.go:73] Metrics req: Sending probe request to CSI driver: "rbd.csi.ceph.com"
I1104 20:17:36.930653  212212 liveness.go:87] Metrics req: Health check succeeded
I1104 20:17:48.331070  212212 liveness.go:48] Healthz req: Sending probe request to CSI driver "rbd.csi.ceph.com" 
I1104 20:17:48.332878  212212 liveness.go:66] Healthz req: Health check succeeded

Useful commands:

# kubectl exec -it  csi-rbdplugin-5dzrn -c csi-rbdplugin -- curl -X GET http://192.168.39.188:8680/metrics
# kubectl exec -it  csi-rbdplugin-5dzrn -c csi-rbdplugin -- curl -X GET http://192.168.39.188:9680/healthz

Also,
# minikube ssh
# docker ps | grep csi-rbdplugin-
# docker stop <container-ID>

Fixes: #1096

Signed-off-by: Prasanna Kumar Kalever [email protected]

Prasanna Kumar Kalever added 3 commits November 5, 2020 01:52
The verbose flag is missing for all liveness prometheus containers;
this patch adds verbosity 5 to all liveness containers

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
s/PoolTimeout/ProbeTimeout/

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
MetricIP is consumed as a string, hence collecting it as an int doesn't
make any sense; it would require a type conversion later.

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
@@ -66,7 +66,7 @@ func init() {
flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas")

// liveness/grpc metrics related flags
-flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests")
+flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests")
Member

I think this is wrong, it will likely miss any integer-formatting checks. Better keep it an int.

Author

Technically yes, I'm trying to save the conversion from int to string here, as the JoinHostPort() from net package expects port in string form.

Also see:
https:/kubernetes-csi/livenessprobe/blob/bba64df584a52d98ddf1904b7f8ec20a2828257c/cmd/livenessprobe/main.go#L39

Collaborator

Won't it cause an upgrade issue if someone just updates the image (which is the case for minor releases)?

Author

I don't think it will cause any upgrade issues.

Collaborator

can we please check once?
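
For reference, the trade-off in this thread can be sketched as follows. This is a minimal, hypothetical example, not the code in this PR: it keeps the port as an int flag so the `flag` package rejects non-numeric input, and converts it with `strconv.Itoa` only at the point where `net.JoinHostPort` needs a string. The helper name `listenAddress`, the flag-set name and the host value are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net"
	"strconv"
)

// listenAddress joins a host and a numeric port into "host:port".
// net.JoinHostPort takes the port as a string, so the int is converted here.
func listenAddress(host string, port int) (string, error) {
	if port < 1 || port > 65535 {
		return "", fmt.Errorf("port %d is outside the range 1-65535", port)
	}
	return net.JoinHostPort(host, strconv.Itoa(port)), nil
}

func main() {
	// Hypothetical flag set that mirrors the "metricsport" flag from the diff.
	fs := flag.NewFlagSet("liveness", flag.ContinueOnError)
	port := fs.Int("metricsport", 8080, "TCP port for liveness/grpc metrics requests")

	// Fixed arguments keep the sketch deterministic; a value such as
	// "--metricsport=abc" would be rejected by Parse itself.
	if err := fs.Parse([]string{"--metricsport=9680"}); err != nil {
		log.Fatal(err)
	}

	addr, err := listenAddress("0.0.0.0", *port)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("would listen on", addr)
}
```

With an int flag the conversion costs one `strconv.Itoa` call per listener, while a string flag would need its own validation to catch values that are not ports.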

@@ -35,10 +35,12 @@ var (
Name: "liveness",
Help: "Liveness Probe",
})
csiConn *grpc.ClientConn
Member

I am not convinced making this global is a good thing. Maybe it is better to have a config type that can be used by calling methods on it.

Author

OK. I will look into how to make it better.

@@ -70,6 +70,8 @@ func init() {
flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available")
flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll")
flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds")
flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests")
Member

ports are integers, not strings

Author

Yeah, my previous comments should answer this.

Collaborator

Why do we need one more port? Can't we just add one more endpoint on the metrics port?

Author

OK, this is something I followed from the kube liveness probe project; it gives users the option. I'm fine if we just want to use a single port for both metrics and health status.

Check: https:/ceph/ceph-csi/pull/1560/files#diff-e3217918d2c8805d4f5446edf4350a9ee17a8616deb3ce92352a83765c730db5R163

@@ -154,7 +154,7 @@ spec:
- "--metricspath=/metrics"
- "--healthzport=9681"
- "--healthzpath=/healthz"
- "--polltime=60s"
- "--polltime=15s"
Member

Just because it "looks long" is not a good reason. Maybe you can look for other projects that do something like this and see what time they default to?

@@ -41,3 +41,4 @@ rules:
- rebase
- revert
- util
- liveness
Member

keep these alphabetically sorted please

Author

sure.

Prasanna Kumar Kalever added 5 commits November 6, 2020 14:57
This is to improve and simplify the ease of reuse with config and grpc
conn across multiple routines in the next patches

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
Added a few more details to log messages

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
The health status liveness probe shares and runs within the
liveness-prometheus container. It listens for and serves requests on
a dedicated port and path. By default it listens on the '/healthz'
path and port '9680', both of which are configurable.

Fixes: ceph#1096
Signed-off-by: Prasanna Kumar Kalever <[email protected]>
60s polltime looks long, reducing it to 15s

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
Improvements to type=liveness in cephcsi should be a different
component

Signed-off-by: Prasanna Kumar Kalever <[email protected]>
Author

pkalever commented Nov 6, 2020

@nixpanic please take a look. Thanks!

Author

pkalever commented Nov 6, 2020

/retest ci/centos/mini-e2e-helm/k8s-1.18

Author

pkalever commented Nov 6, 2020

/retest ci/centos/mini-e2e-helm/k8s-1.19

Author

pkalever commented Nov 6, 2020

Not a related issue; the output says:

NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
csi-rbdplugin   1         1         0       1            0           <none>          20m
[Timeout] Failed to get daemonset
script returned exit code 1

Collaborator

@Madhu-1 Madhu-1 left a comment

I would suggest keeping the parameters minimal and making use of the available parameters.

- "--polltime=60s"
- "--healthzport=9681"
- "--healthzpath=/healthz"
- "--polltime=15s"
Collaborator

you can always pool the CSIDriver when you get a request to get live status.

Author

Maybe you mean probe? Yes, we can always run a probe check.

As pointed out, other referenced projects use 2 sec, and I feel 15 sec is a good value, but I will leave it up to the maintainers.

if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
Collaborator

Won't the err value be overwritten when you call w.Write()?

Author

good catch, will fix it.

Comment on lines +64 to +70
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}

util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
Collaborator

Looks like this can go into a helper function.

Author

sure

Comment on lines +52 to +80
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte(err.Error()))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
return
}

if !ready {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}

util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
return
}

w.WriteHeader(http.StatusOK)
_, err = w.Write([]byte(`ok`))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ExtendedLog(ctx, "Healthz req: Health check succeeded")
}
Collaborator

Suggested change
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte(err.Error()))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err)
return
}
if !ready {
w.WriteHeader(http.StatusInternalServerError)
_, err = w.Write([]byte("Healthz req: driver responded but is not ready"))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ErrorLog(ctx, "Healthz req: driver responded but is not ready")
return
}
w.WriteHeader(http.StatusOK)
_, err = w.Write([]byte(`ok`))
if err != nil {
util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}
util.ExtendedLog(ctx, "Healthz req: Health check succeeded")
}
resp := "ok"
statuscode := http.StatusOK
ready, err := rpc.Probe(ctx, c.conn)
if err != nil {
	resp = err.Error()
	statuscode = http.StatusInternalServerError
} else if !ready {
	// plain assignment: ":=" here would shadow resp and the response would stay "ok"
	resp = "Healthz req: driver responded but is not ready"
	statuscode = http.StatusInternalServerError
}
w.WriteHeader(statuscode)
_, err = w.Write([]byte(resp))
if err != nil {
	util.ErrorLog(ctx, "Healthz req: write failed: %v", err)
}

if err != nil {
liveness.Set(0)
-util.ErrorLogMsg("health check failed: %v", err)
+util.ErrorLog(ctx, "Metrics req: health check failed: %v", err)
Collaborator

ctx logging is not very helpful here, as the context doesn't contain any of the information we need.

Author

OK. Will it hurt if we have ctx?

Collaborator

it won't hurt, but as we have a separate function for logging without context let's use it.

Comment on lines +149 to +152
address := net.JoinHostPort(conf.MetricsIP, conf.HealthzPort)
http.HandleFunc(conf.HealthzPath, pc.checkProbe)
util.ExtendedLogMsg("Serving Health requests on: http://%s%s", address, conf.HealthzPath)
err = http.ListenAndServe(address, nil)
Collaborator

Why are we starting one more server? Is it needed?

Author

Yeah, this is only needed if we want to listen on separate ports.
For a single port and multiple paths like: '/metrics' and '/healthz', we can live with a single server

Collaborator

I would suggest starting one server, not multiple servers.

Author

@pkalever pkalever left a comment

@Madhu-1 Thanks for the review, I will await your opinion.

Collaborator

Madhu-1 commented Nov 26, 2020

One general suggestion would be to just add a new URL to the current liveness server rather than running one more server.

Base automatically changed from master to devel March 1, 2021 05:22
Contributor

mergify bot commented May 25, 2021

This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes? 🙏

@nixpanic nixpanic added the component/deployment Helm chart, kubernetes templates and configuration Issues/PRs label Aug 5, 2021

github-actions bot commented Sep 4, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 4, 2021
@github-actions

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

@github-actions github-actions bot closed this Sep 19, 2021
Development

Successfully merging this pull request may close these issues.

Add liveness sidecar to ceph csi drivers
3 participants