-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
liveness: add health status liveness probe sidecar #1665
Conversation
verbose flag is missing for all liveness prometheus containers, this patch adds verbosity 5 to all liveness conatiners Signed-off-by: Prasanna Kumar Kalever <[email protected]>
s/PoolTimeout/ProbeTimeout/ Signed-off-by: Prasanna Kumar Kalever <[email protected]>
MetricIP is consumed as string, hence collecting it as int doesn't make anysense, which will involve type converstion later. Signed-off-by: Prasanna Kumar Kalever <[email protected]>
@@ -66,7 +66,7 @@ func init() { | |||
flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas") | |||
|
|||
// liveness/grpc metrics related flags | |||
flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests") | |||
flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is wrong, it will likely miss any integer-formatting checks. Better keep it an int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically yes, I'm trying to save the conversion from int to string here, as the JoinHostPort() from net package expects port in string form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't it cause an upgrade issue? if someone just updated the image (which is the case for minor releases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it will cause any upgrade issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we please check once?
internal/liveness/liveness.go
Outdated
@@ -35,10 +35,12 @@ var ( | |||
Name: "liveness", | |||
Help: "Liveness Probe", | |||
}) | |||
csiConn *grpc.ClientConn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced making this global is a good thing. Maybe it is better to have a config type that can be used by calling methods on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will look, how to make it better.
@@ -70,6 +70,8 @@ func init() { | |||
flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available") | |||
flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll") | |||
flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds") | |||
flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ports are integers, not strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my previous comments should answer this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need one more port? cant we just add one more endpoint to the metrics port?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this is something I followed from the kube liveness probe project, its like giving a provision for users. I'm fine if we just want to use a single port for both metrics and health status.
@@ -154,7 +154,7 @@ spec: | |||
- "--metricspath=/metrics" | |||
- "--healthzport=9681" | |||
- "--healthzpath=/healthz" | |||
- "--polltime=60s" | |||
- "--polltime=15s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because it "looks long" is not a good reason. Maybe you can look for other projects that do something like this and see what time they default to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I should have been clear before, got inspired by:
- https:/kubernetes-csi/livenessprobe/blob/bba64df584a52d98ddf1904b7f8ec20a2828257c/deployment/kubernetes/livenessprobe-sidecar.yaml#L21
- https:/kubernetes-csi/livenessprobe#usage
https:/kubernetes-csi/livenessprobe#command-line-options
https:/kubernetes-csi/livenessprobe/blob/master/deployment/kubernetes/livenessprobe-sidecar.yaml#L21 - https:/openshift/csi-livenessprobe#usage
https:/openshift/csi-livenessprobe#other-recognized-arguments
https:/openshift/csi-livenessprobe/blob/master/deployment/kubernetes/livenessprobe-sidecar.yaml#L21
.commitlintrc.yml
Outdated
@@ -41,3 +41,4 @@ rules: | |||
- rebase | |||
- revert | |||
- util | |||
- liveness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep these alphabetically sorted please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
This is to improve and simplify the ease of reuse with config and grpc conn across multiple routines in the next patches Signed-off-by: Prasanna Kumar Kalever <[email protected]>
Added few more details to log Msgs Signed-off-by: Prasanna Kumar Kalever <[email protected]>
The health status liveness probe shares and runs within the liveness-prometheus container. The health status liveness probe listen and serve requests at a dedicated port and path. By default they listen at '/healthz' path and '9680' port, which can be easily configurable. Fixes: ceph#1096 Signed-off-by: Prasanna Kumar Kalever <[email protected]>
60s polltime looks long, reducing it to 15s Signed-off-by: Prasanna Kumar Kalever <[email protected]>
Improvements to type=liveness at cephcsi should be different component Signed-off-by: Prasanna Kumar Kalever <[email protected]>
bdf4f38
to
5f337ac
Compare
@nixpanic please take a look. Thanks! |
/retest ci/centos/mini-e2e-helm/k8s-1.18 |
/retest ci/centos/mini-e2e-helm/k8s-1.19 |
Not a related issue, says :
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to keep the parameters very minimal and make use of available parameters
@@ -66,7 +66,7 @@ func init() { | |||
flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas") | |||
|
|||
// liveness/grpc metrics related flags | |||
flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests") | |||
flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't it cause an upgrade issue? if someone just updated the image (which is the case for minor releases)
@@ -70,6 +70,8 @@ func init() { | |||
flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available") | |||
flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll") | |||
flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds") | |||
flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need one more port? cant we just add one more endpoint to the metrics port?
- "--polltime=60s" | ||
- "--healthzport=9681" | ||
- "--healthzpath=/healthz" | ||
- "--polltime=15s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can always pool the CSIDriver when you get a request to get live status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you mean probe? yes, we can always probe check.
As pointed other referenced projects are using 2sec and I feel 15 sec is a good time, but I will leave it up to the maintainers.
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't err
value overwritten when you call w.Write()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will fix it.
w.WriteHeader(http.StatusInternalServerError) | ||
_, err = w.Write([]byte("Healthz req: driver responded but is not ready")) | ||
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
|
||
util.ErrorLog(ctx, "Healthz req: driver responded but is not ready") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this can go to helper function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
ready, err := rpc.Probe(ctx, c.conn) | ||
if err != nil { | ||
w.WriteHeader(http.StatusInternalServerError) | ||
_, err = w.Write([]byte(err.Error())) | ||
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err) | ||
return | ||
} | ||
|
||
if !ready { | ||
w.WriteHeader(http.StatusInternalServerError) | ||
_, err = w.Write([]byte("Healthz req: driver responded but is not ready")) | ||
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
|
||
util.ErrorLog(ctx, "Healthz req: driver responded but is not ready") | ||
return | ||
} | ||
|
||
w.WriteHeader(http.StatusOK) | ||
_, err = w.Write([]byte(`ok`)) | ||
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
util.ExtendedLog(ctx, "Healthz req: Health check succeeded") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready, err := rpc.Probe(ctx, c.conn) | |
if err != nil { | |
w.WriteHeader(http.StatusInternalServerError) | |
_, err = w.Write([]byte(err.Error())) | |
if err != nil { | |
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | |
} | |
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err) | |
return | |
} | |
if !ready { | |
w.WriteHeader(http.StatusInternalServerError) | |
_, err = w.Write([]byte("Healthz req: driver responded but is not ready")) | |
if err != nil { | |
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | |
} | |
util.ErrorLog(ctx, "Healthz req: driver responded but is not ready") | |
return | |
} | |
w.WriteHeader(http.StatusOK) | |
_, err = w.Write([]byte(`ok`)) | |
if err != nil { | |
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | |
} | |
util.ExtendedLog(ctx, "Healthz req: Health check succeeded") | |
} | |
resp:="ok" | |
statuscode=http.StatusOK | |
ready, err := rpc.Probe(ctx, c.conn) | |
if err!=nil{ | |
resp=err.Error() | |
statuscode=http.StatusInternalServerError | |
}else{ | |
if !ready{ | |
resp:="Healthz req: driver responded but is not ready" | |
statuscode=http.StatusInternalServerError | |
} | |
} | |
w.WriteHeader(statuscode) | |
_, err = w.Write([]byte(resp)) | |
if err != nil { | |
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | |
} |
if err != nil { | ||
liveness.Set(0) | ||
util.ErrorLogMsg("health check failed: %v", err) | ||
util.ErrorLog(ctx, "Metrics req: health check failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx logging is not much helpful as it doesn't contain any information on what we need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Will, it hurt if we have ctx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it won't hurt, but as we have a separate function for logging without context let's use it.
address := net.JoinHostPort(conf.MetricsIP, conf.HealthzPort) | ||
http.HandleFunc(conf.HealthzPath, pc.checkProbe) | ||
util.ExtendedLogMsg("Serving Health requests on: http://%s%s", address, conf.HealthzPath) | ||
err = http.ListenAndServe(address, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we are starting one more server. is it needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is only needed if we want to listen on separate ports.
For a single port and multiple paths like: '/metrics' and '/healthz', we can live with a single server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to start one server not multiple servers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Madhu-1 Thanks for the review, will await for your opinion.
@@ -66,7 +66,7 @@ func init() { | |||
flag.BoolVar(&conf.ForceKernelCephFS, "forcecephkernelclient", false, "enable Ceph Kernel clients on kernel < 4.17 which support quotas") | |||
|
|||
// liveness/grpc metrics related flags | |||
flag.IntVar(&conf.MetricsPort, "metricsport", 8080, "TCP port for liveness/grpc metrics requests") | |||
flag.StringVar(&conf.MetricsPort, "metricsport", "8080", "TCP port for liveness/grpc metrics requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it will cause any upgrade issues.
@@ -70,6 +70,8 @@ func init() { | |||
flag.StringVar(&conf.MetricsPath, "metricspath", "/metrics", "path of prometheus endpoint where metrics will be available") | |||
flag.DurationVar(&conf.PollTime, "polltime", time.Second*pollTime, "time interval in seconds between each poll") | |||
flag.DurationVar(&conf.ProbeTimeout, "timeout", time.Second*probeTimeout, "probe timeout in seconds") | |||
flag.StringVar(&conf.HealthzPort, "healthzport", "9808", "TCP ports for listening healthz requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this is something I followed from the kube liveness probe project, its like giving a provision for users. I'm fine if we just want to use a single port for both metrics and health status.
- "--polltime=60s" | ||
- "--healthzport=9681" | ||
- "--healthzpath=/healthz" | ||
- "--polltime=15s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you mean probe? yes, we can always probe check.
As pointed other referenced projects are using 2sec and I feel 15 sec is a good time, but I will leave it up to the maintainers.
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
util.ErrorLog(ctx, "Healthz req: health check failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will fix it.
w.WriteHeader(http.StatusInternalServerError) | ||
_, err = w.Write([]byte("Healthz req: driver responded but is not ready")) | ||
if err != nil { | ||
util.ErrorLog(ctx, "Healthz req: write failed: %v", err) | ||
} | ||
|
||
util.ErrorLog(ctx, "Healthz req: driver responded but is not ready") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
if err != nil { | ||
liveness.Set(0) | ||
util.ErrorLogMsg("health check failed: %v", err) | ||
util.ErrorLog(ctx, "Metrics req: health check failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Will, it hurt if we have ctx?
address := net.JoinHostPort(conf.MetricsIP, conf.HealthzPort) | ||
http.HandleFunc(conf.HealthzPath, pc.checkProbe) | ||
util.ExtendedLogMsg("Serving Health requests on: http://%s%s", address, conf.HealthzPath) | ||
err = http.ListenAndServe(address, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is only needed if we want to listen on separate ports.
For a single port and multiple paths like: '/metrics' and '/healthz', we can live with a single server
one general suggestion would be to just add a new URL to the current liveness server not to have one more server |
This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes? 🙏 |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions. |
This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required. |
Describe what this PR does
The health status liveness probe shares and runs within the liveness-prometheus container. The health status liveness probe listen and serve requests at a dedicated port and path. By default they listen at '/healthz' path and '9680' port, which can be easily configurable.
Useful logs:
useful commands:
Fixes: #1096
Signed-off-by: Prasanna Kumar Kalever [email protected]