Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PWX-38118: Health Checks #1646

Open
wants to merge 1 commit into
base: release-24.2.0
Choose a base branch
from

Conversation

lpabon
Copy link
Member

@lpabon lpabon commented Aug 15, 2024

What this PR does / why we need it:
Adds environment health checks when a new StorageCluster is installed.

Which issue(s) this PR fixes (optional)
Closes #

Special notes for your reviewer:
There are two major sections in this PR:

  1. The health check framework is under px/px-health-check. This framework will be removed from this repo and moved to its own repo at a later date so that other programs can include it easily. You may ignore it for the review if you want.
  2. The health checks and integration with the Operator. When the reconciler starts, it will check if the StorageCluster object has the portworx.io/health-check annotation:
  • if the annotation is set to skip, then the Health Checks will not run.
  • if the annotation is missing, then the Health Checks will execute. It will then set the annotation to false if it failed, or passed if it worked. If it fails the test, it will also return failure disallowing PxE to be installed.
  • if the annotation is set to false, it will again return an error disallowing PxE to be installed.
  • if the annotation is set to passed, it will not run.

Ignore the test coverage of px/px-health-check since that will be moved to a new repo after.

@lpabon
Copy link
Member Author

lpabon commented Aug 15, 2024

Testing

Positive Testing

  1. I used the file pxe.yml attached to make sure it can install without error.
  2. Also asserted that the annotation portworx.io/health-check is set to passed.

Negative Testing

  1. Remove and uninstalled the previous installation
  2. I have a k8s with three nodes and a control node. I labeled one of the nodes with px/enabled=false.
  3. I used the file pxe-affinity.yml to create a cluster.
  4. Assert that PxE is not installed
  5. Assert that describing the StorageCluster provides information about the failure and provides a link to the documentation: https://docs.portworx.com/codes/pxe/1004

@lpabon
Copy link
Member Author

lpabon commented Aug 15, 2024

test-storageclusters.zip
These are the test yamls

Copy link

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 69.03226% with 192 lines in your changes missing coverage. Please review.

Project coverage is 79.51%. Comparing base (30ed9ab) to head (d29910f).

Files Patch % Lines
px/px-health-check/pkg/healthcheck/healthcheck.go 56.66% 50 Missing and 2 partials ⚠️
pkg/healthcheck/precheck.go 28.57% 20 Missing ⚠️
pkg/util/util.go 0.00% 19 Missing ⚠️
px/px-health-check/pkg/healthcheck/observer.go 0.00% 18 Missing ⚠️
pkg/healthcheck/checks/kubernetes/gather.go 30.43% 15 Missing and 1 partial ⚠️
pkg/controller/storagecluster/storagecluster.go 82.05% 10 Missing and 4 partials ⚠️
px/px-health-check/pkg/healthcheck/reporter.go 83.82% 11 Missing ⚠️
pkg/healthcheck/checks/kubernetes/memory.go 79.16% 5 Missing and 5 partials ⚠️
px/px-health-check/pkg/healthcheck/errorstypes.go 18.18% 9 Missing ⚠️
px/px-health-check/pkg/healthcheck/state.go 66.66% 6 Missing and 3 partials ⚠️
... and 4 more
Additional details and impacted files
@@                Coverage Diff                 @@
##           release-24.2.0    #1646      +/-   ##
==================================================
- Coverage           79.79%   79.51%   -0.29%     
==================================================
  Files                  61       76      +15     
  Lines               18831    19436     +605     
==================================================
+ Hits                15026    15454     +428     
- Misses               2715     2871     +156     
- Partials             1090     1111      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lpabon lpabon marked this pull request as draft August 16, 2024 03:01
@lpabon lpabon force-pushed the eng/lpabon/healthcheck-pr-plus-framework2 branch 2 times, most recently from 6867357 to 4ed6556 Compare August 19, 2024 03:57
@lpabon lpabon marked this pull request as ready for review August 19, 2024 03:57
@lpabon lpabon force-pushed the eng/lpabon/healthcheck-pr-plus-framework2 branch from 4ed6556 to 88399d3 Compare August 19, 2024 04:00
// If not set, then run the health checks
check := cluster.Annotations[pxutil.AnnotationHealthCheck]
check = strings.TrimSpace(strings.ToLower(check))
if check == "skip" || check == "passed" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.Equalfold can be used here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can values be a constant and this be part of health check framework

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.Equalfold can be used here Didn't know about this call! Nice catch

cluster.Name,
pxutil.AnnotationHealthCheck,
)
if check == "failed" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:
if healthCheck fail once, should we always keep failing ?
Additionally, asking the user to remove the annotation would be a manual intervention and would not fall in line with Helm chart approaches.
Also, the instructions of removing the annotation , would it be documented ? or is it expected that the user goes through the logs or would it be part of the STC status ?

Events will be cleaned up after few mins

// 1. On first install, run HC in same context
// 2. Fail if HC fails
// 3. An annotation is created to save the results of the health check
if err := c.preInstallHealthChecks(cluster); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SyncStorageCluster is a function which is called as part of the reconciler , adding healthChecks as part of a reconcile loop would add to the latency.
Would it be possible to decouple this ?

I believe pre-checks should be done before the controllers are started. Please correct me if my understanding is wrong

@lpabon lpabon force-pushed the eng/lpabon/healthcheck-pr-plus-framework2 branch 2 times, most recently from 17a31c9 to 77dc4aa Compare August 21, 2024 16:01
@lpabon lpabon force-pushed the eng/lpabon/healthcheck-pr-plus-framework2 branch from ada7cf0 to d29910f Compare August 22, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants