Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gateway preStop hook #1200

Conversation

norbjd
Copy link
Contributor

@norbjd norbjd commented Feb 1, 2024

Changes

  • 🐛 fix gateway preStop hook

/kind bug

While investigating to understand the issue #1118, I've noticed that the current preStop hook for the gateway (envoy) did not work at all. The reason is that curl is not installed in Envoy's default image (it might have been here at some point, but not anymore). As a result, the call to /healthcheck/fail, to make Envoy fail health checks before shutting down, did not work.

To replace curl, there were many solutions I've thought of:

  1. build a dedicated Envoy image with curl in it: too overkill for our needs (I don't think we want to maintain a separate Envoy image just for a hook). The ideal solution would have been to have curl installed in the envoy official image, but we can't expect that
  2. install a static binary of curl in an init container and copy it to envoy container using a volume: this looks way too magic. The reason we need a static binary is we can't just copy the curl binary from apt install -y curl because this needs some other dependencies
  3. same as 2, but don't use curl, e.g. install nc or socat (or other tools I'm not aware of) to be able to communicate with the /tmp/envoy.admin socket
  4. do an HTTP request to the admin endpoint with only shell tools

I've chosen 4 because it seemed to be the "least" impactful change (having an init container to install a binary used only in the hook is... meh). To communicate with Envoy admin interface using default shell tools, I had to "open" the route to /healthcheck/fail endpoint. We need to pass through the HTTP endpoint because we can't write directly to the socket.

It was not really easy to see this error, as no FailedPreStopHook event were sent, since the last command of the hook (sleep 15) did succeed. This is why I've also added a set -e command in the hook: if one command fails in the script, it will return an error, and throw a FailedPreStopHook.

Alternative

An other alternative (simpler) is to only keep the pause (sleep 15) in the preStop hook, like in this tutorial. But I'm not sure that in this case, when the pod is in Terminating state, no traffic will be redirected to it. Let's discuss in the comments if you feel this is a better solution.

Release Note

Fix gateway preStop hook

@knative-prow knative-prow bot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 1, 2024
Copy link

knative-prow bot commented Feb 1, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbjd
Once this PR has been reviewed and has the lgtm label, please assign krsna-m for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 1, 2024
Copy link

codecov bot commented Feb 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d7c88ff) 80.98% compared to head (6f5e4da) 80.98%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1200   +/-   ##
=======================================
  Coverage   80.98%   80.98%           
=======================================
  Files          18       18           
  Lines        1499     1499           
=======================================
  Hits         1214     1214           
  Misses        223      223           
  Partials       62       62           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@norbjd
Copy link
Contributor Author

norbjd commented Feb 1, 2024

/retest

@ReToCode
Copy link
Member

ReToCode commented Feb 2, 2024

I think I'd prefer the approach that @skonto proposed here: #1118 (comment). Using the health-checks as is is not really doing a proper graceful shutdown and draining existing connections.

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2024
@norbjd
Copy link
Contributor Author

norbjd commented Feb 2, 2024

Using the health-checks as is is not really doing a proper graceful shutdown

Yep, I completely agree, but this PR was more about fixing the existing behavior (hook is not doing what it is supposed to do right now, because curl is not installed) rather than adding the drain feature.

However, if you think this PR is an useless step and we should go directly to the target (rewrite that hook and correctly handle draining) then that's fine, just tell me, and I'll close the PR and directly work on draining 👍 I've started something here: #1118 (comment), but I need to do more testing.

@norbjd
Copy link
Contributor Author

norbjd commented Feb 2, 2024

Closing, superseded by #1203.

@norbjd norbjd closed this Feb 2, 2024
@norbjd norbjd deleted the fix-gateway-prestop-hook-curl-not-found branch April 7, 2024 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants