
Harden shutdown logic #1037

Merged (10 commits merged into ipfs:master on Apr 21, 2015)
Conversation

@torarnv (Contributor) commented Apr 8, 2015

Fixes from my quest to make the daemon shut down cleanly on Ctrl+C. Having the daemon get stuck, especially without any logging/info telling the user why it's stuck, gives a bad impression to new users, as it's not evident to them that pressing Ctrl+C a second time won't interrupt an important task and corrupt something.

Spawned from ipfs/ipfs-webui#28

@jbenet added the backlog label and removed the ready label on Apr 8, 2015
@@ -509,8 +514,10 @@ func (i *cmdInvocation) setupInterruptHandler() {
case 0:
fmt.Println(shutdownMessage)
go func() {
wg.Add(1)
Member

this is racy-- the Add should go in the same goroutine that the wg gets initialized in.

Member

yeah, the add always needs to happen before a goroutine is spun off
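
For illustration, a minimal sketch of the pattern being asked for (names here are placeholders, not the actual handler code): the Add has to run before the goroutine is spawned, so a concurrent Wait can never observe a zero counter while work is still pending.

var wg sync.WaitGroup

wg.Add(1) // account for the work before spawning the goroutine
go func() {
    defer wg.Done()
    handleShutdown() // placeholder for the actual cleanup work
}()

wg.Wait() // guaranteed to wait for the goroutine above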

@jbenet (Member) commented Apr 8, 2015

@torarnv this is looking good!

on the wait group, maybe something like:

type IntrCloser struct {
  c  chan os.Signal
  wg sync.WaitGroup
}

func NewIntrCloser() *IntrCloser {
  ic := &IntrCloser{}
  ic.c = allInterruptSignals()
  ic.wg.Add(1)
  return ic
}

func (ic *IntrCloser) Done() {
  ic.wg.Done()
}

func (ic *IntrCloser) Close() error {
  signal.Stop(ic.c)
  ic.wg.Wait()
  return nil
}


func (i *cmdInvocation) setupInterruptHandler() *IntrCloser {

  intrc := NewIntrCloser()

  ...
  // make sure to call intrc.Done()
  ...

  return intrc
}

@jbenet (Member) commented Apr 8, 2015

can actually just replace the waitgroup with a channel:

type IntrCloser struct {
  c  chan os.Signal
  done chan struct{}
}

func NewIntrCloser() *IntrCloser {
  ic := &IntrCloser{}
  ic.c = allInterruptSignals()
  ic.done = make(chan struct{})
  return ic
}

func (ic *IntrCloser) Done() {
  ic.done <- struct{}{}
}

func (ic *IntrCloser) Close() error {
  signal.Stop(ic.c)
  <-ic.done
  return nil
}


func (i *cmdInvocation) setupInterruptHandler() *IntrCloser {

  intrc := NewIntrCloser()

  ...
  // make sure to call intrc.Done()
  ...

  return intrc
}

@jbenet (Member) commented Apr 8, 2015

@torarnv done with CR. gave suggestion on how to deal with the waitgroup problem, but no need to do it that way.

@jbenet (Member) commented Apr 12, 2015

hey @torarnv! i'd love to merge this in soon. any chance you could handle the CR comments? else i can do it.

@torarnv (Contributor, Author) commented Apr 13, 2015

@jbenet Hey! Sorry, had some stuff come up. I'll look at this today!

@whyrusleeping added the status/in-progress label and removed the backlog label on Apr 13, 2015
@jbenet (Member) commented Apr 13, 2015

@torarnv will want to rebase-- i think diverged

@torarnv (Contributor, Author) commented Apr 14, 2015

#1081 got in there, thought github might detect the dependency?

@jbenet (Member) commented Apr 14, 2015

@torarnv when we merge the other one, the merged commits will disappear from this listing

@torarnv (Contributor, Author) commented Apr 17, 2015

Need to double-check the tests, but general feedback welcome!

@torarnv force-pushed the harden-shutdown-logic branch 2 times, most recently from a04cad4 to ef409c0, on April 17, 2015
@jbenet (Member) commented Apr 17, 2015

btw @torarnv one thing to do is run make test_all_commits locally.

(every commit has to pass so that we can rebase easily in the future-- it makes it really annoying if we do not follow this practice.)

@jbenet (Member) commented Apr 17, 2015

(will CR later tonight)

@torarnv (Contributor, Author) commented Apr 18, 2015

@jbenet all tests pass locally for each commit in the PR 🎆

select {
case <-ctx.Context.Done():
log.Info("Gracefully shut down daemon")
default:
Member

what is the purpose of the default here? Wouldn't we want to wait and print the log message?

torarnv (Contributor, Author)

That's based on the possibility that the node/daemon may shut down through other means, eg through an API call or similar that wouldn't cancel the context, only close the node. In that case we wouldn't want to block. But if that's not really an option, or we do want to go through the context for those shutdowns as well, it shouldn't have a default.

Member

gotcha, thanks!
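
To illustrate the distinction discussed above (a sketch, not the PR's exact code): with the default case the select is non-blocking and only logs if the context has already been cancelled; dropping the default would make the command block until cancellation.

// Non-blocking: log only if the context was already cancelled, e.g.
// because the node was shut down through some other path.
select {
case <-ctx.Context.Done():
    log.Info("Gracefully shut down daemon")
default:
}

// Blocking alternative: wait for the cancellation before logging.
// <-ctx.Context.Done()
// log.Info("Gracefully shut down daemon")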

@whyrusleeping (Member)

If I could get a couple comments on the new struct and functions you added, that would be awesome. Otherwise this all looks good to me.

@torarnv (Contributor, Author) commented Apr 19, 2015

You mean comments from me? Or you want to comment on them? 😃

@@ -475,59 +469,62 @@ func writeHeapProfileToFile() error {
return pprof.WriteHeapProfile(mprof)
}

// listen for and handle SIGTERM
func (i *cmdInvocation) setupInterruptHandler() {
type IntrHandler struct {
Member

comment on this struct
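
For example, a GoDoc comment along these lines (wording illustrative) would address the request:

// IntrHandler helps set up an interrupt (e.g. SIGINT/SIGTERM) handler
// whose teardown can be coordinated: Close stops listening for signals
// and waits for any in-flight handling to finish before returning.
type IntrHandler struct {
  ...
}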

@whyrusleeping (Member)

Pointed out the two i'd like commented; your commit messages are far nicer than mine, thank you for that :). I mostly like having the comments on structs and functions to make it easy to grok the code through GoDoc.

@jbenet (Member) commented Apr 20, 2015

Will be great to have this in.

@jbenet (Member) commented Apr 20, 2015

(this will probably need rebasing on master, sorry. happy to rebase it for you if desired)

The pull request's 10 commits carry the following messages:

This reverts commit f74e71f.

The 'Online' flag of the command context does not seem to be set in
any code paths, at least not when running commands such as 'ipfs daemon'
or 'ipfs ping'. The result after f74e71f is that we never shut down
cleanly, as we'll always os.Exit(0) from the interrupt handler.

The os.Exit(0) itself is also dubious, as conceptually the interrupt
handler should ask whatever is stalling to stop stalling, so that
main() can return like normal. Exiting with -1 in error cases where
the interrupt handler is unable to stop the stall is fine, but the
normal case of interrupting cleanly should exit through main().
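
In rough outline (a simplified sketch; the real handler tracks more state than this), the first interrupt asks for a graceful stop and lets main() return on its own, while a repeated interrupt gives up with a non-zero exit:

count := 0
for range sigChan { // sigChan delivers the interrupt signals (placeholder)
    switch count {
    case 0:
        fmt.Println(shutdownMessage)
        cancel() // cancel the root context; main() then returns normally
    default:
        fmt.Println("Received another interrupt before graceful shutdown, terminating...")
        os.Exit(-1)
    }
    count++
}
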
The server may stay alive for quite a while due to waiting on
open connections to close before shutting down. We should
find ways to terminate these connections in a more controlled
manner, but in the meantime it's helpful to be able to see
why a shutdown of the ipfs daemon is taking so long.

When closing a node, the node itself only takes care of tearing down
its own children. As corehttp sets up a server based on a node, it
needs to also ensure that the server is accounted for when determining
if the node has been fully closed.
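
Conceptually (a generic sketch, not the actual corehttp wiring), the node's close path has to block on the server's shutdown as well:

serverDone := make(chan struct{})

go func() {
    defer close(serverDone)
    serveHTTP(listener) // placeholder; returns once the server has shut down
}()

// Hooked into the node's close path (hypothetical hook; the real code
// registers this with the node's own teardown machinery):
stopServer()  // ask the graceful server to stop accepting connections
<-serverDone  // the node is not considered closed until the server is done
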
Once the server is asked to shut down, we stop accepting new
connections, but the 'manners' graceful shutdown will wait for
all existing connections to close before finishing.

For keep-alive connections this will never happen unless the
client detects that the server is shutting down through the
ipfs API itself, and closes the connection in response.

This is a problem e.g. with the webui's connections visualization,
which polls the swarm/peers endpoint once a second, and never
detects that the API server was shut down.

We can mitigate this by telling the server to disable keep-alive,
which will add a 'Connection: close' header to the next HTTP
response on the connection. A well-behaved client should then
treat that correspondingly by closing the connection.

Unfortunately this doesn't happen immediately in all cases,
presumably depending on the keep-alive timeout of the browser
that set up the connection, but it's at least a step in the
right direction.
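
The standard library exposes this knob directly; a minimal sketch, where srv stands in for the *http.Server wrapped by the graceful server:

// During shutdown: stop advertising keep-alive, so the next response on
// each open connection carries "Connection: close" and a well-behaved
// client drops the connection.
srv.SetKeepAlivesEnabled(false)
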
If a command invocation such as 'daemon' is interrupted, the interrupt
handler asks the node to close. The closing of the node will result in
the command invocation finishing, and possibly returning from main()
before the interrupt handler is done. In particular, the info logging
that a graceful shutdown was completed may never reach stdout.

As the whole point of logging "Gracefully shut down." is to give
confidence when debugging that the shutdown was clean, this is
slightly unfortunate.

The interrupt handler needs to be set up in main() instead of Run(),
so that we can defer the closing of the interrupt handler until just
before returning from main, not when Run() returns with a streaming
result reader.
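
In outline (names follow the earlier sketches, not the final code), main() owns the handler and its deferred Close only runs once the whole invocation, streaming output included, has finished:

func main() {
    // os.Exit skips deferred calls, so the real work lives in a helper
    // whose defers run before the process exits.
    os.Exit(mainRet())
}

func mainRet() int {
    intrh := setupInterruptHandler() // hypothetical constructor, as above
    defer intrh.Close()              // runs after the invocation finishes

    // ... parse the request, call Run(), drain the output reader ...
    return 0
}
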
The context passed on from main() may change before we hit callCommand,
so setting it in Parse is a bit optimistic.

The context may be cancelled while a request is in flight. We need to
handle this and cancel the request. The code is based on the ideas
from https://blog.golang.org/context
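
Roughly the pattern from that article (a sketch only; today's net/http can attach the context to the request directly, which this code predates):

func httpDo(ctx context.Context, client *http.Client, tr *http.Transport,
    req *http.Request, f func(*http.Response, error) error) error {

    // Run the request in a goroutine; if ctx is cancelled first, abort
    // the in-flight request and wait for f to observe the failure.
    c := make(chan error, 1)
    go func() { c <- f(client.Do(req)) }()

    select {
    case <-ctx.Done():
        tr.CancelRequest(req)
        <-c
        return ctx.Err()
    case err := <-c:
        return err
    }
}
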
When the response includes the X-Chunked-Output header, we treat that
as channel output, and fire up a goroutine to decode the chunks. This
routine needs to look for context cancellation so that it can exit
cleanly.
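
Schematically (a sketch, not the actual decoding code), the goroutine stops as soon as the context is cancelled instead of blocking forever on a full output channel:

go func() {
    defer close(out) // out carries the decoded values (placeholder)
    for {
        var v interface{}
        if err := dec.Decode(&v); err != nil { // dec reads the chunked body
            return
        }
        select {
        case out <- v:
        case <-ctx.Done():
            return // caller gave up; exit cleanly
        }
    }
}()
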
Instead of assuming the command is the daemon command and closing
the node, which resulted in bugs like ipfs#1053, we cancel the context
and let the context children detect the cancellation and gracefully
clean up after themselves.

The shutdown logging has been moved into the daemon command, where
it makes more sense, so that commands like ping will not print out
the same output on cancellation.

Instead of just terminating right there and then, we cancel the
context, and let the daemon exit cleanly. This may take a few
seconds, as the node builder and its child processes do not
care too much about the context state while building nodes,
but this can be improved by injecting checks for ctx.Done()
before time-consuming steps.
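
The kind of check being suggested, schematically (placement and names are illustrative):

// Before each time-consuming build step, bail out early if the daemon
// has already been asked to shut down.
select {
case <-ctx.Done():
    return nil, ctx.Err()
default:
}
// ... proceed with the expensive step, e.g. opening the repo ...
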
@torarnv (Contributor, Author) commented Apr 20, 2015

Added comments and rebased.

@jbenet added the sprint label and removed the status/in-progress label on Apr 20, 2015
@jbenet self-assigned this on Apr 20, 2015
@jbenet mentioned this pull request on Apr 20, 2015
@jbenet added a commit that referenced this pull request on Apr 21, 2015
@jbenet merged commit db56c0f into ipfs:master on Apr 21, 2015
@jbenet added and then removed the sprint label on Apr 21, 2015
@jbenet (Member) commented Apr 21, 2015

@torarnv this is now in master. thank you very much!


@torarnv (Contributor, Author) commented Apr 21, 2015

Awesome! 🎉 Let me know if there are regressions the testsuite didn't catch, or similar issues/related areas that I could tackle as an extension of this.

@jbenet (Member) commented Apr 22, 2015

@torarnv no issues so far.

so-- related issues are daemon init + startup. there's enough issues around this to warrant a whole sprint: https://github.com/ipfs/go-ipfs/labels/daemon%20%2B%20init. may be worth picking off small ones of those, or scoping out what changes need to happen (there's plenty of cmdslib work to do there too-- cc @whyrusleeping)
