Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for a child to finish in Watcher.reap_process #1036

Merged
merged 2 commits into from
May 10, 2017

Conversation

asterite3
Copy link

Watcher.reap_process() uses os.waitpid() with WNOHANG flag to get child process status. It had an OSError exception handler checking if errno equals EAGAIN, in which case it would sleep a bit and call waitpid again. This mechanism was probably added to handle a case when child process has not died yet, so waitpid would block if not for WNOHANG.

However, in case a child with such a pid is still alive, os.waitpid() won't throw an exception, instead it will return a tuple (0, 0), as stated in the docs (underlying function waitpid from libc does the same, returning 0 in this case).

As a result, reap_process() failed to correctly check if a child was still alive, causing it to

  • before commit 7e35727b, busy-wait for a child to stop (or, with gevent-patched waitpid, to fall into an endless loop, which, I beleive, is a reason for start command results in endless loop #802)
  • in current version, unconditionally succeed after the first call to waitpid

In fact, reap_process() would always finish immediately despite child's state. This can be a problem, for example, if a child was SIGKILL'ed by the Watcher, but has not exited yet, causing #1023.

As in #1023, if a child is in a disk sleep state, it won't exit after SIGKILL. I think the easiest way to get a process hang in disk sleep is to use sshfs. One can mount a directory from a VM using sshfs.

sudo sshfs [email protected]:/tmp /tmp/test-mount -o allow_other -o IdentityFile=id_rsa

Now, if we put the VM with address 192.168.59.3 on pause, processes accessing /tmp/test-mount will hang in un-interruptible sleep. The following test code can issustrate the problem:

from circus.watcher import Watcher 
from tornado.ioloop import IOLoop
from tornado import gen

watcher = Watcher('test', 'ls', ["/tmp/test-mount/"], graceful_timeout=2.0)

@gen.coroutine
def stop_watcher_after_timeout():
    print 'sleeping'
    yield gen.sleep(0.5)
    print 'stopping watcher'
    yield watcher.stop()
    print 'stopped'

@gen.coroutine
def run():
    yield [watcher.start(), stop_watcher_after_timeout()]
IOLoop.instance().run_sync(run)
print 'done'

Result:

$ python test.py 
sleeping
stopping watcher
stopped
done # watcher finishes quickly, leaving ls running
$ ps aux|grep ls |grep test-mount
asterite  7454  0.0  0.0  16900   936 ?        Ds   02:04   0:00 ls /tmp/test-mount/

With this fix, watcher will hang until a process exits (once VM is un-paused).
A similar test can be made for Arbiter: https://gist.github.com/asterite3/4e5a2bafdfbea504454b2c1388e93559

Watcher.reap_process() used os.waitpid(pid, os.WNOHANG)
to check process status and assumed waitpid() would throw
OSError with errno == errno.EAGAIN if the process was
still running. This was wrong, waitpid() will return a
tuple (0, 0) for living child process, so a correct way
to handle child processes still running is to check for
that value.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 62.91% when pulling f1d1e6a on asterite3:reap_process_waitpid_fix into c7d79ab on circus-tent:master.

@asterite3
Copy link
Author

There is another place in the code with the same pattern, Arbiter.reap_processes():

    def reap_processes(self):
           # ...
            while True:
                try:
                    # wait for our child (so it's not a zombie)
                    pid, status = os.waitpid(-1, os.WNOHANG)
                    if not pid:
                        break

                    if pid in watchers_pids:
                        watcher = watchers_pids[pid]
                        watcher.reap_process(pid, status)
           # ...

I am not sure how to fix this one, though. Arbiter.reap_processes() is called by Arbiter.manage_watchers(), which is called periodically by Controller. So, blocking on still running child processes would break everything, right? It definitely breaks the tests.
Maybe the logic with e.errno == errno.EAGAIN should just be removed from Arbiter.reap_processes()? Why are there even two places where children are wait'ed ? Can't Arbiter rely on Watchers to reap child processes?

@asterite3
Copy link
Author

What about using tornado_sleep() instead of time.sleep() for waiting before next waitpid() attempt? It would make waiting for children cooperative. And with such a replacement in Watcher.reap_process() tests passed OK.

@asterite3
Copy link
Author

Tests finished successfully, but after that build continued to hang and failed with timeout: looks like a stalled build like here #1019 (comment) or this build of master branch.

@k4nar
Copy link
Contributor

k4nar commented Apr 12, 2017

Thanks for this write-up :) . Indeed something looks fishy here, and it would explain some issues.

Why are there even two places where children are wait'ed ? Can't Arbiter rely on Watchers to reap child processes?

If I'm remembering correctly, in Arbiter.reap_processes we only check if any of our children has died. Then if one died, we reap it in Watcher.reap_process.

What about using tornado_sleep() instead of time.sleep() for waiting before next waitpid() attempt?

Indeed. From what I've git-blamed, this sleep predates the integration of Tornado. If the tests are passing I think we could go for it. Maybe in a separate PR though.

Tests finished successfully, but after that build continued to hang and failed with timeout: looks like a stalled build like here #1019 (comment) or this build of master branch.

Yup :( . The best way right now it to relaunch those tests…

@@ -457,12 +457,13 @@ def reap_process(self, pid, status=None):
continue
else:
try:
_, status = os.waitpid(pid, os.WNOHANG)
except OSError as e:
if e.errno == errno.EAGAIN:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so there isn't any case where waitpid raises a EAGAIN ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there isn't, EAGAIN is not listed in man page for waitpid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed

@asterite3
Copy link
Author

Yup :( . The best way right now it to relaunch those tests…

Should I do anything for that? Add a commit replacing time.sleep() with tornado_sleep() for example?

@asterite3
Copy link
Author

If I'm remembering correctly, in Arbiter.reap_processes we only check if any of our children has died. Then if one died, we reap it in Watcher.reap_process.

So, Arbiter.reap_processes() is a method to get notified that some child process exited and should not block if there are children still running. A branch with if e.errno == errno.EAGAIN: in except then does not mean much and is just a code that never gets executed (unless something inside psutil or pyzmq methods called from Watcher.reap_process throws OSError with EAGAIN, which should not happen).

I think it's better be removed in this case.

@k4nar
Copy link
Contributor

k4nar commented Apr 13, 2017

Ok to remove the EAGAIN case in Arbiter.reap_processes.

About the stalled builds, don't worry about it I'll relaunch them.

try block in Arbiter.reap_processes contains a call
to os.waitpid and to Watcher.reap_process. Python's
os.waitpid does not throw OSError with errno set to
EAGAIN, functions called by Watcher.reap_process
also should not do that - so, there is no reason to
check for that errno value and handle it.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.06%) to 62.902% when pulling ac3fd44 on asterite3:reap_process_waitpid_fix into c7d79ab on circus-tent:master.

@k4nar
Copy link
Contributor

k4nar commented May 10, 2017

Awesome contribution, thanks!

@k4nar k4nar merged commit 880fa2d into circus-tent:master May 10, 2017
@asterite3 asterite3 deleted the reap_process_waitpid_fix branch January 22, 2020 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants