Wait for a child to finish in Watcher.reap_process #1036
Watcher.reap_process() used os.waitpid(pid, os.WNOHANG) to check process status and assumed waitpid() would raise OSError with errno == errno.EAGAIN if the process was still running. This was wrong: waitpid() returns the tuple (0, 0) for a living child process, so the correct way to handle child processes that are still running is to check for that value.
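The behaviour described above can be checked with a small standalone script. This is only a sketch, not code from this PR: the helper name probe_child and the sleeping child command are made up for illustration, and it assumes a POSIX system.

```python
import os
import signal
import sys


def probe_child(pid):
    """Non-blocking status check; returns (0, 0) while the child is alive."""
    return os.waitpid(pid, os.WNOHANG)


# Start a child that sleeps long enough to still be alive at the first probe.
child_pid = os.spawnv(
    os.P_NOWAIT,
    sys.executable,
    [sys.executable, "-c", "import time; time.sleep(30)"],
)

# The child is running: no exception is raised, the call just returns (0, 0).
alive_result = probe_child(child_pid)

# Kill the child and reap it with a blocking wait so no zombie is left behind.
os.kill(child_pid, signal.SIGKILL)
reaped_pid, reaped_status = os.waitpid(child_pid, 0)
```

Only after the child has actually exited does waitpid() return its real pid and a non-trivial status.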
There is another place in the code with the same pattern:

    def reap_processes(self):
        # ...
        while True:
            try:
                # wait for our child (so it's not a zombie)
                pid, status = os.waitpid(-1, os.WNOHANG)
                if not pid:
                    break
                if pid in watchers_pids:
                    watcher = watchers_pids[pid]
                    watcher.reap_process(pid, status)
            # ...

I am not sure how to fix this one, though.
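For reference, a non-blocking "reap everything that has already exited" loop can be written without any EAGAIN handling. The sketch below is not the Arbiter code: reap_all_children is a hypothetical helper, and it assumes a POSIX system where waitpid(-1, ...) raises ChildProcessError (ECHILD) once there are no children left.

```python
import os
import sys
import time


def reap_all_children():
    """Collect every child that has already exited, without blocking."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # ECHILD: this process has no children left to wait for.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped[pid] = status
    return reaped


# Demo: start two short-lived children and poll until both are reaped.
started = [
    os.spawnv(os.P_NOWAIT, sys.executable, [sys.executable, "-c", "pass"])
    for _ in range(2)
]
collected = {}
deadline = time.monotonic() + 30
while len(collected) < len(started) and time.monotonic() < deadline:
    collected.update(reap_all_children())
    time.sleep(0.05)
```

The two exits from the loop correspond to the two "nothing to do right now" cases: a return value of (0, 0) and ECHILD.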
What about using |
Tests finished successfully, but after that build continued to hang and failed with timeout: looks like a stalled build like here #1019 (comment) or this build of master branch. |
Thanks for this write-up :) . Indeed something looks fishy here, and it would explain some issues.
If I'm remembering correctly, in
Indeed. From what I've git-blamed, this sleep predates the integration of Tornado. If the tests are passing I think we could go for it. Maybe in a separate PR though.
Yup :( . The best way right now is to relaunch those tests…
@@ -457,12 +457,13 @@ def reap_process(self, pid, status=None):
                continue
            else:
                try:
                    _, status = os.waitpid(pid, os.WNOHANG)
                except OSError as e:
                    if e.errno == errno.EAGAIN:
so there isn't any case where waitpid raises a EAGAIN ?
I think there isn't, EAGAIN is not listed in man page for waitpid.
Indeed
Should I do anything for that? Add a commit replacing |
So, I think it's better be removed in this case. |
Ok to remove the EAGAIN case in About the stalled builds, don't worry about it I'll relaunch them. |
The try block in Arbiter.reap_processes contains calls to os.waitpid and to Watcher.reap_process. Python's os.waitpid does not raise OSError with errno set to EAGAIN, and the functions called by Watcher.reap_process should not do that either, so there is no reason to check for that errno value and handle it.
Awesome contribution, thanks! |
Watcher.reap_process() uses os.waitpid() with the WNOHANG flag to get child process status. It had an OSError exception handler checking if errno equals EAGAIN, in which case it would sleep a bit and call waitpid again. This mechanism was probably added to handle the case when a child process has not died yet, so waitpid would block if not for WNOHANG.

However, if a child with such a pid is still alive, os.waitpid() won't raise an exception; instead it will return the tuple (0, 0), as stated in the docs (the underlying function waitpid from libc does the same, returning 0 in this case).

As a result, reap_process() failed to correctly check if a child was still alive (and the handler could fall into an endless loop, which, I believe, is a reason for "start command results in endless loop" #802).
In fact, reap_process() would always finish immediately regardless of the child's state. This can be a problem, for example, if a child was SIGKILL'ed by the Watcher but has not exited yet, causing #1023.

As in #1023, if a child is in a disk sleep state, it won't exit after SIGKILL. I think the easiest way to get a process to hang in disk sleep is to use sshfs. One can mount a directory from a VM using sshfs.

Now, if we put the VM with address 192.168.59.3 on pause, processes accessing /tmp/test-mount will hang in uninterruptible sleep. The following test code can illustrate the problem:

Result:
With this fix, watcher will hang until a process exits (once VM is un-paused).
A similar test can be made for Arbiter: https://gist.github.com/asterite3/4e5a2bafdfbea504454b2c1388e93559
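To make the "wait until the child finishes" idea concrete, here is a minimal polling loop built on the (0, 0) return value. It is a sketch under assumptions, not the exact change in this PR: wait_for_exit and poll_interval are hypothetical names, and it assumes a POSIX system.

```python
import os
import sys
import time


def wait_for_exit(pid, poll_interval=0.1):
    """Block until child `pid` exits, polling with WNOHANG; return its status."""
    while True:
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == 0:
            # (0, 0): the child is still alive, so sleep and check again.
            time.sleep(poll_interval)
            continue
        return status


# Demo: a child that exits by itself after half a second.
child_pid = os.spawnv(
    os.P_NOWAIT,
    sys.executable,
    [sys.executable, "-c", "import time; time.sleep(0.5)"],
)
exit_status = wait_for_exit(child_pid)
```

A loop like this keeps waiting for as long as the child stays alive, which matches the behaviour described above for a process stuck in disk sleep: the caller hangs until the process can finally exit.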