
I/O operation on closed file #979

Open
thefab opened this issue May 4, 2016 · 8 comments

Comments

@thefab
Contributor

thefab commented May 4, 2016

At Meteo-France, we were happy with circus 0.11.1.

We recently upgraded to 0.13.0, and now we have a nasty bug that is very difficult to reproduce.

Context:

  • we have many (> 100) watchers
  • watchers often exit with a normal exit code and circus relaunches them

After a few hours, some watchers end up in a bad state:

  • circusctl numprocesses our_watcher => 0
  • circusctl options our_watcher => shows numprocesses=1 (the value from the configuration file; it is never updated dynamically)
  • circus doesn't even try to spawn a new process (no log at all about this watcher, even in debug mode)
  • I verified with "circusctl ipython" that the watcher attributes are ok (respawn, ...); reading the code and checking the corresponding values, circus should spawn a new process (but it doesn't)
  • circusctl incr our_watcher (just to test) returns an interesting error:
error: command '{"command":"incr","properties":{"waiting":false,"name":"stage1"},"id":"ada1c018e92c42b2997668931e1d8dbc"}': 
I/O operation on closed file
  • the manage_processes() method of the corresponding watcher is called (I added a "print" to be sure)

I'm still debugging this, but I think there is a hidden exception somewhere, probably around stdout/stderr management (which would explain "I/O operation on closed file"), and likely a concurrency issue of some kind.
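
For reference, the message itself is Python's generic complaint about touching a file object that has already been closed; a tiny illustration, nothing circus-specific:

f = open('/dev/null', 'rb')
f.close()
try:
    f.fileno()
except ValueError as exc:
    print(exc)    # I/O operation on closed file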

If someone has ideas to share :-)

Thanks

@k4nar
Contributor

k4nar commented May 4, 2016

Thanks for this report. There is a pretty big changeset between 0.11.1 and 0.13, so I guess it won't be easy to find the offending change. Maybe it has something to do with PR #885?

@thefab
Contributor Author

thefab commented May 4, 2016

Yep.

#933 seems to be related too.

@thefab
Contributor Author

thefab commented May 9, 2016

Some progress here:

It's the same bug as #933 (#885 is fine).

With "circusctl ipython", when the problem occurs, I see:

# stage2 is the watcher object

In [30]: stage2.status()
Out[30]: 'active'

In [31]: stage2.processes
Out[31]: {}

In [32]: stage2.numprocesses
Out[32]: 1

In [33]: stage2.respawn
Out[33]: True

more interesting (linked with #933):

In [22]: stage2.stream_redirector
Out[22]: <circus.stream.redirector.Redirector at 0x7f2343504f90>

In [23]: stage2.stream_redirector.pipes
Out[23]: 
{254: ('stdout',
  <circus.process.Process at 0x7f2343504190>,
  <closed file '<fdopen>', mode 'rb' at 0x7f2341e7c420>)}

In [24]: stage2.stream_redirector._active
Out[24]: {254: <circus.stream.redirector.Handler at 0x7f2343504f10>}

In [25]: stage2.stream_redirector.running
Out[25]: True

In [26]: stage2.stream_redirector.redirect
Out[26]: 
{'stderr': <circus.stream.file_stream.FileStream at 0x7f234378a810>,
 'stdout': <circus.stream.file_stream.FileStream at 0x7f234378a7d0>}

In [27]: stage2.stream_redirector.pipes[254][2].fileno()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/synext/lib/python2.7/site-packages/circus-0.13.0-py2.7.egg/circus/commands/ipythonshell.pyc in <module>()
----> 1 stage2.stream_redirector.pipes[254][2].fileno()

ValueError: I/O operation on closed file

So in this particular case we have a closed pipe inside the "stream_redirector", and the watcher can't spawn another process because the "stream_redirector.start()" call fails at "pipe.fileno()".
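
A rough, self-contained simulation of that path (simplified, not the actual circus source): start() has to get the fd of every tracked pipe, so one stale, already-closed pipe aborts the whole call and the replacement process never gets spawned.

import os

def start_redirector(pipes):
    # register every tracked pipe by file descriptor, as start() must do
    for fd, (name, pipe) in pipes.items():
        fileno = pipe.fileno()    # raises ValueError on a closed pipe
        print('would register a handler for %s on fd %d' % (name, fileno))

r, w = os.pipe()
stale = os.fdopen(r, 'rb')
stale.close()                     # the pipe got closed but was never dropped
pipes = {r: ('stdout', stale)}    # stale bookkeeping, like pipes[254] above

try:
    start_redirector(pipes)
except ValueError as exc:
    print('spawn blocked: %s' % exc)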

But I don't understand the root cause. The problem seems to occur when the managed process exits by itself, but it's not easily reproducible, so there is probably a concurrency issue here.

Any other ideas?

@thefab
Contributor Author

thefab commented May 9, 2016

Note: when everything is ok, there are two entries in the "stream_redirector" for this watcher (one for stdout and one for stderr). The last action for the watcher in the logs seems to be "reaping process xxxxx".

@k4nar
Contributor

k4nar commented May 9, 2016

Maybe we are ignoring errors when closing the stream_redirector?
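
Purely as an illustration of the kind of silent failure meant here (not circus code), a broad except around the cleanup would hide the real error and leave the stale pipe behind:

def close_quietly(pipe):
    try:
        pipe.close()
    except Exception:
        pass    # error swallowed: nothing is logged, nothing is repaired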

@thefab
Contributor Author

thefab commented May 9, 2016

I still don't understand :-/ Maybe a kind of race condition between the file descriptors of different watchers: a descriptor closed in one watcher gets reused by another one at the OS level, circus gets out of sync at the stream_redirector level and closes the wrong one?

I'm lost :-(
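
The fd-reuse part of that theory is at least plausible at the OS level: once a descriptor is closed, the kernel is free to hand the same number back for the next pipe, so any table still keyed on the old number points at somebody else's pipe. A quick check:

import os

r1, w1 = os.pipe()
os.close(r1)
os.close(w1)
r2, w2 = os.pipe()        # the kernel reuses the lowest free descriptors
print(r1 == r2)           # usually True on Linux: same number, different pipe
os.close(r2)
os.close(w2)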

thefab added a commit to thefab/circus that referenced this issue May 11, 2016
avoid the main and blocking exception about "I/O operation on closed file"
with minimal changes and risks
@thefab
Contributor Author

thefab commented May 11, 2016

Some progress with the referenced commit: after a few hours with this commit, I have only one exception in the log file:

2016-05-11 06:01:08 tornado.general[2779] [DEBUG] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/home/synext/lib/python2.7/site-packages/tornado-4.1-py2.7-linux-x86_64.egg/tornado/ioloop.py", line 691, in remove_handler
    self._impl.unregister(fd)
  File "/home/synext/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 115, in unregister
    return self._poller.unregister(fd)
  File "/home/synext/lib/python2.7/site-packages/zmq/sugar/poll.py", line 71, in unregister
    idx = self._map.pop(socket)
KeyError: 273

But our processes are alive :-)

I still don't have the root cause (probably a concurrency issue between process reaping and process killing after the configured lifetime), but there is still something nasty here.

The referenced commit just limits the consequences.
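
The patch itself isn't reproduced here, but given its description ("avoid the main and blocking exception ... with minimal changes and risks"), the guard is presumably of this shape, sketched as an illustration rather than the actual diff (register is a stand-in for the IOLoop handler registration): skip and drop already-closed pipes instead of letting one of them abort the redirector and block respawn.

def register_pipes_safely(pipes, register):
    dead = []
    for fd, (name, pipe) in pipes.items():
        try:
            register(pipe.fileno(), name)
        except ValueError:
            dead.append(fd)       # already closed: remember it and keep going
    for fd in dead:
        del pipes[fd]             # drop the stale entry so it cannot block us again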

Note: #933 is a duplicate of this one: after fixing a logging configuration issue in a circus hook, I get the same exception there.

@k4nar
Contributor

k4nar commented Jun 11, 2016

I fixed a similar issue with the streams in PR #984. I'll let you know when it's merged.

k4nar pushed a commit that referenced this issue Aug 12, 2016
avoid the main and blocking exception about "I/O operation on closed file"
with minimal changes and risks