Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection Leak #150

Closed
bdollard opened this issue Jul 24, 2015 · 19 comments
Closed

Connection Leak #150

bdollard opened this issue Jul 24, 2015 · 19 comments
Labels

Comments

@bdollard
Copy link

I've found a connection leak when using sente with Immutant in AJAX mode. I'm not sure if it is a sente bug or an immutant bug.

Step to reproduce.

  1. Clone https:/bdollard/sente (this has just been modified to use immutant and ajax and disable broadcasting) and run the example project
  2. leave the browser window open and and observe the number of connections in CLOSE_WAIT status gradually grow as longpolling connections time out.

Eventually the open files limit can be reached, making the server non-responsive. I did a little debugging, and it looks like on-close handler is not getting called for connections that time out.
This may be the cause of #146

@tobias
Copy link
Collaborator

tobias commented Jul 24, 2015

@bdollard - I see your example uses Immutant 2.0.1. Have you tried 2.0.2? It has a fix that specifically addresses the on-close handler not getting called, and may fix the issue.

@bdollard
Copy link
Author

I just updated the fork to use 2.0.2, It does not appear to make any difference.

@tobias
Copy link
Collaborator

tobias commented Jul 24, 2015

I can recreate the issue with your example, even with 2.0.2. This is most likely a bug in Immutant itself, not Sente. I created https://issues.jboss.org/browse/IMMUTANT-572 to track it there, and can take a look on Monday. If the fix is fairly straightforward, I should be able to release 2.0.3 pretty quickly.

@ptaoussanis
Copy link
Member

Thanks for grabbing this @tobias! Please feel free to ping me if it turns out the trouble might be on Sente's end.

Cheers :-)

@tobias
Copy link
Collaborator

tobias commented Jul 28, 2015

It turns out this is an issue with a collision of assumptions made by
Undertow (the web server inside Immutant) and Sente.

Sente assumes that if the client closes a connection, the server will
detect and close as well, without any interaction from Sente
itself. Http-kit does this, I believe because it calls NIO's select()
in a loop, which may notice and close the CLOSE_WAIT sockets
immediately. When Http-kit closes the socket, that bubbles up and
triggers Sente's :on-close handler to properly remove the channel from
the connection set.

Undertow, however, treats connections differently - if you mark the
request as async, it doesn't check it again (by default) until you try
to read from, write to, or close it. At that point it will notice the
half-closed state, and close the socket fully. It assumes the caller
will eventually write to or close the request. If you never write to
the connection, then the half-closed socket is never detected (so gets
left in CLOSE_WAIT) and the :on-close callback never gets triggered.

We're considering implementing per-channel idle timeouts in Immutant
2.1, which could be used in situations where you can't guarantee that
the channel will be used again, but I think a more reliable fix here
would be for Sente to check for an existing connection when a client
(re)connects, and close that old connection if so. Doing so would fix
the problem in Immutant, and shouldn't change behavior in Http-kit.

@ptaoussanis - would you be open to a patch that does that? If so, I
could likely provide one this week.

@ptaoussanis
Copy link
Member

@tobias Hey Toby, thanks for chasing this down (+ for the detailed report)!

I think a more reliable fix here would be for Sente to check for an existing connection when a client
(re)connects, and close that old connection if so.

Just to clarify: what would happen if a client closes a connection and then leaves (i.e. doesn't reconnect)? Will the server not continue hanging on to the closed connection?

Sente's internal logic currently depends on every server-side connection eventually triggering its :on-close for garbage collection and user-id activity monitoring.

@tobias
Copy link
Collaborator

tobias commented Jul 29, 2015

@ptaoussanis - you are correct - if a a client closes a connection and never returns, the garbage collection never gets triggered, and the socket would also remain in CLOSE_WAIT indefinitely. I hadn't fully considered that case. Some ways to catch that would be:

  1. a broadcast heartbeat from the server at some interval greater than the client's :lp-kalive-ms
  2. a go loop on the server around a channel that has a timeout on it, where that timeout is also an interval greater than :lp-kalive-ms. When the timeout expires, all :ajax connections where (< (+ ?udt-last-connected timeout) now) are closed.
  3. wait for Immutant 2.1 (early September, assuming we add per-async-channel timeouts, which is likely), and have the Immutant adapter set a timeout for each channel.

Options 1 + 2 add complexity to Sente itself, and may require exposing yet another option in make-channel-socket!, so the user can specify the timeout if they change :lp-kalive-ms on the client side.

Option 3 would also require exposing the timeout option for the case where the client changes :lp-kalive-ms, but would only need to be an option to taoensso.sente.server-adapters.immutant/immutant-adapter.

Options 1 + 2 would still require closing the connection when swapping it in conns_ on reconnect, but option 3 does not.

Of these options, I think something like option 3 might be the best choice.

Note that, as a workaround until this is fixed, an app using Immutant can broadcast a heartbeat to every uid at an interval less than :lp-kalive-ms, though that won't prevent leaking sockets that reconnected after a period less than the broadcast interval (for whatever reason), but will prevent leakage of reconnections that occur normally.

I'm also roping in @jcrossley3 to see if he has any other ideas (he pointed out the issue with clients not reconnecting offline after my comment here yesterday, and may have some additional insight).

@ptaoussanis
Copy link
Member

Hi Toby,

Have you guys considered the possibility of modding Immutant to provide a close mechanism with similar characteristics to http-kit? No idea what underlying architecture you're working with, but it might be a nice property to have if possible?

If that's tough for whatever reason, will try find some time to dig into this properly. Just a heads-up that I'm in a crunch period atm so might not be able to respond as quickly as usual. Any idea how many Immutant users are using Sente in production? Can try prioritise this if it's urgent.

@tobias
Copy link
Collaborator

tobias commented Jul 31, 2015

Peter:

One issue we have is we have to support multiple architectures - when
used outside of an application server, we use Undertow directly, and
may be able to do something similar to Http-kit since we have direct
access to Undertow classes. But inside a WildFly (which is
Undertow-based) (and soon EAP, which is built around jbossweb, a fork of Tomcat) container, we don't talk to the implementation directly,
but only to the Servlet and Websocket JSR APIs, which hide lower-level
details from us. Even if we could add this functionality when using
Undertow directly, we strive to provide the same feature set and
functionality across all the architectures. None of which is your
problem, I'm just giving you some background on where we are coming
from :)

As for the number of Sente + Immutant users in production - I don't
have a good feel for that, but I suspect it's fairly small, so I don't
think this is urgent. (@bdollard: are you using it in production?)

I have started implementing the idle timeout features for Immutant
2.1, and will probably have an incremental release to play with next
week. If that can solve this issue, the only changes to Sente would be
within the Immutant adapter itself.

@bdollard
Copy link
Author

I was using it production. However, I went back to http-kit, which is working fine, so I don't urgently need a fix. If it's going to be a while, I do think that people should be warned against using the immutant adapter until this is fixed.

@akhudek
Copy link

akhudek commented Aug 28, 2015

Regarding production use, my company is currently testing sente + immutant for production.

@ptaoussanis
Copy link
Member

@akhudek Hi Alexander, just a heads-up that I'm not sure if there's been any resolution on this issue yet.

@tobias Any word on Immutant's end? Sorry for not being more hands-on with this; just have my hands completely full atm.

@tobias
Copy link
Collaborator

tobias commented Aug 28, 2015

@ptaoussanis We'll be releasing 2.1.0 next week, which will include timeouts that should resolve this. They are already in our incremental builds, I'll put together a patch to sente today that uses the timeouts to close the sockets. Sorry I haven't done that sooner.

@ptaoussanis
Copy link
Member

Awesome, thanks Toby! Absolutely no problem.

@tobias
Copy link
Collaborator

tobias commented Aug 28, 2015

@ptaoussanis I have a question about the :lp-kalive-ms option to the cljs make-channel-socket! - should that be :lp-timeout-ms instead? The code only uses the latter, which isn't documented (ref: https:/ptaoussanis/sente/blob/dev/src/taoensso/sente.cljx#L1008).

That same timeout will be used by the Immutant adapter to cleanup lp sockets, so I need to expose it as an option when you create an immutant adapter server-side, and need to which is the correct name.

@ptaoussanis
Copy link
Member

@tobias Yes, that's a typo - the docstring should show :lp-timeout-ms. Will update now, thanks for the catch.

tobias added a commit to tobias/sente that referenced this issue Aug 28, 2015
This sets a timeout on the server-side for long-polling connections
so they get cleaned up when the client closes as part of its standard
ping mechanism or goes away.

This requires a recent Immutant incremental
build (http://immutant.org/builds/2x/) or Immutant 2.1.0, once it is
released.
tobias added a commit to tobias/sente that referenced this issue Aug 28, 2015
This sets a timeout on the server-side for long-polling connections
so they get cleaned up when the client closes as part of its standard
ping mechanism or goes away.

This requires a recent Immutant incremental
build (http://immutant.org/builds/2x/) or Immutant 2.1.0, once it is
released.
@tobias
Copy link
Collaborator

tobias commented Aug 28, 2015

@bdollard, @akhudek - I've published a fork of Sente with the above fix to clojars, would you be willing to give it a try ([org.clojars.tcrawley/sente "1.7.0-SNAPSHOT"])?

You'll also need to use a recent incremental of Immutant with it (http://immutant.org/builds/2x/), by adding the following to your project.clj:

:dependencies [[org.clojars.tcrawley/sente "1.7.0-SNAPSHOT"]
               [org.immutant/web "2.x.incremental.636"]]
:repositories [["Immutant incremental builds"
                "http://downloads.immutant.org/incremental/"]]

I've used this with the example app (with bdollard's changes to force the long-polling connections to reach full timeout), and no longer see a socket leak.

tobias added a commit that referenced this issue Sep 2, 2015
Use timeouts for lp connections to prevent socket leaks [#150]
@tobias
Copy link
Collaborator

tobias commented Sep 2, 2015

I've merged #159 to fix the thread leak issues, and updated the example app to use Immutant 2.1.0. @ptaoussanis - would you mind releasing a new alpha at your convenience?

xfeep added a commit to nginx-clojure/sente that referenced this issue Sep 3, 2015
tobias added a commit that referenced this issue Sep 3, 2015
…ks (@tobias)

This sets a timeout on the server-side for long-polling connections
so they get cleaned up when the client closes as part of its standard
ping mechanism or goes away.

This requires a recent Immutant incremental
build (http://immutant.org/builds/2x/) or Immutant 2.1.0, once it is
released.
@ptaoussanis
Copy link
Member

@tobias: Thanks, will cut a release later today - just juggling some work atm.

Closing this for now, we can reopen if any other servers pop up in future with a similar issue.

ptaoussanis added a commit that referenced this issue May 11, 2016
Not esp. necessary since #159; Immutant remains the only server I'm aware of
that'd benefit from this (?).

In any event may not be a bad idea to have this in place. Also provides
symmetry since we now have a WebSocket conn gc equivalent.
ptaoussanis added a commit that referenced this issue May 25, 2016
v1.9.0 is a significant, non-breaking upgrade focused on cleaning up Sente's
implementation to make it more robust, and to help address a couple of issues
that were recently identified (#201 and #230).

TODO -
  [ ] Test new pings, timeouts, etc. (http-kit)
  [ ] Test new pings, timeouts, etc. (Immutant)
  [ ] Final round of housekeeping
  [ ] Cut an alpha1 release

This is a squashed commit that includes:

- Significant code refactor.

- Extended ref example to incl. a pair of buttons to excercise async push
  features.

- Drop (experimental) flexi packer.

- [#161] Clojure-side Transit optimizations
  Specifically:
    1. Now cache read-handlers and write-handlers.
       Can't tell if this actually realistically helps perf since I've
       got no handler maps to test against; feedback welcome.

    2. Now cache (thread-local) writer (allows baos reuse).
       Doesn't seem to actually realistically help perf from what I can
       tell. Might nix this later since this does add some complexity
       to the impl.

- [#201] Add support for more flexible conn-type upgrade/downgrade
  Initial downgrade strategy here is simple, may or may not turn out to be
  useful; still need to check. Point was to get a more flexible base that
  we can build on in future.

- [#230] Server-side ping to help GC non-terminating WebSocket conns.

  If a WebSocket connection is closed without normal termination (e.g. as
  caused by sudden loss of power, airplane mode, etc.) - it could hang
  around in conns_ for an extended period of time until the underlying TCP
  connection was identified as dead.

  This mods Sente's keep-alive ping from client->server to server->client,
  allowing the server to identify (and auto-gc) dead but abnormally
  terminated WebSocket connections.

  Big thanks to @altV for helping to catch + diagnose this issue.

- [#230] Experimental: tweak timeout defaults to help suppress possible
  Heroku warnings.

  With the old values, abnormally terminated conns (e.g. sudden airplane
  mode) could cause conns to be forcefully terminated by Heroku with an
  "idle connection" warning message[1].

  With the new values, our gc should normally kick in before Heroku's.

  Waiting on confirmation that this commit actually helps resolve these
  warnings.

  [1] Ref. https://devcenter.heroku.com/articles/error-codes#h15-idle-connection

- [#150 #159] Allow server to gc lp conns

  Using the infrastructure in place for #230, have decided to now
  initiate Ajax timeouts from the server side instead of the client side.

  As with #230, this'll help the underlying http server gc any abnormal
  connections.

  In particular, this should (?) resolve #159 (for Immutant) w/o the need
  for the earlier workaround that was introduced.
ptaoussanis added a commit that referenced this issue Jun 4, 2016
v1.9.0 is a significant, non-breaking upgrade focused on addressing a
couple minor issues that were recently identified (#201 and #230). While
in the code, also took the opportunity to refactor some implementation
details.

- @ptaoussanis

This is a squashed commit that includes:

- Extend ref example to incl. a pair of buttons to excercise async push
  features.

- Drop (experimental) flexi packer.

- [#161] Clojure-side Transit optimizations
  Specifically:
    1. Now cache read-handlers and write-handlers.
       Can't tell if this actually realistically helps perf since I've
       got no handler maps to test against; feedback welcome.

    2. Now cache (thread-local) writer (allows baos reuse).
       Doesn't seem to actually realistically help perf from what I can
       tell. Might nix this later since this does add some complexity
       to the impl.

- [#201] Add support for more flexible conn-type upgrade/downgrade
  Initial downgrade strategy here is simple, may or may not turn out to be
  useful; still need to check. Point was to get a more flexible base that
  we can build on in future.

- [#230] Server-side ping to help GC non-terminating WebSocket conns.

  If a WebSocket connection is closed without normal termination (e.g. as
  caused by sudden loss of power, airplane mode, etc.) - it could hang
  around in conns_ for an extended period of time until the underlying TCP
  connection was identified as dead.

  This mods Sente's keep-alive ping from client->server to server->client,
  allowing the server to identify (and auto-gc) dead but abnormally
  terminated WebSocket connections.

  Big thanks to @altV for helping to catch + diagnose this issue.

  This change should also help to suppress possible "idle connection"
  Heroku warnings, Ref. https://goo.gl/zLR0Gk

- [#150 #159] Allow server to gc lp conns

  Using the infrastructure in place for #230, have decided to now
  initiate Ajax timeouts from the server side instead of the client side.

  As with #230, this'll help the underlying http server gc any abnormal
  connections.

  In particular, this should (?) resolve #159 (for Immutant) w/o the need
  for the earlier Immutant-side workaround introduced for that issue.
ptaoussanis added a commit that referenced this issue Jun 9, 2016
v1.9.0 is a significant, non-breaking upgrade focused on addressing a
couple minor issues that were recently identified (#201 and #230). While
in the code, also took the opportunity to refactor some implementation
details.

- @ptaoussanis

This is a squashed commit that includes:

- Extend ref example to incl. a pair of buttons to excercise async push
  features.

- Drop (experimental) flexi packer.

- [#161] Clojure-side Transit optimizations
  Specifically:
    1. Now cache read-handlers and write-handlers.
       Can't tell if this actually realistically helps perf since I've
       got no handler maps to test against; feedback welcome.

    2. Now cache (thread-local) writer (allows baos reuse).
       Doesn't seem to actually realistically help perf from what I can
       tell. Might nix this later since this does add some complexity
       to the impl.

- [#201] Add support for more flexible conn-type upgrade/downgrade
  Initial downgrade strategy here is simple, may or may not turn out to be
  useful; still need to check. Point was to get a more flexible base that
  we can build on in future.

- [#230] Server-side ping to help GC non-terminating WebSocket conns.

  If a WebSocket connection is closed without normal termination (e.g. as
  caused by sudden loss of power, airplane mode, etc.) - it could hang
  around in conns_ for an extended period of time until the underlying TCP
  connection was identified as dead.

  This mods Sente's keep-alive ping from client->server to server->client,
  allowing the server to identify (and auto-gc) dead but abnormally
  terminated WebSocket connections.

  Big thanks to @altV for helping to catch + diagnose this issue.

  This change should also help to suppress possible "idle connection"
  Heroku warnings, Ref. https://goo.gl/zLR0Gk

- [#150 #159] Allow server to gc lp conns

  Using the infrastructure in place for #230, have decided to now
  initiate Ajax timeouts from the server side instead of the client side.

  As with #230, this'll help the underlying http server gc any abnormal
  connections.

  In particular, this should (?) resolve #159 (for Immutant) w/o the need
  for the earlier Immutant-side workaround introduced for that issue.
ptaoussanis added a commit that referenced this issue Jun 10, 2016
v1.9.0 is a significant, non-breaking upgrade focused on addressing a
couple minor issues that were recently identified (#201 and #230). While
in the code, also took the opportunity to refactor some implementation
details.

- @ptaoussanis

This is a squashed commit that includes:

- Extend ref example to incl. a pair of buttons to excercise async push
  features.

- Drop (experimental) flexi packer.

- [#161] Clojure-side Transit optimizations
  Specifically:
    1. Now cache read-handlers and write-handlers.
       Can't tell if this actually realistically helps perf since I've
       got no handler maps to test against; feedback welcome.

    2. Now cache (thread-local) writer (allows baos reuse).
       Doesn't seem to actually realistically help perf from what I can
       tell. Might nix this later since this does add some complexity
       to the impl.

- [#201] Add support for more flexible conn-type upgrade/downgrade
  Initial downgrade strategy here is simple, may or may not turn out to be
  useful; still need to check. Point was to get a more flexible base that
  we can build on in future.

- [#230] Server-side ping to help GC non-terminating WebSocket conns.

  If a WebSocket connection is closed without normal termination (e.g. as
  caused by sudden loss of power, airplane mode, etc.) - it could hang
  around in conns_ for an extended period of time until the underlying TCP
  connection was identified as dead.

  This mods Sente's keep-alive ping from client->server to server->client,
  allowing the server to identify (and auto-gc) dead but abnormally
  terminated WebSocket connections.

  Big thanks to @altV for helping to catch + diagnose this issue.

  This change should also help to suppress possible "idle connection"
  Heroku warnings, Ref. https://goo.gl/zLR0Gk

- [#150 #159] Allow server to gc lp conns

  Using the infrastructure in place for #230, have decided to now
  initiate Ajax timeouts from the server side instead of the client side.

  As with #230, this'll help the underlying http server gc any abnormal
  connections.

  In particular, this should (?) resolve #159 (for Immutant) w/o the need
  for the earlier Immutant-side workaround introduced for that issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants