Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WS does not handle receiver errors correctly #1898

Closed
pimterry opened this issue Jun 7, 2021 · 4 comments · Fixed by #1899
Closed

WS does not handle receiver errors correctly #1898

pimterry opened this issue Jun 7, 2021 · 4 comments · Fixed by #1899

Comments

@pimterry
Copy link
Contributor

pimterry commented Jun 7, 2021

Extracted from #1892

Right now, when WS sees an error, it emits an error event, stops emitting data, and immediately destroys the socket (here). The destroy results in a close event firing, using the close code from the error that was thrown.

As far as I can tell, this isn't totally illegal, but it's not the correct behaviour.

Section 7 in the websocket RFC covers a bunch of definitions about how websockets should be cleanly or unhappily closed. From my reading, it says that when you want to fail an established websocket connection due to an error:

  • You should first send a close frame to the remote peer (see 7.1.7)
  • You should ignore any more incoming data, including remote close frames (see 7.1.7).
  • If the local peer is a server then it should half-close (SHUT_WR in TCP speak, socket.end() in node) the TCP connection, but not close it completely (not socket.destroy() it) until all incoming data has been read (see 7.1.1).
  • If the local peer is a client then it shouldn't shutdown the connection at all, and instead it should wait for the server to send a close frame and initiate TCP shutdown before it half-closes the socket itself (i.e. it should wait for an 'end' event, then call .end() itself), unless a timeout has passed which suggests the server isn't going to shut the socket (see 7.1.1).
  • When the connection is fully closed, it should be considered to be closed with close code 1006. The connection should only be considered to be closed with other error close codes if the peer received that error code (see 7.1.5).

This is designed to make sure that all peers know why a connection closed, and to make sure the underlying TCP connection is closed cleanly wherever possible.

Hard closing connections is allowed, but only if strictly required (e.g. if the server is under attack) or if there's good reason to believe the remote client can't handle normal shutdown (e.g. the TCP connection itself is broken already).

The main practical differences I see are that right now when there's an error, WS:

  • doesn't send a close code to the remote peer.
  • always completely destroys the socket immediately.
  • fires a 'close' event using the sent status code, not 1006.

I think I'm translating between the TCP-based descriptions in the spec and the node world correctly here, but it would be good to check that - it might be that node handles some of this for us under the hood.

@lpinca
Copy link
Member

lpinca commented Jun 7, 2021

The main practical differences I see are that right now when there's an error, WS:

  • doesn't send a close code to the remote peer.
  • always completely destroys the socket immediately.
  • fires a 'close' event using the sent status code, not 1006.

See #1892 (comment) and #1892 (comment). Waiting for the other peer when an error occurs is a no go because it could be done on purpose.

We could try to send the close frame and change the close code to 1006 when an error occurs but

  1. The first part it's a waste of resources as per Expose error status codes #1892 (comment).
  2. The second part (changing the status code to 1006) is a big breaking change.

@lpinca
Copy link
Member

lpinca commented Jun 7, 2021

I also think that closing with 1006 is a bit odd. Imagine that an invalid frame is received on the server. A 1002 status code is sent to the client. The server must stop reading data so the connection is closed with 1002 on the client and 1006 on the server but in my opinion that info is way more valuable on the server than on the client. Assuming that no error occurs in this case (like in most WebSocket implementations) that info is simply lost and the connection is closed with a generic 1006 status code.

@pimterry
Copy link
Contributor Author

pimterry commented Jun 7, 2021

Waiting for the other peer when an error occurs is a no go because it could be done on purpose.

I'm not sure what this means, can you explain? Is there a way this can be abused? I'm a bit surprised by that, since it's the standard mechanism from the spec and this is how it appears to be implemented everywhere else.

We could try to send the close frame ... but it's a waste of resources

I can see how it could be difficult in some cases, but it doesn't seem like a waste or overkill to me.

It's useful to look at this in analogy to HTTP. You can make the exact same argument that HTTP servers should not return error status codes, and should just close connections on invalid requests instead: it lets you close connections quicker, it's more efficient, and the connection might be broken by sufficiently invalid requests anyway. The spec text is similar, HTTP servers can silently close connections in response to errors.

Still, nobody does this - it's very helpful for clients to receive clear error responses when errors occur, wherever possible.

Since the websocket spec encourages sending error codes, and it's easily possible, it would be good to do so. Trying to send a close frame and then timing out seems reasonable, and fits the spec's guidelines for when to omit it.

Note that any performance impact only affects errored websocket connections, and that the current implementation has performance & stability implications for websocket clients today too. Calling destroy() now on the client side instead of cleanly closing means every ws websocket client has to keep the socket and the port it used reserved in a TIME_WAIT state until the the TCP timeout (1 minute, on Linux). In theory if you have a machine that makes many outbound websocket connections using ws and hits errors frequently (around 16,000 in a minute I think) then it can use up all its ephemeral ports and the entire machine will be completely unable to make TCP connections until the timeout finishes. Not good.

The WebSocket close handshake which isn't implemented here is designed to avoid this (by cleanly closing starting on the server side, where incoming connections will reuse ports).

The second part (changing the status code to 1006) is a big breaking change ... I also think that closing with 1006 is a bit odd.

To be clear, this part isn't a SHOULD - this is the definition of the close code of a websocket.

I don't know about other node implementations, but this is exactly what every browser does. It's how websockets work. Any implementation that doesn't do this is not following the spec.

I agree it's a breaking change, which isn't great. Right now though WS conflicts with the websockets spec and browser implementations here, and probably all other implementations too. I think the breaking impact is limited, since it only affects people using the 'close' event to detect specific errors from close codes - it doesn't affect normal happy-path usage at all.

in my opinion that info is way more valuable on the server than on the client ... Assuming that no error occurs in this case (like in most WebSocket implementations) that info is simply lost and the connection is closed with a generic 1006 status code.

I agree it's useful information on the server, but exposing remote peer errors in an error event instead means it's not lost. That seems like a perfectly good way to capture this info on the server side.

Browsers do expose close codes in this way, and do fire error events for these cases. I'm not sure about other implementations. Browsers only include basic details in the JS error for security reasons but the detailed information is logged to the console.

Putting the local close code in the close event meanwhile makes it ambiguous (you can't tell from a close event alone who sent the close frame) which can be very confusing. Note that the close event is always the remote peer's close code in every other case. If a ws server sends a close frame with code 4000, and the client replies with 4040, right now ws emits a close event on the server for 4040. That's totally valid behaviour, and that's correct result! The code should always be either the code from the remote peer, 1005 (cleanly closed with no code) or 1006 (abnormal close).

@lpinca
Copy link
Member

lpinca commented Jun 7, 2021

I'm not sure what this means, can you explain? Is there a way this can be abused?

The client might never call socket.end() after sending an invalid frame keeping the connection open indefinitely. Yes, this could also be done without triggering an error on the other peer but closing as soon as possible when an error occurs seems logical.

It's useful to look at this in analogy to HTTP. You can make the exact same argument that HTTP servers should not return error status codes, and should just close connections on invalid requests instead: it lets you close connections quicker, it's more efficient, and the connection might be broken by sufficiently invalid requests anyway. The spec text is similar, HTTP servers can silently close connections in response to errors.

Node.js core default behavior is to destroy the socket immediately when a parse error occurs. See https:/nodejs/node/blob/826566e78a4f6248254b7dd6fb85f37ea3978530/lib/_http_server.js#L682. The response was not even tried to be written until not so long ago.

I agree it's a breaking change, which isn't great. Right now though WS conflicts with the websockets spec and browser implementations here, and probably all other implementations too. I think the breaking impact is limited, since it only affects people using the 'close' event to detect specific errors from close codes - it doesn't affect normal happy-path usage at all.

True. The only non compliant bit is the close code when a Receiver error occurs which might be treated as a bug fix.

Browsers do expose close codes in this way, and do fire error events for these cases.

I just tried on Chrome and it's true. This is something new to me. These errors were all silently swallowed in browsers.

Note that the close event is always the remote peer's close code in every other case.

I know. Let me think about unwanted and unexpected consequences for a little longer. The fix is trivial.

lpinca added a commit that referenced this issue Jun 8, 2021
Instead of destroying the socket, try to close the connection cleanly
if an error (such as a data framing error) occurs after the opening
handshake has completed.

Also, to comply with the specification, use the 1006 status code if no
close frame is received, even if the connection is closed due to an
error.

Fixes #1898
lpinca added a commit that referenced this issue Jun 9, 2021
Instead of destroying the socket, try to close the connection cleanly
if an error (such as a data framing error) occurs after the opening
handshake has completed.

Also, to comply with the specification, use the 1006 status code if no
close frame is received, even if the connection is closed due to an
error.

Fixes #1898
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants