Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky parallel/test-tls-sni-option #26910

Closed
mscdex opened this issue Mar 25, 2019 · 12 comments
Closed

Investigate flaky parallel/test-tls-sni-option #26910

mscdex opened this issue Mar 25, 2019 · 12 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. macos Issues and PRs related to the macOS platform / OSX. tls Issues and PRs related to the tls subsystem.

Comments

@mscdex
Copy link
Contributor

mscdex commented Mar 25, 2019

  • Version: master
  • Platform: macos
  • Subsystem: tls

On osx1011 (test-macstadium-macos10.11-x64-1):

18:54:26 not ok 1968 parallel/test-tls-sni-option
18:54:26   ---
18:54:26   duration_ms: 0.663
18:54:26   severity: fail
18:54:26   exitcode: 1
18:54:26   stack: |-
18:54:26     events.js:171
18:54:26           throw er; // Unhandled 'error' event
18:54:26           ^
18:54:26     
18:54:26     Error: read ECONNRESET
18:54:26         at TLSWrap.onStreamRead (internal/stream_base_commons.js:182:27)
18:54:26     Emitted 'error' event at:
18:54:26         at emitErrorNT (internal/streams/destroy.js:91:8)
18:54:26         at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
18:54:26         at processTicksAndRejections (internal/process/task_queues.js:81:17)
@mscdex mscdex added tls Issues and PRs related to the tls subsystem. macos Issues and PRs related to the macOS platform / OSX. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Mar 25, 2019
@MylesBorins
Copy link
Contributor

I've been seeing this locally a bunch

@lpinca
Copy link
Member

lpinca commented Apr 1, 2019

The culprit seems to be this commit: 42dbaed.

Easily reproducible on my machine with:

./tools/test.py -J --repeat=1000 test/parallel/test-tls-sni-option.js

If --node-args=--tls-max-v1.2 is used I can no longer reproduce.

cc: @sam-github

@sam-github
Copy link
Contributor

@lpinca What kind of machine do you have? I can't repro with your test.py invocation, or by running:

while ./out/Release/node test/parallel/test-tls-sni-option.js; do; done;

three times in parallel, on linux x64. That said, this test looks like it depends on the order of completion of the handshakes on client relative to server, I am prepping a PR.

@sam-github
Copy link
Contributor

I read the top in more detail, it's on OS X, I''ll try on that.

@lpinca
Copy link
Member

lpinca commented Apr 19, 2019

Yes I can't reproduce on Linux, only on macOS.

@lpinca
Copy link
Member

lpinca commented Apr 19, 2019

The following patch fixes it but I'm not sure why.

diff --git a/test/parallel/test-tls-sni-option.js b/test/parallel/test-tls-sni-option.js
index 3a6a231b47..6ccf631169 100644
--- a/test/parallel/test-tls-sni-option.js
+++ b/test/parallel/test-tls-sni-option.js
@@ -115,7 +115,7 @@ let clientError;
 
 const server = tls.createServer(serverOptions, function(c) {
   serverResults.push({ sni: c.servername, authorized: c.authorized });
-  c.end();
+  // c.end();
 });
 
 server.on('tlsClientError', function(err) {
@@ -137,7 +137,8 @@ function startTest() {
         client.authorizationError &&
          (client.authorizationError === 'ERR_TLS_CERT_ALTNAME_INVALID'));
 
-      next();
+      client.on('close', next);
+      client.end();
     });
 
     client.on('error', function(err) {

refack added a commit to refack/node that referenced this issue Apr 21, 2019
PR-URL: nodejs#27225
Refs: nodejs#26910
Refs: nodejs#27219
Refs: nodejs#26938
Refs: nodejs#23089
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yongsheng Zhang <[email protected]>
@sam-github
Copy link
Contributor

Without injecting some tracing into the TLS code, I don't know exactly why that is happening, but its not surprising. TLS1.3 continues to exchange some packages after handshake, and handshake completes with different timings, so for some extra messages to arrive after the socket is closed, but for it to be racy, doesn't surprise me. With your change, the close goes from client, to server, then back to client, emits close, and then it's done, so the roundtrip occurs. I fixed bugs like this in other tests, but this one being only a bit racy I never noticed.

@lpinca
Copy link
Member

lpinca commented Apr 23, 2019

Yes but isn't it strange that the race only occurs (or is exacerbated) on macOS?

@sam-github
Copy link
Contributor

Systems are different. localhost TCP on Linux doesn't actually do a full TCP protocol, for example, it knows packet loss is impossible because its all in the local host's memory. Also, after a write() on TCP, which process gets scheduled depends on the OS. TLS connections that exchange no data and immediately do an ungraceful close before exchanging data are racy, and our unit tests do a lot of them, but real-world applications establish TLS connections to exchange data. The OpenSSL devs aren't happy that we are exchanging TLS data in the handshake complete callback, that's an issue, too, and perhaps related to this, because it causes a few more packets to be exchanged before the TLS close alert.

I could reintroduce my packet tracing to see exactly what packet is being sent, but that will take a bit to PR, and I'm not sure what the API should look like. I could wireshark, but since the link is encrypted and we lack the feature of dumping the master secret, I won't see the actual packet. I've been meaning to PR an improvement to that, too. Both are open feature requests.

Since I made a number of fixes like this, but with tests that failed more like 1/3 of the time, I'm OK with just doing a graceful TLS end. If we want to wait until TLS debugging features land, that's OK with me, too.

@lpinca
Copy link
Member

lpinca commented Apr 23, 2019

I'm fine with anything as long as the test is not changed from its original intent.
Another possible way to "fix" it without closing from the client is to add a listener for the 'error' event on socket assuming that the error is sometimes expected as per discussion.

@sam-github
Copy link
Contributor

The purpose of the test was to check the assertions related to SNI, which isn't affected by how the connection is destroyed.

@sam-github
Copy link
Contributor

I reworked my refactor in #27300 so that it uses the same close sequence as original, and the test appears to be stable.

@lpinca lpinca closed this as completed Apr 26, 2019
MylesBorins pushed a commit that referenced this issue May 16, 2019
PR-URL: #27225
Refs: #26910
Refs: #27219
Refs: #26938
Refs: #23089
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yongsheng Zhang <[email protected]>
MylesBorins pushed a commit that referenced this issue May 16, 2019
PR-URL: #27225
Refs: #26910
Refs: #27219
Refs: #26938
Refs: #23089
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yongsheng Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. macos Issues and PRs related to the macOS platform / OSX. tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants