child_process: emit .send() errors in next tick #4277

Closed
wants to merge 1 commit

Conversation

phillipj
Member

Triaging a couple of FIXMEs from @bnoordhuis in internal/child_process.

Should this be considered a fix for the next major release, as it changes error emission from synchronous to asynchronous in the next tick?
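
For anyone weighing the semver question, here's an illustration with a plain EventEmitter (not the actual child_process internals) of why deferring the emit is observable to existing callers:

'use strict';
// Illustration only: a synchronous 'error' emit requires the listener to be
// attached before the call, while a deferred emit does not.
const EventEmitter = require('events');

function sendSync(ee, err) {
  // Synchronous emit: a listener must already be attached or Node throws.
  ee.emit('error', err);
}

function sendAsync(ee, err) {
  // Deferred emit: a listener attached right after the call still catches it.
  process.nextTick(() => ee.emit('error', err));
}

const a = new EventEmitter();
// sendSync(a, new Error('boom'));  // would throw here: no 'error' listener yet
a.on('error', (err) => console.log('sync case listener:', err.message));

const b = new EventEmitter();
sendAsync(b, new Error('boom'));
b.on('error', (err) => console.log('deferred case caught:', err.message));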

@@ -510,7 +510,7 @@ function setupChannel(target, channel) {
       if (typeof callback === 'function') {
         process.nextTick(callback, ex);
       } else {
-        this.emit('error', ex); // FIXME(bnoordhuis) Defer to next tick.
+        process.nextTick(emitErrorNT, this, ex);

Member

Wondering if it's worthwhile to create a separate emitErrorNT function.

This ought to work:

process.nextTick(() => this.emit('error', ex));

Member Author

Sure, more explicit 👍 I saw this in a lot of other modules, so I thought it was the preferred pattern.

Member

I think we're starting to move in the direction of using arrow functions where possible to improve readability and, in some cases, performance. Either approach would work here, but the arrow function just seems more explicit.
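
For context, a runnable sketch of the two styles under discussion; the body of emitErrorNT shown here is an assumption about what such a helper would contain, not a quote from the PR:

'use strict';
const EventEmitter = require('events');

// Variant 1: a named helper plus process.nextTick's extra-argument passing
// (no closure allocated per call).
function emitErrorNT(self, err) {
  self.emit('error', err);
}

const target1 = new EventEmitter();
target1.on('error', (err) => console.log('variant 1:', err.message));
process.nextTick(emitErrorNT, target1, new Error('boom'));

// Variant 2: an inline arrow function closing over the values it needs.
const target2 = new EventEmitter();
target2.on('error', (err) => console.log('variant 2:', err.message));
process.nextTick(() => target2.emit('error', new Error('boom')));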

@mscdex mscdex added the child_process Issues and PRs related to the child_process subsystem. label Dec 14, 2015
@Fishrock123 Fishrock123 added the semver-major PRs that contain breaking changes and should be released in the next major version. label Dec 21, 2015
@Fishrock123
Contributor

Should this be considered a fix for the next major release as it changes from emitting errors synchronously, to async in the next tick?

Yep.

@@ -672,7 +672,6 @@ function setupChannel(target, channel) {
return control;
}


Contributor

Unnecessary whitespace change.

@trevnorris
Contributor

One comment but LGTM otherwise if CI is happy.

CI: https://ci.nodejs.org/job/node-test-pull-request/1095/

@phillipj
Member Author

Updated, thanks for reviewing!

CI didn't pass, and I couldn't find any hints in the Jenkins console output. Any other places to look for hints?

@trevnorris
Contributor

@phillipj Here's one of the errors (from https://ci.nodejs.org/job/node-test-binary-windows/416/RUN_SUBSET=0,VS_VERSION=vs2015,label=win10/tapTestReport/test.tap-15/)

  not ok 15 test-child-process-send-err.js
# c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win10\test\parallel\test-child-process-send-err.js:27
# channel.open(stream);
# ^
# 
# Error: EINVAL: invalid argument, uv_pipe_open
# at Error (native)
# at Object.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win10\test\parallel\test-child-process-send-err.js:27:9)
# at Module._compile (module.js:408:26)
# at Object.Module._extensions..js (module.js:415:10)
# at Module.load (module.js:354:32)
# at Function.Module._load (module.js:311:12)
# at Function.Module.runMain (module.js:440:10)
# at startup (node.js:141:18)
# at node.js:997:3

I checked previous CI runs and this wasn't failing in those, so I'm going to assume it's related, but I'm not sure why.

@phillipj
Member Author

Thanks @trevnorris. I'll get a win10 machine set up and see if I can reproduce that failure.

@phillipj
Member Author

phillipj commented Jan 7, 2016

Alright, so luckily there was no problem reproducing the Windows failure. uv_pipe_open in libuv/win/pipe.c is quite different from its libuv/unix/pipe.c equivalent.

Other than banging my head against the wall, I haven't made much progress. I've tried to figure out what exactly the uv_file parameter of uv_pipe_open() expects. In test-child-process-send-err.js I'm currently passing in a common.ArrayStream, which is obviously not a file handle. I have tried passing in an actual file handle, created with fs.openSync(), which sadly doesn't pass the asserts in win/pipe.c either.

Being a C/C++ rookie, combined with not knowing Visual Studio well, doesn't help either. I've got VS set up to build node from source, but I haven't figured out how to run test-child-process-send-err.js while setting a breakpoint in win/pipe.c, which is a breeze in Xcode.

Would appreciate any pointers in the right direction here! E.g. what I should pass as the uv_file argument, which is probably really simple, or how to get Visual Studio to behave :)

@jasnell
Member

jasnell commented Jan 15, 2016

@phillipj ... any luck on this so far? @trevnorris @bnoordhuis ... any ideas?

@phillipj
Member Author

@jasnell afraid not. Just pushed an update to clarify what I was trying to do with fs.openSync() to get hold of a file descriptor that would pass the assertions in libuv/win/pipe.c; no luck so far.

@saghul
Member

saghul commented Jan 15, 2016

What uv_pipe_open gets must be a valid pipe, or it won't pass the Windows handle duplication code IIRC.

@phillipj
Member Author

Thanks @saghul! I'll give it a shot

@phillipj
Member Author

Finally got the tests passing locally on Windows 🎆

After @saghul's comment and code in libuv hinting that a parent process has to exist on Windows for this to pass, I refactored the tests to actually get valid IPC channels created. Long story short, the tests now use child_process.fork() and run inside the children.
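
Roughly, the fork-based pattern described above looks like the following minimal sketch (illustrative only, not the actual test-child-process-send-err.js):

'use strict';
// A real IPC channel only exists inside a forked child, which satisfies the
// Windows-side uv_pipe_open() checks discussed above.
const assert = require('assert');
const { fork } = require('child_process');

if (process.argv[2] === 'child') {
  // Inside the child, fork() has set up IPC, so process.send() is available
  // and its failures surface asynchronously via the callback.
  process.send('hello', (err) => {
    assert.ifError(err);
    process.exit(0);
  });
} else {
  const child = fork(__filename, ['child']);
  child.on('message', (msg) => assert.strictEqual(msg, 'hello'));
  child.on('exit', (code) => assert.strictEqual(code, 0));
}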

Could you do a CI run again if it looks good, @trevnorris?

@trevnorris
Contributor

@phillipj
Copy link
Member Author

Linting errors, that's just embarrassing; sorry about that. Synced against master and fixed the style issues against the updated ruleset.

The ARM test which failed (https://ci.nodejs.org/job/node-test-binary-arm/680/RUN_SUBSET=4,nodes=pi1-raspbian-wheezy/tapTestReport/test.tap-55/) seems to have a flaky history; see #4055. @Trott, have you seen this test being flaky after your PR got merged?

@Trott
Member

Trott commented Jan 21, 2016

@phillipj No recent failures that I'm aware of. I'm going to go back and look at the change that test was written for and see if there's a way to test it that doesn't involve the "if it takes longer than one second, the bug is there" approach, which is inherently flaky.

@Trott
Member

Trott commented Jan 21, 2016

@phillipj Honestly, I think that test isn't actually valuable and should probably just be removed. But I'll take that up in a separate issue and/or PR.

Trott added a commit to Trott/io.js that referenced this pull request Jan 21, 2016
test-http-exit-delay has a flaky history. Examination of the bug it was
written to find suggests that the test should be removed.

* The test is trying to find a delay of up to 1 second, but the delay
may also be much smaller than that. So the test will not catch the bug
all the time. It is therefore flaky when the bug exists.

* Experience has shown that the test is flaky as well when the bug is
absent in the code because it can sometimes take slower devices (such as
the Raspberry Pi devices that we have in continuous integration) more
than the allowed one second to run. Increasing the timeout for those
devices will make the test pass but will also mean that the test isn't
really testing anything because the delay it is trying to catch was a
delay of up to one second.

I don't think this is an appropriate test to run once on CI. If there is
to be a test for the issue in question, it should be a benchmark test
that is run a large number of times. We don't really have such tests in
CI yet.

I would argue that this test is actively a problem. It does not reliably
catch the issue it is supposed to catch, nor can it likely be made to do
so. (To do so would likely require converting it to a benchmarking test
as previously described. We don't run those in CI, at least not at this
time.)

Because this test will have both false positives and false negatives,
especially on the slower devices, it contributes to a culture of
dismissing failed tests. It does not reliably identify an issue nor does
it reliably pass on a working code base. This test should be removed.

Ref: nodejs#4277
@phillipj
Member Author

Thanks for opening a PR deleting the test at lightning speed, @Trott 👍

Trott added a commit that referenced this pull request Jan 22, 2016
Ref: #4277
PR-URL: #4786
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Brian White <[email protected]>
MylesBorins pushed a commit that referenced this pull request Feb 17, 2016
MylesBorins pushed a commit that referenced this pull request Feb 18, 2016
@phillipj
Member Author

What's your best advice on triggering flaky tests, @Trott? I finally got a native Windows 10 setup, which made things a lot easier to work with, but I haven't been able to reproduce that previous cluster failure yet. So far I've repeated these two quite a lot by hand:

  1. python tools/test.py --mode=release message parallel sequential -J
  2. python tools/test.py --mode=release parallel/test-cluster* -J

but I'm sure there are better ways of doing it.

@Trott
Member

Trott commented Feb 19, 2016

@phillipj On Windows, I've used two different inelegant techniques:

  • Write a batch script to run the job over and over. I may have referred to http://stackoverflow.com/a/32152981/436641 or something like that. It's been a while and I am not Windows savvy.
  • Put 10 or 100 copies of the test in its own directory and edit vcbuild.bat so there's a command that runs the test in that directory 10 or 100 times. (If you do 10 copies of the test 10 times, you get 100 consecutive runs.)

It may be worthwhile to run a stress test on CI against your branch to confirm that the test is still failing in CI.
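
For what it's worth, the same kind of loop can also be written in Node itself; a rough sketch, where the test path and the run count are just placeholders, and the test runs under whichever node binary executes the script:

'use strict';
// Repeatedly run one test file and count non-zero exits.
const { spawnSync } = require('child_process');

const test = 'test/parallel/test-cluster-shared-leak.js'; // assumed path
let failures = 0;

for (let i = 1; i <= 100; i++) {
  const { status } = spawnSync(process.execPath, [test], { stdio: 'ignore' });
  if (status !== 0) {
    failures++;
    console.log(`run ${i}: failed with exit code ${status}`);
  }
}
console.log(`${failures} failure(s) in 100 runs`);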

@jasnell
Member

jasnell commented Feb 19, 2016

LGTM

@phillipj
Member Author

Thanks @Trott! I wasn't able to reproduce any failures with the script below :/ Would appreciate a CI run to see if it pops up again.

@echo off
setlocal enabledelayedexpansion

for /l %%x in (1, 1, 100) do (
   Release\node.exe test\parallel\test-cluster-shared-leak.js
   rem delayed expansion is needed here; a plain %errorlevel% would be
   rem expanded once when the block is parsed, not after each run
   echo Run #%%x: !errorlevel!
)

@Trott
Member

Trott commented Feb 22, 2016

@phillipj So far, 2 failures in 590 runs at https://ci.nodejs.org/job/node-stress-single-test/521/nodes=win10-1p/console. A key piece of info that might explain why you're not seeing the issue on your own machine: that's a single-processor host. The exact same thing on a multiprocessor host has 1580 runs so far with 0 failures.

Failures look like this:

not ok 1 test-cluster-shared-leak.js
# events.js:155
#       throw er; // Unhandled 'error' event
#       ^
# 
# Error: write EPIPE
#     at exports._errnoException (util.js:859:11)
#     at ChildProcess.target._send (internal/child_process.js:611:20)
#     at ChildProcess.target.send (internal/child_process.js:507:19)
#     at sendHelper (cluster.js:723:8)
#     at send (cluster.js:518:5)
#     at cluster.js:493:7
#     at SharedHandle.add (cluster.js:86:3)
#     at queryServer (cluster.js:485:12)
#     at Worker.onmessage (cluster.js:433:7)
#     at ChildProcess.<anonymous> (cluster.js:737:8)

I'm stress testing the code from your branch in this PR. I suppose I could/should run a stress test against master too. If the issue already exists there, then maybe we mark it as flaky (again, insert heavy sigh here) and move on. If the issue does not exist in master, then OMG, please don't land this. :-)

@phillipj
Member Author

@Trott nice! I don't have access to CI because of the security lockdown these days, but my fingers are crossed this is not introduced by this branch :)

If these failures don't show up on CI against master, I could of course try to reproduce on a single-CPU host, as long as you give me some hints about what CI stress testing actually means/performs.


@Trott
Member

Trott commented Feb 22, 2016

Now stress testing against master: https://ci.nodejs.org/job/node-stress-single-test/522/nodes=win10-1p/console

@Trott
Member

Trott commented Feb 22, 2016

@phillipj 0 failures on master branch so far. 1500 runs and counting. (I set it for 9999 total.)

In the meantime, I'm not sure how you might best emulate the stress test in your local environment. @joaocgreis may have some ideas or suggestions.

@Trott
Member

Trott commented Feb 22, 2016

6197 runs without a failure on current master. Either this PR makes the test flaky or something odd is going on (like there's an issue on one Azure host that isn't on another Azure host).

@phillipj
Member Author

Sigh, a tough nut to crack, this one. I'll have to assume it's this branch making the test flaky. I'll try to get my hands on a single-CPU machine and hopefully reproduce the failure.

@joaocgreis
Member

@phillipj the best way I've found so far is with a virtual machine. You can set the number of processors, and with VirtualBox you can even set a CPU cap while it's running. I've had a test running in a loop and made it fail and pass by setting the cap to 10% and back.

I could give you the exact script node-stress-single-test uses if CI weren't down, but it's pretty much a refined version of what you pasted above. The only difference is that it uses the test runner to run the test, something like python tools/test.py parallel/test-cluster-shared-leak (without test/ and .js).

@joaocgreis
Member

I confirmed on my Windows 10 VM (running single core, no cap) that this makes the test flaky. It is not a problem with the slaves.

@phillipj
Member Author

@joaocgreis much appreciated!

@Trott
Member

Trott commented Feb 23, 2016

@joaocgreis You confirmed that @phillipj's branch makes it flaky? If it's not too much trouble and if you didn't already do it, can you confirm that it's not flaky on master? Don't want @phillipj to have to do a bunch of work if it's not his code causing the issue...

@joaocgreis
Member

It is not flaky on master; it ran 10k times without failing. With this PR's head, it failed twice in 1000 runs.

MylesBorins pushed a commit that referenced this pull request Mar 2, 2016
This changes the behaviour of error events emitted when process.send() fails: they are now emitted asynchronously instead of synchronously.
@phillipj
Member Author

I assume this PR should be closed in favor of #5251, which fixes a lot more of these inconsistent emits. Though as I mentioned in #5251 (comment), it would be nice to ensure this exact change doesn't merge flakiness into master.

Capping the CPU as @joaocgreis mentioned worked like a charm. It was easy to reproduce while in sync with master around the 21st of January, with 922d538 as the parent commit. Being in sync with master today seems to fix the issue though, as I've not been able to reproduce it when rebasing on top of master.

I've updated the PR and would appreciate a new stress test to verify it's okay now, as a means of removing the uncertainty from #5251. @Trott would you be so kind?

@Trott
Member

Trott commented Mar 11, 2016

@phillipj Are you asking for a stress test for this PR on Windows?

@phillipj
Member Author

@Trott Yes. On second thought, I could start the stress test myself now. I'll bug you on IRC for build details if that's okay.

@phillipj
Member Author

Stress test passed: https://ci.nodejs.org/job/node-stress-single-test/nodes=win10-1p/553/

Closing this in favor of #5251. Will reopen if that PR doesn't land.

@phillipj phillipj closed this Mar 12, 2016
@thefourtheye
Contributor

@phillipj Cool. Thanks for confirming that this indeed works as expected :-)

scovetta pushed a commit to scovetta/node that referenced this pull request Apr 2, 2016