Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e pool-load-success test hanging on close #3063

Closed
philrz opened this issue May 2, 2024 · 4 comments · Fixed by #3069
Closed

e2e pool-load-success test hanging on close #3063

philrz opened this issue May 2, 2024 · 4 comments · Fixed by #3069
Assignees
Labels
test Creating/improving test automation

Comments

@philrz
Copy link
Contributor

philrz commented May 2, 2024

Repro is with Zui commit ca047c0.

When working with Zui since the merge of #3061, I noticed one of the e2e tests failing that hadn't been before. How the failure looks:

  ✘  2 pool-load-success.spec.ts:27:7 › Pool Loads (successes) › load more data into the pool (2.5s)
  1) pool-load-success.spec.ts:27:7 › Pool Loads (successes) › load more data into the pool ────────
    "afterAll" hook timeout of 30000ms exceeded.
      10 |   });
      11 |
    > 12 |   test.afterAll(async () => {
         |        ^
      13 |     await app.shutdown();
      14 |   });
      15 |
        at /Users/phil/work/zui/packages/zui-player/tests/pool-load-success.spec.ts:12:8
  1 failed
    pool-load-success.spec.ts:27:7 › Pool Loads (successes) › load more data into the pool ─────────

I just did a controlled set of repros and it does indeed seem like the failures are correlated with those changes. I used these repro steps:

$ SUCCESSES=0; FAILURES=0; while true; do   yarn e2e -g pool-load-success.spec.ts --skip-nx-cache; if [ $? -eq 0 ] ; then SUCCESSES=$(expr $SUCCESSES + 1); else FAILURES=$(expr $FAILURES + 1); fi; echo "Successful test runs=$SUCCESSES, failed test runs=$FAILURES"; done

Running at commit 59c4c4a that was right before the changes in #3061 produced 53 successful runs and 0 failures before I stopped it. At commit ca047c0 for the merge of #3061 it produced 46 successful runs and 10 failures.

@philrz philrz added the test Creating/improving test automation label May 2, 2024
@philrz philrz self-assigned this May 2, 2024
@philrz
Copy link
Contributor Author

philrz commented May 3, 2024

In an attempt to divide & conquer, I made a branch zui-3063-electron-only which was started at Zui commit 59c4c4a right before the changes in #3061 merged to main. In that branch I made just the changes from #3061 to advance Electron to 30.0.1 and .node_version to 20.11.0 but did not make any of the changes to use Node native Fetch in lieu of node-fetch. Using the same repro loop described above, with this branch I saw 120 successful runs of pool-load-success.spec.ts and 0 failures before I stopped it. That seems to indicate that the problems captured in this issue are somehow specific to the Node native Fetch and/or how we use it.

@philrz
Copy link
Contributor Author

philrz commented May 7, 2024

I studied the problem a bit at the packet layer, catching successful and failed test runs in attached files success.pcap.gz and failure.pcap.gz, respectively. Since the problem seems to happen right as the app is being closed at the end of the test I paid particularly close attention to connection closure activity in the pcaps by applying Wireshark filter tcp.flags.fin == 1 || tcp.flags.reset == 1.

The first closure activity always comes from the Chromium user agent connection that initiates a call to the Zed lake service /version endpoint and later polls the /events endpoint. The FIN for that connection is at packet 596 in the success case and packet 521 for the failure case. What we see shortly after this is that eight Node-based HTTP connections each see FINs incoming from the Zed lake service side. In the success case, the client promptly responds and the connections are closed, e.g., starting with packet 604 here:

image

In the failure case (starting with packet 531 here) the client does not respond immediately with its own FIN. This results in the server side sending TCP keep-alives every 15 seconds until the 60-second mark, at which time things finally shut down presumably because Playwright does a hard kill on the client due to the test timing out.

image

This led me to guess that the root cause might be a timing issue related to one or more of these connections still having final bits of work still wrapping up right when the test starts trying to gracefully close things down. To test this theory, I did a sets of looped runs with a branch zui-3063-sleep where I could add a sleep timer at the end of the failing e2e test before the shutdown portion begins. In a baseline loop of 100 iterations without the sleep timer, I saw 83 successful runs and 17 failed runs. Adding a 5-second sleep, that went to 98 successful runs and 2 failed runs. Then at a 10-second sleep, I saw 400+ successful runs and zero failed runs before I stopped the test. This does seem to indicate there could be something to that theory.

@philrz
Copy link
Contributor Author

philrz commented May 7, 2024

The next round of debug was to add console.log() messages at various points in dependencies. Ultimately it was this change in node_modules/playwright-core/lib/server/electron/electron.js that showed us what we needed to see:

*** electron.js	Tue May  7 12:27:10 2024
--- /tmp/electron.js	Tue May  7 12:25:57 2024
***************
*** 196,202 ****
          args: electronArguments,
          env,
          log: message => {
-           console.log('xxx I am in playwright-core/lib/server/electron/electron.js with message: ' + message);
            progress.log(message);
            browserLogsCollector.log(message);
          },
--- 196,201 ----

In a successful run we see:

xxx I am in playwright-core/lib/server/electron/electron.js with message: <launching> /Users/phil/work/zui/node_modules/electron/dist/Electron.app/Contents/MacOS/Electron -r /Users/phil/work/zui/node_modules/playwright-core/lib/server/electron/loader.js --inspect=0 --remote-debugging-port=0 --user-data-dir=/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1 ../../apps/zui
xxx I am in playwright-core/lib/server/electron/electron.js with message: <launched> pid=80172
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] Debugger listening on ws://127.0.0.1:49287/77be6eb9-18d5-49b9-bebc-6088e07ce78b
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] For help, see: https://nodejs.org/en/docs/inspector
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] Debugger attached.
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out] 12:29:46.918 › booting main with: {
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   lakePort: 9867,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   lakeRoot: '/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1/lake',
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   lakeLogs: '/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1/logs',
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   lake: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   devtools: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   appState: '/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1/appState.json',
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   autosave: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   releaseNotes: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   autoUpdater: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   singleInstance: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out]   pluginsPath: '/Users/phil/work/zui/plugins'
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out] }
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out] 12:29:46.947 › initializers loaded
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] 
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] DevTools listening on ws://127.0.0.1:49289/devtools/browser/1fe3a064-e254-4381-a7bc-802ab926a664
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][out] 12:29:47.020 › Running Updater in Mode: startup
  ✓  1 pool-load-success.spec.ts:17:7 › Pool Loads (successes) › load data into a pool (4.0s)
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] Debugger ending on ws://127.0.0.1:49287/77be6eb9-18d5-49b9-bebc-6088e07ce78b
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] For help, see: https://nodejs.org/en/docs/inspector
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172][err] context canceled
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172] <process did exit: exitCode=0, signal=null>
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172] starting temporary directories cleanup
  ✓  2 pool-load-success.spec.ts:27:7 › Pool Loads (successes) › load more data into the pool (2.9s)
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80172] finished temporary directories cleanup

In a failed run:

xxx I am in playwright-core/lib/server/electron/electron.js with message: <launching> /Users/phil/work/zui/node_modules/electron/dist/Electron.app/Contents/MacOS/Electron -r /Users/phil/work/zui/node_modules/playwright-core/lib/server/electron/loader.js --inspect=0 --remote-debugging-port=0 --user-data-dir=/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1 ../../apps/zui
xxx I am in playwright-core/lib/server/electron/electron.js with message: <launched> pid=80292
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] Debugger listening on ws://127.0.0.1:49330/a50c67e4-e0ab-484a-b936-50fe384987a9
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] For help, see: https://nodejs.org/en/docs/inspector
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] Debugger attached.
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out] 12:30:15.117 › booting main with: {
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   lakePort: 9867,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   lakeRoot: '/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1/lake',
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   lakeLogs: '/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1/logs',
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   lake: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   devtools: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   appState: '/Users/phil/work/zui/packages/zui-player/run/playwright-itest/Pool Loads (succeeses)/1/appState.json',
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   autosave: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   releaseNotes: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   autoUpdater: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   singleInstance: true,
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out]   pluginsPath: '/Users/phil/work/zui/plugins'
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out] }
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out] 12:30:15.147 › initializers loaded
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] 
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] DevTools listening on ws://127.0.0.1:49332/devtools/browser/69241538-0bb7-4e5f-9a59-1a9909ecd718
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][out] 12:30:15.212 › Running Updater in Mode: startup
  ✓  1 pool-load-success.spec.ts:17:7 › Pool Loads (successes) › load data into a pool (3.8s)
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] Debugger ending on ws://127.0.0.1:49330/a50c67e4-e0ab-484a-b936-50fe384987a9
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] For help, see: https://nodejs.org/en/docs/inspector
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292][err] context canceled
  ✘  2 pool-load-success.spec.ts:27:7 › Pool Loads (successes) › load more data into the pool (2.9s)
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292] <gracefully close start>
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292] <kill>
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292] <will force kill>
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292] starting temporary directories cleanup
xxx I am in playwright-core/lib/server/electron/electron.js with message: [pid=80292] finished temporary directories cleanup

In contemplating the point of hang indicated by this, @jameskerr's web searches found nodejs/undici#2026 that looked like it might very well be the root cause of our problems. It showed up with the same transition we made to the newer Electron that uses the newer Node, and users in the issue also speak of it being an intermittent problem that goes away when they use node-fetch instead of the native fetch.

@philrz
Copy link
Contributor Author

philrz commented Jun 7, 2024

Verified on Zui commit 5fa9071.

I started this loop of the e2e test at this commit to check for failure rates.

$ SUCCESSES=0; FAILURES=0; while true; do   yarn e2e -g pool-load-success.spec.ts --skip-nx-cache; if [ $? -eq 0 ] ; then SUCCESSES=$(expr $SUCCESSES + 1); else FAILURES=$(expr $FAILURES + 1); fi; echo "Successful test runs=$SUCCESSES, failed test runs=$FAILURES"; done

It made it through 210 successful runs and 0 failures before I stopped the loop. Compared to the count of 46 successful runs and 10 failures we saw previously, this definitely indicates the problem has been addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Creating/improving test automation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant