
Potential infinite loop with canvas recording web worker #13743

Open
billyvg opened this issue Sep 20, 2024 · 14 comments

billyvg (Member) commented Sep 20, 2024

Seeing a potential cycle with the canvas replay web worker. The stack trace looks something like this:

sendBufferedReplayOrFlush
startRecording
...
getCanvasManager
new CanvasManager
initFPSWorker
new window.Worker

Then the part that seems to cycle:

sendBufferedReplayOrFlush
stopRecording
_stopRecording
...
???.reset (mutation buffer, I think?)
CanvasManager.reset
initFPSWorker

Customer says this generally happens after a user comes back to the tab from a long period of idling.
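
Purely as an illustration of the suspected shape of the problem (this is not the actual Sentry/rrweb source, just a simplified sketch), a reset path that re-creates the FPS worker without terminating the previous one would pile up workers like this:

// Simplified, hypothetical stand-in for CanvasManager -- for illustration only.
class CanvasManagerSketch {
  constructor() {
    this.worker = null;
    this.initFPSWorker();
  }

  initFPSWorker() {
    // Each call spins up a fresh dedicated worker from a blob URL.
    const blob = new Blob(['/* fps worker body */'], { type: 'application/javascript' });
    this.worker = new Worker(URL.createObjectURL(blob));
  }

  reset() {
    // If the previous worker is never terminated here (e.g. this.worker.terminate()),
    // every stop/flush/session-refresh cycle that reaches reset() leaks one more worker.
    this.initFPSWorker();
  }
}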

Zendesk ticket

trogau commented Sep 20, 2024

Hi folks, I reported this issue via support and they advised me to come here. Happy to provide more information as needed.

To explain in a bit more detail:

We have an internal application built on Node.js with Vue.js. We are running the Vue Sentry package, version 8.25.0 (which we realise is a couple of versions behind current).

This issue was reported by our internal users, whose Chrome tabs (latest Chrome version, running on Chromebooks with Intel i5s and 8GB of RAM) freeze when performing certain actions. Some of these actions trigger console errors, which might be contributing to the behaviour, but I'm not sure about this.

When looking at a frozen tab, there's not much we can diagnose - DevTools is locked up. We can see in Chrome Task Manager that there are many, many dedicated workers running under the frozen Chrome process, and memory usage seems significantly higher than normal.

The tab will remain frozen, with periodic dialogs from Chrome asking if we want to wait or exit. I think waiting does nothing but spin up more dedicated workers, though it's hard to tell: the machine is unwieldy by this point and there are so many workers it is hard to see what is going on. The only recovery is to close the tab.

We made a little Chrome extension that overrides the window.Worker constructor, just to see if we could identify the issue, and captured a stack trace whenever a worker was created. It showed something like the following, repeated over and over again:

2024-09-20T16:54:21+10:00 | https://app.example.com | [ip address redacted] | [16:54:20] Creating worker with script: blob:https://app.explorate.co/4899ff93-b770-4fa7-8345-bc6aaa98fa2d, Stack trace: Error
    at new <anonymous> (chrome-extension://egfiddhbdemmalmbdeockdnknmffohmg/injected.js:12:32)
    at Nrt.initFPSWorker (https://app.explorate.co/assets/index-dbcae719.js:750:9710)
    at Nrt.reset (https://app.explorate.co/assets/index-dbcae719.js:750:7943)
    at Xet.reset (https://app.explorate.co/assets/index-dbcae719.js:746:14920)
    at https://app.explorate.co/assets/index-dbcae719.js:746:27620
    at Array.forEach (<anonymous>)
    at https://app.explorate.co/assets/index-dbcae719.js:746:27607
    at https://app.explorate.co/assets/index-dbcae719.js:746:15369
    at https://app.explorate.co/assets/index-dbcae719.js:746:43554
    at Array.forEach (<anonymous>)
    at Fg._stopRecording (https://app.explorate.co/assets/index-dbcae719.js:746:43542)
    at Fg.stopRecording (https://app.explorate.co/assets/index-dbcae719.js:748:6110)
    at Fg.stop (https://app.explorate.co/assets/index-dbcae719.js:748:6417)
    at Fg._refreshSession (https://app.explorate.co/assets/index-dbcae719.js:748:9881)
    at Fg._checkSession (https://app.explorate.co/assets/index-dbcae719.js:748:9801)
    at Fg.checkAndHandleExpiredSession (https://app.explorate.co/assets/index-dbcae719.js:748:8242)
    at Fg._doChangeToForegroundTasks (https://app.explorate.co/assets/index-dbcae719.js:748:11563)
    at _handleWindowFocus (https://app.explorate.co/assets/index-dbcae719.js:748:11189)
    at r (https://app.explorate.co/assets/index-dbcae719.js:741:4773)
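
For reference, the override the extension injects is conceptually just a wrapper around window.Worker, along these lines (a sketch, not our exact injected.js; the logging endpoint is a placeholder):

// injected.js (sketch): wrap window.Worker so every worker creation is logged with a stack trace.
const OriginalWorker = window.Worker;

window.Worker = class extends OriginalWorker {
  constructor(scriptURL, options) {
    const stack = new Error().stack;
    try {
      // Placeholder collector endpoint -- we actually post to our own remote logging service.
      navigator.sendBeacon(
        'https://logs.example.com/worker-created',
        JSON.stringify({
          time: new Date().toISOString(),
          scriptURL: String(scriptURL),
          stack,
        })
      );
    } catch (e) {
      // Logging must never break worker creation.
    }
    super(scriptURL, options);
  }
};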

I was able to reproduce this on my machine by:

  1. Loading our application
  2. Moving to a different tab and/or leaving the PC for an hour or so
  3. Coming back to the application and resuming activity in the initial tab

Doing that would regularly trigger a burst of worker creation. On my more powerful laptop (i7 / 32GB) I triggered about 100 workers being created at once, though that didn't cause any noticeable performance issues.

My guess is that on the lower-spec machines, when a lot of workers are created the tab simply crawls to a halt and then crashes, and that a loop or race condition in the Sentry Replay code is triggering endless worker creation - either as a direct result of something weird in our code or just a random bug somewhere.

There are two things we have on our TODO to try here:

  1. Upgrade to the latest version of the Sentry/vue package
  2. Disable the canvas recording

Open to any other suggestions as well if it helps zero in on the issue.

billyvg (Member, Author) commented Sep 20, 2024

Thanks for the detailed description @trogau -- just want to clarify a few details:

  • what are your replay sample rates (session and onError)?
  • regarding your chrome extension when overriding the worker class: is it throwing an error and causing session replays to try to capture a new replay?
  • do the other stack traces also have _handleWindowFocus at the bottom of the trace?

trogau commented Sep 20, 2024

  1. Sample rates are 0.05 for session and 1.0 for errors.
  2. No, it doesn't throw an error when a worker is created; it only logs the event and sends it to our remote endpoint to capture the data.
  3. No, sorry - _handleWindowFocus actually only seems to show up in a couple of the most recent events from when I was doing some testing yesterday. The one below is more representative of what we're seeing:
2024-09-16T14:51:29+10:00 | https://app.example.com | [ip address redacted] | [14:51:28] Creating worker with script: blob:https://app.explorate.co/ae32033e-b5b8-4299-acf6-6173dde42e7f, Stack trace: Error
    at new window.Worker (chrome-extension://mfenbcgblaedimllfnpabdkgcbggfcml/injected.js:11:32)
    at Nrt.initFPSWorker (https://app.explorate.co/assets/index-da618cdf.js:750:9710)
    at Nrt.reset (https://app.explorate.co/assets/index-da618cdf.js:750:7943)
    at Xet.reset (https://app.explorate.co/assets/index-da618cdf.js:746:14920)
    at https://app.explorate.co/assets/index-da618cdf.js:746:27620
    at Array.forEach (<anonymous>)
    at https://app.explorate.co/assets/index-da618cdf.js:746:27607
    at https://app.explorate.co/assets/index-da618cdf.js:746:15369
    at https://app.explorate.co/assets/index-da618cdf.js:746:43554
    at Array.forEach (<anonymous>)
    at Fg._stopRecording (https://app.explorate.co/assets/index-da618cdf.js:746:43542)
    at Fg.stopRecording (https://app.explorate.co/assets/index-da618cdf.js:748:6110)
    at Fg.stop (https://app.explorate.co/assets/index-da618cdf.js:748:6417)
    at Fg._runFlush (https://app.explorate.co/assets/index-da618cdf.js:748:13600)

I should note: I have not yet captured a stack trace from an actual crash; we haven't had one in the few days the extension has been running and logging data. The events we've been capturing so far - which, again, show up to around ~100 workers being created, which doesn't seem like enough to cause a crash even on the Chromebooks - are happening relatively frequently though.

trogau commented Sep 23, 2024

We captured a stack trace from a freeze this morning, and it seems to confirm that mass creation of workers is what causes the problem. Attached is a log snippet showing 1008 workers created in ~3 seconds, which froze the browser tab. Not sure how helpful it is, but I thought I'd include it for reference.

log.txt

chargome (Member) commented
@trogau thanks for the insights - could you also specify which tasks you are running on the canvas? Is it a continuous animation or a static canvas? This might help in reproducing the issue.

trogau commented Sep 23, 2024

@chargome : I'm double checking with our team but AFAIK the pages where we're seeing this happen do not have any canvas elements at all. We do have /some/ pages with canvas (a MapBox map component) but this isn't loaded on the page where we're seeing the majority of these issues.

We do have Sentry.replayCanvasIntegration() being set in our Sentry.init() though.
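
For context, the relevant part of our Sentry.init() looks roughly like this (a sketch with a placeholder DSN; other options omitted):

import * as Sentry from '@sentry/vue';

Sentry.init({
  app, // Vue app instance
  dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0', // placeholder DSN
  integrations: [
    Sentry.replayIntegration(),
    // Removing this line is the "disable canvas recording" item from our TODO list above.
    Sentry.replayCanvasIntegration(),
  ],
  replaysSessionSampleRate: 0.05, // session sample rate mentioned earlier
  replaysOnErrorSampleRate: 1.0,  // error sample rate mentioned earlier
});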

trogau commented Sep 25, 2024

FYI we've upgraded to v8.31.0 and are still seeing large numbers of workers created (we just had one instance of 730 created in a few seconds - not enough to crash the tab, so the user didn't notice, but we can see it in the logging). The magic number seems to be about 1000 workers being enough to freeze the tab on these devices.

billyvg (Member, Author) commented Oct 2, 2024

@trogau Thanks for your help, I believe I've identified the issue here: #13855 -- can you try downgrading to 8.25.0 to see if that version is affected?

edit: Also, do you do anything custom with the replay integration (e.g. call replay.flush() somewhere)?

trogau commented Oct 4, 2024

Hi @billyvg - we don't do anything custom with the Replay integration - just set it up in init and that's it.

v8.25.0 is what we were using initially, and that definitely did have the problem. Happy to downgrade if there's something specific we can test, but I can confirm v8.25.0 was where we first experienced the issue.

Lms24 (Member) commented Oct 7, 2024

Hi, chiming in here quickly because our internal issue Slack bot is nagging us: @billyvg would you mind taking another look at this? Thanks!

billyvg (Member, Author) commented Oct 10, 2024

@trogau can you try out version 8.34.0 and see if that helps?

I'm still working on fixing some other unexpected behaviors with regard to session durations that may also be related.

billyvg self-assigned this Oct 15, 2024

trogau commented Oct 15, 2024

We've just deployed v8.34.0, so we'll track that and see how it goes.

trogau commented Oct 16, 2024

@billyvg : FYI we just had our first freeze on v8.34.0 - we can see it triggered ~1000 workers created in ~2 seconds, which crashed the machine.

billyvg (Member, Author) commented Oct 16, 2024

@trogau OK, can you try two things:

  • 8.35.0-beta.0
  • Add a beforeErrorSampling callback to the replayIntegration options and log the event to your backend. I want to verify this is the callsite that triggers sendBufferedReplayOrFlush:
replayIntegration({
  beforeErrorSampling: event => {
    // TODO: log to backend service
    return event;
  },
});
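
For example, one way to wire that up inside Sentry.init (the logging endpoint below is just a placeholder -- point it at whatever collector you already use):

Sentry.init({
  // ...existing options...
  integrations: [
    Sentry.replayIntegration({
      beforeErrorSampling: event => {
        // Fire-and-forget log so we can correlate error sampling with the worker-creation bursts.
        navigator.sendBeacon(
          'https://logs.example.com/replay-error-sampling', // placeholder endpoint
          JSON.stringify({ time: new Date().toISOString(), eventId: event.event_id })
        );
        return event;
      },
    }),
    Sentry.replayCanvasIntegration(),
  ],
});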
