Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure tracing/eventpipe/pauseonstart/pauseonstart/pauseonstart.sh #58781

Closed
VincentBu opened this issue Sep 8, 2021 · 8 comments
Closed
Labels
arch-x64 area-Tracing-mono os-mac-os-x macOS aka OSX untriaged New issue has not been triaged by the area owner

Comments

@VincentBu
Copy link
Contributor

Run: runtime 20210907.81

Failed test:

Mono OSX x64 Release no_tiered_compilation @ OSX.1014.Amd64.Open

- tracing/eventpipe/pauseonstart/pauseonstart/pauseonstart.sh

Error message:

cmdLine:/private/tmp/helix/working/BAFF09A3/w/B0FC09AD/e/tracing/eventpipe/pauseonstart/pauseonstart/pauseonstart.sh Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 9/7/2021 7:24:45 PM, end: 9/7/2021 7:34:45 PM)

Return code:      -100
Raw output file:      /tmp/helix/working/BAFF09A3/w/B0FC09AD/uploads/Reports/tracing.eventpipe/pauseonstart/pauseonstart/pauseonstart.output.txt
Raw output:
BEGIN EXECUTION
/tmp/helix/working/BAFF09A3/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false pauseonstart.dll ''
0.0s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-57418-1631043838-socket
0.1s: Deleted
0.1s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-68606-1631053131-socket
0.1s: Deleted
0.1s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-57649-1631044584-socket
0.1s: Deleted
0.1s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-68274-1631052104-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-68228-1631052009-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-63645-1631049594-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-62767-1631049297-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-66300-1631050566-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-66218-1631050419-socket
0.4s: Deleted
0.4s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-68382-1631052402-socket
0.4s: Deleted
0.4s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-63605-1631049556-socket
0.4s: Deleted
0.4s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-63565-1631049523-socket
0.5s: Deleted
0.5s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-59947-1631045751-socket
0.5s: Deleted
0.5s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-57689-1631044618-socket
0.6s: Deleted
0.6s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-57568-1631044485-socket
0.6s: Deleted
0.6s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-57265-1631043488-socket
0.6s: Deleted
0.6s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-58314-1631044905-socket
0.7s: Deleted
0.7s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-62678-1631048835-socket
0.7s: Deleted
0.7s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-61227-1631047264-socket
0.7s: Deleted
0.7s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-61864-1631048493-socket
0.8s: Deleted
0.8s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-57458-1631043890-socket
0.8s: Deleted
0.8s: Attempting to delete the zombied pipe: /var/folders/cj/wlyt50_x3jn0f45vndj7ffs40000gy/T/dotnet-diagnostic-67770-1631051913-socket
0.8s: Deleted
0.8s: Attempting to delete the zombied pipe: /v


Stack trace
   at tracing_eventpipe._pauseonstart_pauseonstart_pauseonstart_._pauseonstart_pauseonstart_pauseonstart_sh()
@dotnet-issue-labeler dotnet-issue-labeler bot added area-Tracing-coreclr untriaged New issue has not been triaged by the area owner labels Sep 8, 2021
@tommcdon
Copy link
Member

tommcdon commented Sep 8, 2021

@josalem @lateralusX

@lateralusX
Copy link
Member

Looks like that test run had failures in CoreCLR Windows arm64 as well:

System.InvalidOperationException: Collection was modified after the enumerator was instantiated.
   at System.Collections.Generic.Stack`1.Enumerator.MoveNext()
   at Xunit.Sdk.DisposalTracker.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\DisposalTracker.cs:line 26
   at Xunit.Sdk.ExtensibilityPointFactory.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\ExtensibilityPointFactory.cs:line 53
   at Xunit.Sdk.TestFramework.Dispose() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\TestFramework.cs:line 46
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__127_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

Couldn't get more info on CI regarding Mono failure in pauseonstart.sh, will try to stress it locally to see if something shows up. Since this happens in main and infrequent (only occurrence?), I suggest moving it to .net7 but continue keep eye's on it. I report back results on local repro attempts as well.

@lateralusX
Copy link
Member

Have run 1000 local runs on OSX amd64 without repro. Will run some more.

@josalem
Copy link
Contributor

josalem commented Sep 9, 2021

That coreclr failure is a known issue with XUnit in our tests #11063.

@lateralusX
Copy link
Member

Repros locally, I was missing that this was interpreter mode, so did initial runs with JIT, will investigate.

@lateralusX
Copy link
Member

lateralusX commented Sep 10, 2021

Identified deadlock between finalizer thread and rundown enumerating all interpreter method. Since rundown will query for method name in callback when iterating interpreter methods, interp_jit_info_foreach, that might lead to additional loader activity. The hash map in interpreter keeping the methods is locked with default JIT memory manager, but since the callback might end up in mono_class_create_from_typedef that will take loader lock, we get the following lock order on that code path, memory manager->loader lock. Finalizer thread invokes OnThreadExiting using interpreter and that might end up with a reverse lock order, loader lock->memory manager on that code path, so these two have a potential to deadlock. This is not a problem under JIT or AOT since the JIT hash table is lock free therefore not causing any deadlocks due to lock order between memory manager and loader lock.

Will investigate fix, but changing into a lock free hash table in interpreters for interp_code_hash might be to risky at this point. A more safe fix is to take a copy of the pointers while holding lock and then iterate using local copy (simple array of pointers). Since this method is only called during rundown, only when using interpreter, and only include the pointers (InterpMethod *) we use with the callback, it will have some temporary memory impact (allocating an array of pointers), but will mitigate the deadlock since we can safely call iterator callback without holding the lock. It will also improve interpreter performance in situations where we run session rundown, since lock will be held a much shorter amount of time.

@lateralusX
Copy link
Member

lateralusX commented Sep 10, 2021

Have pending fix that seems to resolve the deadlock. Will need to run it on mobile with EventPipe sample profile as well to make sure we still get all info from rundown when running interpreter and that stackwalks looks resonable.

@lateralusX
Copy link
Member

Should be fixed by #58996.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-Tracing-mono os-mac-os-x macOS aka OSX untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

4 participants