-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Named mutex bug on .NET using Mono runtime causing compile server OOM #57002
Comments
dotnet-issue-labeler
bot
added
Area-Compilers
untriaged
Issues and PRs which have not yet been triaged by a lead
labels
Oct 7, 2021
uweigand
added a commit
to uweigand/roslyn
that referenced
this issue
Oct 7, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002
uweigand
added a commit
to uweigand/roslyn
that referenced
this issue
Oct 7, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002
jaredpar
added
Bug
and removed
untriaged
Issues and PRs which have not yet been triaged by a lead
labels
Oct 7, 2021
uweigand
added a commit
to uweigand/roslyn
that referenced
this issue
Oct 8, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002
uweigand
added a commit
to uweigand/roslyn
that referenced
this issue
Oct 11, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002
uweigand
added a commit
to uweigand/roslyn
that referenced
this issue
Oct 11, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002
uweigand
added a commit
to uweigand/roslyn
that referenced
this issue
Oct 11, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002
jaredpar
pushed a commit
that referenced
this issue
Oct 20, 2021
* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes #57002
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Version Used: .NET 6 RC 1 (Visual-Studio-2022-Version-17.0-Preview-4)
This is running on the RC1 SDK & runtime built for the new linux-s390x architecture, using the Mono runtime
Steps to Reproduce:
Run a large compile task (e.g. building any of the runtime, roslyn, or aspnetcore projects) natively on Linux on s390x, on a system with at most 16 GB of memory.
Expected Behavior:
Compilation completes successfully.
Actual Behavior:
Compilation is aborted by the operating system out-of-memory killer.
Background and detailed analysis:
We are in the process of enabling .NET 6 on the IBM Z mainframe (linux-s390x architecture). This platform does not support the CoreCLR JIT at this time, so we are using the Mono JIT to run everything, including the roslyn compiler. In general, this is working well (e.g. we can bootstrap the whole SDK & runtime building natively on s390x). However, recent test runs on smaller machines exposed this OOM problem.
The immediate root cause for the OOM situation is that a large number of compile server processes (VBCSCompiler.dll) are started simultaneously, using the same pipename. This is the situation ("consuming excess resources") described in this comment in
RunServer
(src/Compilers/Server/VBCSCompiler/BuildServerController.cs
):As the comment states, the problem is supposed to be prevented via a C# named mutex that ensures only a single compile server is being started. Unfortunately, when using .NET with the Mono runtime, named mutexes are not fully implemented and only ensure mutual exclusion within the same process, not across multiple processes (see dotnet/runtime#48720). This is the same issue that is also present when running on a full Mono installation instead of on .NET.
The roslyn code already contains a workaround for that problem, originally added here: #31497, which will use file-based locking instead of named mutexes to guarantee mutual exclusion when running on Mono.
Unfortunately, this workaround does not actually help in our scenario - we are seeing multiple issues with the current implementation, which I'll describe here. I also have implemented a patch that fixes those problems for us; I'll post a PR shortly, and this issue is also intended to document the choices made in that PR.
ServerFileMutexPair
is guarded via a checkPlatformInformation.IsRunningOnMono
, which is only true when running on an actual Mono installation, it is false when running on .NET 5 or later using the Mono runtime.I've created a different test that is true whenever the Mono runtime is used (matches a similar test in arcade).
Then, when we actually get to use
ServerFileMutexPair
, we experience several other problems with its implementation:The constructor uses a
FileStream
constructor withFileShare.None
. This means that if any other process or thread holds the mutex, the constructor will throw anIOException
because theFileShare.None
prevents opening the lock file. This is not useful - normally, we'd then want to useTryLock
to try and wait until the mutex becomes available, but we don't even get there. (I'm not sure why the current implementation worked originally on Mono - maybeFileShare.None
was implemented differently there?)Assuming that problem is fixed somehow, then the actual
TryLock
implementation isn't fully safe: it uses process-based record locking (FileStream.Lock
), which only locks against other processes. But the mutex really needs to lock against other threads of the current process too, which the current implementation seems to simply ignore.Just a minor issue compared to the others, but the lock files are never cleaned up and just fill up the tmp directory over time ...
I've reimplemented the logic to avoid use of
FileStream.Lock
and use the presence of the file itself (protected viaFileShare.None
) as the actual mutex. This should resolve all three of the above problems (and as a side effect allows implementingWasOpen
via a simple check for file existence).Finally, when all this is implemented we run into issues in the test suite:
Several tests create the server mutex themselves out of band to simulate a server already running. These tests currently hard-code a named mutex - they should be using the roslyn abstraction (either named mutex or
ServerFileMutexPair
) themselves. That was straightforward to fix.Finally, there was a subtle race condition in the
ServerFailsWithLongTempPathUnix
test case, which tries to run the following scenario:TMPDIR
to some directory with a long nameTMPDIR
But due to -what looks to me- incorrect use of
async
, it could happen that resettingTMPDIR
actually was done already before starting the compile job, which has the effect that starting the server and running the job were using two different settings ofTMPDIR
. This means that theFileMutex
lock files were in two different places, and we didn't actually have mutual exclusion any more. Fixed now by using an extraawait
- if I missed a more elegant solution here, I'd appreciate any suggestion!Please let me know if the new locking approach looks reasonable - I'd be happy to work on alternative solutions as well if that would be preferable.
FYI @omajid @tmds @nealef
FYI @steveisok @directhex @crummel @leecow
The text was updated successfully, but these errors were encountered: