Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Named mutex bug on .NET using Mono runtime causing compile server OOM #57002

Closed
uweigand opened this issue Oct 7, 2021 · 0 comments · Fixed by #57003
Closed

Named mutex bug on .NET using Mono runtime causing compile server OOM #57002

uweigand opened this issue Oct 7, 2021 · 0 comments · Fixed by #57003
Assignees
Milestone

Comments

@uweigand
Copy link
Contributor

uweigand commented Oct 7, 2021

Version Used: .NET 6 RC 1 (Visual-Studio-2022-Version-17.0-Preview-4)
This is running on the RC1 SDK & runtime built for the new linux-s390x architecture, using the Mono runtime

Steps to Reproduce:

Run a large compile task (e.g. building any of the runtime, roslyn, or aspnetcore projects) natively on Linux on s390x, on a system with at most 16 GB of memory.

Expected Behavior:

Compilation completes successfully.

Actual Behavior:

Compilation is aborted by the operating system out-of-memory killer.

Background and detailed analysis:

We are in the process of enabling .NET 6 on the IBM Z mainframe (linux-s390x architecture). This platform does not support the CoreCLR JIT at this time, so we are using the Mono JIT to run everything, including the roslyn compiler. In general, this is working well (e.g. we can bootstrap the whole SDK & runtime building natively on s390x). However, recent test runs on smaller machines exposed this OOM problem.

The immediate root cause for the OOM situation is that a large number of compile server processes (VBCSCompiler.dll) are started simultaneously, using the same pipename. This is the situation ("consuming excess resources") described in this comment in RunServer (src/Compilers/Server/VBCSCompiler/BuildServerController.cs):

            // Grab the server mutex to prevent multiple servers from starting with the same
            // pipename and consuming excess resources. If someone else holds the mutex
            // exit immediately with a non-zero exit code

As the comment states, the problem is supposed to be prevented via a C# named mutex that ensures only a single compile server is being started. Unfortunately, when using .NET with the Mono runtime, named mutexes are not fully implemented and only ensure mutual exclusion within the same process, not across multiple processes (see dotnet/runtime#48720). This is the same issue that is also present when running on a full Mono installation instead of on .NET.

The roslyn code already contains a workaround for that problem, originally added here: #31497, which will use file-based locking instead of named mutexes to guarantee mutual exclusion when running on Mono.

Unfortunately, this workaround does not actually help in our scenario - we are seeing multiple issues with the current implementation, which I'll describe here. I also have implemented a patch that fixes those problems for us; I'll post a PR shortly, and this issue is also intended to document the choices made in that PR.

  • First of all, use of ServerFileMutexPair is guarded via a check PlatformInformation.IsRunningOnMono, which is only true when running on an actual Mono installation, it is false when running on .NET 5 or later using the Mono runtime.

I've created a different test that is true whenever the Mono runtime is used (matches a similar test in arcade).

Then, when we actually get to use ServerFileMutexPair, we experience several other problems with its implementation:

  • The constructor uses a FileStream constructor with FileShare.None. This means that if any other process or thread holds the mutex, the constructor will throw an IOException because the FileShare.None prevents opening the lock file. This is not useful - normally, we'd then want to use TryLock to try and wait until the mutex becomes available, but we don't even get there. (I'm not sure why the current implementation worked originally on Mono - maybe FileShare.None was implemented differently there?)

  • Assuming that problem is fixed somehow, then the actual TryLock implementation isn't fully safe: it uses process-based record locking (FileStream.Lock), which only locks against other processes. But the mutex really needs to lock against other threads of the current process too, which the current implementation seems to simply ignore.

  • Just a minor issue compared to the others, but the lock files are never cleaned up and just fill up the tmp directory over time ...

I've reimplemented the logic to avoid use of FileStream.Lock and use the presence of the file itself (protected via FileShare.None) as the actual mutex. This should resolve all three of the above problems (and as a side effect allows implementing WasOpen via a simple check for file existence).

Finally, when all this is implemented we run into issues in the test suite:

  • Several tests create the server mutex themselves out of band to simulate a server already running. These tests currently hard-code a named mutex - they should be using the roslyn abstraction (either named mutex or ServerFileMutexPair) themselves. That was straightforward to fix.

  • Finally, there was a subtle race condition in the ServerFailsWithLongTempPathUnix test case, which tries to run the following scenario:

    • set TMPDIR to some directory with a long name
    • start the server
    • run a compile job against the server
    • reset TMPDIR

    But due to -what looks to me- incorrect use of async, it could happen that resetting TMPDIR actually was done already before starting the compile job, which has the effect that starting the server and running the job were using two different settings of TMPDIR. This means that the FileMutex lock files were in two different places, and we didn't actually have mutual exclusion any more. Fixed now by using an extra await - if I missed a more elegant solution here, I'd appreciate any suggestion!

Please let me know if the new locking approach looks reasonable - I'd be happy to work on alternative solutions as well if that would be preferable.

FYI @omajid @tmds @nealef
FYI @steveisok @directhex @crummel @leecow

@dotnet-issue-labeler dotnet-issue-labeler bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Oct 7, 2021
uweigand added a commit to uweigand/roslyn that referenced this issue Oct 7, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
uweigand added a commit to uweigand/roslyn that referenced this issue Oct 7, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
@jaredpar jaredpar self-assigned this Oct 7, 2021
@jaredpar jaredpar added Bug and removed untriaged Issues and PRs which have not yet been triaged by a lead labels Oct 7, 2021
@jaredpar jaredpar modified the milestones: 17.0, 17.1 Oct 7, 2021
uweigand added a commit to uweigand/roslyn that referenced this issue Oct 8, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
uweigand added a commit to uweigand/roslyn that referenced this issue Oct 11, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
uweigand added a commit to uweigand/roslyn that referenced this issue Oct 11, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
uweigand added a commit to uweigand/roslyn that referenced this issue Oct 11, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
jaredpar pushed a commit that referenced this issue Oct 20, 2021
* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes #57002
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants