Do not use full memory barrier for osx/arm64 #71026

Closed · wants to merge 1 commit
9 changes: 4 additions & 5 deletions src/coreclr/gc/env/gcenv.interlocked.inl
@@ -13,13 +13,12 @@
#ifndef _MSC_VER
__forceinline void Interlocked::ArmInterlockedOperationBarrier()
{
-#ifdef HOST_ARM64
+#if defined(HOST_ARM64) || defined(HOST_LOONGARCH64)
+#if !defined(HOST_OSX)
    // See PAL_ArmInterlockedOperationBarrier() in the PAL
    __sync_synchronize();
-#endif // HOST_ARM64
-#ifdef HOST_LOONGARCH64
-    __sync_synchronize();
-#endif //HOST_LOONGARCH64
+#endif // !HOST_OSX
+#endif // HOST_ARM64 || HOST_LOONGARCH64
}
#endif // !_MSC_VER

11 changes: 6 additions & 5 deletions src/coreclr/pal/inc/pal.h
@@ -3447,7 +3447,8 @@ BitScanReverse64(

FORCEINLINE void PAL_ArmInterlockedOperationBarrier()
{
-#ifdef HOST_ARM64
+#if defined(HOST_ARM64) || defined(HOST_LOONGARCH64)
+#if !defined(HOST_OSX)
    // On arm64, most of the __sync* functions generate a code sequence like:
    //     loop:
    //         ldaxr (load acquire exclusive)
@@ -3460,10 +3461,10 @@ FORCEINLINE void PAL_ArmInterlockedOperationBarrier()
    // require the load to occur after the store. This memory barrier should be used following a call to a __sync* function to
    // prevent that reordering. Code generated for arm32 includes a 'dmb' after 'cbnz', so no issue there at the moment.
    __sync_synchronize();
-#endif // HOST_ARM64
-#ifdef HOST_LOONGARCH64
-    __sync_synchronize();
-#endif
+#else
+    // For OSX Arm64, the default Arm architecture is v8.1 which uses atomic instructions that don't need a full barrier.
Review comments on this line:

Member:
Is the C/C++ compiler guaranteed to use the newer atomic instructions?

Member:
This PR should add an explicit flag for clang to use ARMv8.1, e.g. -mcpu=apple-m1. Currently it relies on my observation that by default Clang targets newer than ARMv8.0 on M1, but if Apple decides to change that default internally, we might end up in a situation where these compiler intrinsics are lowered to ARMv8.0 and, without the memory barrier, cause potentially non-reproducible race conditions somewhere in the VM.

jkotas (Member), Jun 21, 2022:
How does passing -mcpu=apple-m1 guarantee that the compiler will only ever use the new instructions?

Member (Author):
LLVM maps apple-m1 to ARMV8_5A, as seen in https://github.com/llvm/llvm-project/blob/5ba0a9571b3ee3bc76f65e16549012a440d5a0fb/llvm/include/llvm/Support/AArch64TargetParser.def#L256-L257. However, I think the concern is valid, and the foolproof way to address it is to check explicitly, the way it is done for the Windows counterpart in #70921. I am working on a PR that will add a similar check for linux-arm64 (reason stated in #70921 (comment)), so it should take care of these things for osx as well.
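
A minimal sketch of what such an explicit compile-time check could look like (illustrative only, not from this PR; it relies on the ACLE macro __ARM_FEATURE_ATOMICS, which compilers define when the target architecture provides the ARMv8.1 LSE atomic instructions):

// Hedged sketch, not part of the PR: turn a silent fallback to pre-LSE code
// generation into a build break instead of a missing memory barrier.
#if defined(HOST_OSX) && defined(HOST_ARM64) && !defined(__ARM_FEATURE_ATOMICS)
#error "osx-arm64 build expects ARMv8.1 LSE atomics; pass -mcpu=apple-m1 or -march=armv8.1-a"
#endif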

EgorBo (Member), Jun 21, 2022:
I think inline asm solves all problems here (might be tricky with templates)
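
A minimal sketch of the inline-asm idea (illustrative only, not taken from this PR or this thread; the helper name and the choice of CASAL are assumptions, and it requires an assembler that accepts ARMv8.1 LSE instructions):

#include <stdint.h>

// Illustrative only: a 64-bit compare-exchange written directly against the
// ARMv8.1 CASAL instruction (acquire + release semantics), so the emitted
// instruction does not depend on the compiler's -mcpu/-march default.
// CASAL compares *dst with 'expected'; if they match it stores 'desired'.
// Either way, the register holding 'expected' receives the value observed
// at *dst.
static inline int64_t CompareExchange64_Casal(int64_t volatile *dst,
                                              int64_t expected, int64_t desired)
{
    __asm__ __volatile__("casal %x0, %x2, [%1]"
                         : "+r"(expected)
                         : "r"(dst), "r"(desired)
                         : "memory");
    return expected;  // original value of *dst
}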

Member:
Alternatively we can write a small test that validates that the intrinsic is lowered into LSE 🤷

Member:
I do not think you can reliably test for this. For example, you may see the old instruction only when there is a certain addressing mode needed or only when the code is cold.

VSadov (Member), Jun 21, 2022:
Does it use casal in debug builds? Can it switch to the old LL/SC helper because of register pressure, or if the old implementation is one day found to be faster (it could be)?

It feels like inline asm could have more reliable guarantees.

+#endif // !HOST_OSX
+#endif// HOST_ARM64 || HOST_LOONGARCH64
}

/*++
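
To put the change in context, here is a hedged usage sketch (assumed for illustration, not copied from the PAL sources; the helper name is made up) of how a PAL interlocked helper pairs a __sync_* builtin with PAL_ArmInterlockedOperationBarrier, so the extra full barrier is only emitted on targets that still lower __sync_* to an LL/SC loop:

// Illustrative helper, names assumed: the __sync_* builtin performs the
// atomic read-modify-write, and the trailing barrier keeps later loads from
// being reordered before the exclusive store on pre-LSE arm64. After this
// change, the barrier expands to nothing on osx-arm64 (and on hosts that
// never needed it).
static inline int32_t Sample_InterlockedIncrement(int32_t volatile *addend)
{
    int32_t result = __sync_add_and_fetch(addend, 1);
    PAL_ArmInterlockedOperationBarrier();
    return result;
}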