
gcenv.interlocked's Interlocked functions use full memory barriers even with ARMv8.1 atomics #67824

Closed
EgorBo opened this issue Apr 10, 2022 · 12 comments


EgorBo commented Apr 10, 2022

All Interlocked functions from gcenv.interlocked.inl seem to insert full memory barriers. For example, consider this usage (the GC appears to use these heavily under BACKGROUND_GC):

void GCHeap::SetSuspensionPending(bool fSuspensionPending)
{
    if (fSuspensionPending)
    {
        Interlocked::Increment(&g_fSuspensionPending);
    }
    else
    {
        Interlocked::Decrement(&g_fSuspensionPending);
    }
}

Disassembled on Apple M1:
(disassembly screenshot)

Interlocked is implemented like this:

template <typename T>
__forceinline T Interlocked::Increment(T volatile *addend)
{
#ifdef _MSC_VER
    static_assert(sizeof(long) == sizeof(T), "Size of long must be the same as size of T");
    return _InterlockedIncrement((long*)addend);
#else
    T result = __sync_add_and_fetch(addend, 1);
    ArmInterlockedOperationBarrier();
    return result;
#endif
}

see godbolt: https://godbolt.org/z/3jPx3Mz14

From my understanding, we don't need these full memory barriers when we use ARMv8.1 atomics; the same applies to all Interlocked functions in gcenv.interlocked.inl.
These barriers are only needed on ARMv8.0, where C++ compilers lower the builtin atomic intrinsics without full-barrier semantics (see https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/)

The JIT does the same; e.g. for this C#:

void Foo(ref int x) => Interlocked.Increment(ref x);

it produces this on ARMv8.0:

        885FFC23          ldaxr   w3, [x1]
        11000462          add     w2, w3, #1
        8800FC22          stlxr   w0, w2, [x1]
        35FFFFA0          cbnz    w0, G_M24917_IG02
        D5033BBF          dmb     ish  ;; <----

and this on >=8.1:

        52800020          mov     w0, #1
        B8E00020          ldaddal w0, w0, [x1]

cc @VSadov

PS: Yes, when we compile CoreCLR for Apple M1 we unintentionally use -mcpu=apple-m1 and get all the new shiny instructions, e.g. ARMv8.3's ldapr, 8.1 atomics, etc. (well, makes sense 🤷‍♂️)

@EgorBo EgorBo added the tenet-performance Performance related issue label Apr 10, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 10, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.


ghost commented Apr 10, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.



EgorBo commented Apr 10, 2022

I removed these barriers and it lowered the number of dmb ish instructions in libcoreclr.dylib (GC included) from 1724 to 71.

Will test it on our TE infra


VSadov commented Apr 11, 2022

Right. If -mcpu=apple-m1 is guaranteed to use the 8.1 atomic instructions, there is no need for the extra memory barrier.

Also - I think GC has its own copy of Interlocked, separate from what runtime uses. I wonder if Interlocked in the runtime could use the same change.


Maoni0 commented Apr 11, 2022

Thank you very much for noticing this @EgorBo! How stressful are the tests run with the TE infra (I thought they were mostly small perf tests)? We do have a stress infra for GC.


EgorBo commented Apr 11, 2022

> Thank you very much for noticing this @EgorBo! How stressful are the tests run with the TE infra (I thought they were mostly small perf tests)? We do have a stress infra for GC.

Sure, will try the stress infra, thanks!

@janvorli

Is this something that we still want to change for 7.0?


EgorBo commented Jun 15, 2022

I know that @kunalspathak has made some progress on a related issue, and I personally will probably just remove that explicit memory barrier, at least for OSX-arm64.

@janvorli janvorli removed the untriaged New issue has not been triaged by the area owner label Jun 15, 2022
@janvorli janvorli added this to the 7.0.0 milestone Jun 15, 2022
@kunalspathak

That's right. I am currently working on a prototype to have this removed and hoping to get it into 7.0.

@kunalspathak

@Maoni0 - Do we want to extend #71512 to GC code? When I tried something similar for Windows/arm64 in #71169, we didn't see much benefit, and hence didn't optimize windows/arm64.


Maoni0 commented Jul 8, 2022

Thanks @kunalspathak. Unless you believe the Linux scenario is different (i.e. the full barrier we are doing is somehow a lot more expensive on Linux than on Windows), I would say let's not do this for Linux either.

@kunalspathak

Sounds good. I will close this issue then.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 8, 2022