
gcenv.interlocked's Interlocked functions use full memory barriers even with ARMv8.1 atomics #67824

Closed
EgorBo opened this issue Apr 10, 2022 · 12 comments


EgorBo commented Apr 10, 2022

All Interlocked functions from gcenv.interlocked.inl seem to insert full memory barriers. For example, consider this usage (the GC appears to use these heavily under BACKGROUND_GC):

void GCHeap::SetSuspensionPending(bool fSuspensionPending)
{
    if (fSuspensionPending)
    {
        Interlocked::Increment(&g_fSuspensionPending);
    }
    else
    {
        Interlocked::Decrement(&g_fSuspensionPending);
    }
}

Disassembled on Apple M1:
(disassembly screenshot)

Interlocked is implemented like this:

template <typename T>
__forceinline T Interlocked::Increment(T volatile *addend)
{
#ifdef _MSC_VER
    static_assert(sizeof(long) == sizeof(T), "Size of long must be the same as size of T");
    return _InterlockedIncrement((long*)addend);
#else
    T result = __sync_add_and_fetch(addend, 1);
    ArmInterlockedOperationBarrier();
    return result;
#endif
}

see godbolt: https://godbolt.org/z/3jPx3Mz14

From my understanding, we don't need these full memory barriers when we use ARMv8.1 atomics; the same applies to all Interlocked functions in gcenv.interlocked.inl.
These barriers are only needed on ARMv8.0, where C++ compilers lower the builtin atomic intrinsics without full-barrier semantics (see https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/)

The JIT does the same; e.g. for this C#:

void Foo(ref int x) => Interlocked.Increment(ref x);

it produces this on ARMv8.0:

        885FFC23          ldaxr   w3, [x1]
        11000462          add     w2, w3, #1
        8800FC22          stlxr   w0, w2, [x1]
        35FFFFA0          cbnz    w0, G_M24917_IG02
        D5033BBF          dmb     ish  ;; <----

and this on >=8.1:

        52800020          mov     w0, #1
        B8E00020          ldaddal w0, w0, [x1]

cc @VSadov

PS: Yes, when we compile CoreCLR for Apple M1 we unintentionally use -mcpu=apple-m1 and get all the new shiny instructions, e.g. ARMv8.3's ldapr, 8.1 atomics, etc. (well, makes sense 🤷‍♂️)

@EgorBo EgorBo added the tenet-performance Performance related issue label Apr 10, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 10, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.


ghost commented Apr 10, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.



EgorBo commented Apr 10, 2022

I removed these barriers and it lowered the number of dmb ish instructions in libcoreclr.dylib (GC included) from 1724 to 71.

Will test it on our TE infra


VSadov commented Apr 11, 2022

Right. If -mcpu=apple-m1 is guaranteed to use the 8.1 atomic instructions, there is no need for the extra memory barrier.

Also - I think GC has its own copy of Interlocked, separate from what runtime uses. I wonder if Interlocked in the runtime could use the same change.


Maoni0 commented Apr 11, 2022

Thank you very much for noticing this @EgorBo! How stressful are the tests run with the TE infra (I thought they were mostly small perf tests)? We do have a stress infra for GC.


EgorBo commented Apr 11, 2022

> Thank you very much for noticing this @EgorBo! How stressful are the tests run with the TE infra (I thought they were mostly small perf tests)? We do have a stress infra for GC.

Sure, will try the stress infra, thanks!

@janvorli

Is this something that we still want to change for 7.0?


EgorBo commented Jun 15, 2022

I know that @kunalspathak has made some progress on a related issue, and I personally will probably just remove that explicit memory barrier, at least for OSX-arm64.

@janvorli janvorli removed the untriaged New issue has not been triaged by the area owner label Jun 15, 2022
@janvorli janvorli added this to the 7.0.0 milestone Jun 15, 2022
@kunalspathak

That's right. I am currently working on a prototype to have this removed and hoping to get it into 7.0.

@kunalspathak

@Maoni0 - Do we want to extend #71512 to GC code? When I tried something similar for Windows/arm64 in #71169, we didn't see much benefit, and hence didn't optimize windows/arm64.


Maoni0 commented Jul 8, 2022

Thanks @kunalspathak. Unless you believe the Linux scenario is different (i.e. the full barrier we are doing is somehow a lot more expensive on Linux than on Windows), I would say let's not do this for Linux either.

@kunalspathak

Sounds good. I will close this issue then.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 8, 2022