Unix arm64 atomics #71512

kunalspathak · 2022-06-30T23:01:03Z

Use clang's target feature to mark methods as "lse", which lets us use atomic instructions on machines that support that feature and fall back to the existing ldar/stlr on machines that do not.
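As a rough illustration of the idea (not the code in this PR), the sketch below marks one function with clang's `target("lse")` attribute and picks between it and a baseline version at run time. The names `g_lse_present`, `detect_lse`, `lse_exchange_add`, `baseline_exchange_add`, and `interlocked_exchange_add` are hypothetical, and the Linux `getauxval`/`HWCAP_ATOMICS` check is an assumption about how the feature could be detected. On targets other than arm64 the attribute macro expands to nothing, so the sketch still builds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>              /* getauxval, AT_HWCAP */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1 << 8)     /* Linux arm64 hwcap bit for LSE atomics */
#endif
#endif

/* Hypothetical flag, loosely modeled on the g_arm64_atomics_present
   name that appears in this PR's commit list. */
static bool g_lse_present = false;

/* Assumption: on Linux/arm64 the kernel reports LSE through AT_HWCAP.
   Every other platform is treated as "no LSE" in this sketch. */
void detect_lse(void)
{
#if defined(__linux__) && defined(__aarch64__)
    g_lse_present = (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    g_lse_present = false;
#endif
}

/* Only clang on arm64 gets the attribute; elsewhere it expands to nothing. */
#if defined(__aarch64__) && defined(__clang__)
#define LSE_TARGET __attribute__((target("lse"), noinline))
#else
#define LSE_TARGET
#endif

/* Compiled with LSE allowed, so the builtin may lower to a single
   atomic instruction instead of a retry loop. */
LSE_TARGET
static int32_t lse_exchange_add(volatile int32_t *addend, int32_t value)
{
    return __atomic_fetch_add(addend, value, __ATOMIC_SEQ_CST);
}

/* Compiled for the baseline target; typically an exclusive load/store
   loop or a call into a compiler-provided helper. */
static int32_t baseline_exchange_add(volatile int32_t *addend, int32_t value)
{
    return __atomic_fetch_add(addend, value, __ATOMIC_SEQ_CST);
}

/* Returns the value stored at *addend before the addition. */
int32_t interlocked_exchange_add(volatile int32_t *addend, int32_t value)
{
    assert(addend != 0);
    return g_lse_present ? lse_exchange_add(addend, value)
                         : baseline_exchange_add(addend, value);
}
```

A caller would run `detect_lse()` once at startup and then use `interlocked_exchange_add(&counter, 1)` everywhere; both paths return the same result, only the generated instructions differ.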

Improvements on benchmarks mentioned in #70921 (comment):

Threads	main (msec)	PR (msec)	Diff
1	251.5455	237.6332	-5.53%
2	1143.169	502.7126	-56.02%
4	1156.102	816.6927	-29.36%
8	1196.467	1036.283	-13.39%
16	1665.083	1149.873	-30.94%
32	1921.479	1177.914	-38.70%
64	1868.538	1326.892	-28.99%
80	2127.106	1571.951	-26.10%
128	2245.763	1598.338	-28.83%
256	2253.283	1462.568	-35.09%

MAIN numbers

main
1. 1 threads took 252.699 msec.
2. 1 threads took 251.5171 msec.
3. 1 threads took 251.0927 msec.
4. 1 threads took 251.1253 msec.
5. 1 threads took 251.2933 msec.
---------------------------
1. 2 threads took 1426.6813 msec.
2. 2 threads took 1379.4319 msec.
3. 2 threads took 722.4333 msec.
4. 2 threads took 769.4293 msec.
5. 2 threads took 1417.8713 msec.
---------------------------
1. 4 threads took 1150.6362 msec.
2. 4 threads took 1155.872 msec.
3. 4 threads took 1192.2462 msec.
4. 4 threads took 1101.9667 msec.
5. 4 threads took 1179.7866 msec.
---------------------------
1. 8 threads took 1141.091 msec.
2. 8 threads took 1156.4886 msec.
3. 8 threads took 1190.6568 msec.
4. 8 threads took 1238.2772 msec.
5. 8 threads took 1255.8233 msec.
---------------------------
1. 16 threads took 1603.1329 msec.
2. 16 threads took 1593.65 msec.
3. 16 threads took 1709.6671 msec.
4. 16 threads took 1720.4286 msec.
5. 16 threads took 1698.5372 msec.
---------------------------
1. 32 threads took 1974.5259 msec.
2. 32 threads took 1980.6739 msec.
3. 32 threads took 1843.1776 msec.
4. 32 threads took 1902.8479 msec.
5. 32 threads took 1906.1697 msec.
---------------------------
1. 64 threads took 1858.92 msec.
2. 64 threads took 1878.6331 msec.
3. 64 threads took 1877.9393 msec.
4. 64 threads took 1882.9364 msec.
5. 64 threads took 1844.2635 msec.
---------------------------
1. 80 threads took 2090.5906 msec.
2. 80 threads took 2191.8835 msec.
3. 80 threads took 2034.5112 msec.
4. 80 threads took 2144.1358 msec.
5. 80 threads took 2174.4106 msec.
---------------------------
1. 128 threads took 2265.2091 msec.
2. 128 threads took 2283.7954 msec.
3. 128 threads took 2261.3203 msec.
4. 128 threads took 2172.7291 msec.
5. 128 threads took 2210.5374 msec.
---------------------------
1. 256 threads took 2275.3843 msec.
2. 256 threads took 2264.395 msec.
3. 256 threads took 2212.0343 msec.
4. 256 threads took 2263.581 msec.
5. 256 threads took 2251.0193 msec.
---------------------------

PR numbers

PR
1. 1 threads took 238.4891 msec.
2. 1 threads took 237.2826 msec.
3. 1 threads took 237.4311 msec.
4. 1 threads took 237.2936 msec.
5. 1 threads took 237.6696 msec.
---------------------------
1. 2 threads took 509.3376 msec.
2. 2 threads took 494.3586 msec.
3. 2 threads took 501.2075 msec.
4. 2 threads took 461.6739 msec.
5. 2 threads took 546.9855 msec.
---------------------------
1. 4 threads took 830.7432 msec.
2. 4 threads took 844.639 msec.
3. 4 threads took 844.3923 msec.
4. 4 threads took 797.2337 msec.
5. 4 threads took 766.4552 msec.
---------------------------
1. 8 threads took 1098.5938 msec.
2. 8 threads took 1049.9092 msec.
3. 8 threads took 1054.3204 msec.
4. 8 threads took 968.63 msec.
5. 8 threads took 1009.9627 msec.
---------------------------
1. 16 threads took 1147.6109 msec.
2. 16 threads took 1151.3432 msec.
3. 16 threads took 1174.0256 msec.
4. 16 threads took 1123.5942 msec.
5. 16 threads took 1152.791 msec.
---------------------------
1. 32 threads took 1184.7472 msec.
2. 32 threads took 1173.7704 msec.
3. 32 threads took 1158.7283 msec.
4. 32 threads took 1154.133 msec.
5. 32 threads took 1218.1895 msec.
---------------------------
1. 64 threads took 1310.8412 msec.
2. 64 threads took 1241.5682 msec.
3. 64 threads took 1362.2923 msec.
4. 64 threads took 1360.6422 msec.
5. 64 threads took 1359.1156 msec.
---------------------------
1. 80 threads took 1631.0378 msec.
2. 80 threads took 1606.5723 msec.
3. 80 threads took 1491.7627 msec.
4. 80 threads took 1611.5671 msec.
5. 80 threads took 1518.8149 msec.
---------------------------
1. 128 threads took 1639.2338 msec.
2. 128 threads took 1662.3255 msec.
3. 128 threads took 1437.6061 msec.
4. 128 threads took 1654.1847 msec.
5. 128 threads took 1417.8441 msec.
---------------------------
1. 256 threads took 1479.4628 msec.
2. 256 threads took 1479.5195 msec.
3. 256 threads took 1433.6269 msec.
4. 256 threads took 1467.0887 msec.
5. 256 threads took 1453.1409 msec.
---------------------------
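The main and PR columns of the summary table appear to be the arithmetic mean of the five runs listed for each thread count, with Diff being the relative change of the PR mean against the main mean. A minimal sketch that recomputes the 2-thread row from the raw runs above (the helper names `mean_msec` and `diff_percent` are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timings, in milliseconds. */
static double mean_msec(const double *runs, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}

/* Relative change of the PR mean against the main mean, in percent.
   A negative value means the PR is faster. */
static double diff_percent(double main_mean, double pr_mean)
{
    return (pr_mean - main_mean) / main_mean * 100.0;
}

/* The five 2-thread runs, copied from the raw listings above. */
static const double main_2_threads[] = {
    1426.6813, 1379.4319, 722.4333, 769.4293, 1417.8713
};
static const double pr_2_threads[] = {
    509.3376, 494.3586, 501.2075, 461.6739, 546.9855
};
```

Calling `mean_msec` on each array and passing the two means to `diff_percent` gives values that round to the 1143.169, 502.7126, and -56.02% shown in the table. The 2-thread runs on main are also the noisiest of the whole set (722 to 1427 msec), so that row's mean is less stable than the others.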

dotnet-issue-labeler · 2022-06-30T23:01:10Z

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

kunalspathak · 2022-07-04T18:21:26Z

I tried to incorporate the feedback about turning this always ON for OSX/Arm64. Since I don't have that machine, I enabled it for Unix/Arm64 instead, just to see that it compiles and behaves correctly.

Left = Ideal behaviour, where we use the Lse_* implementation
Right = What it would look like if LSE_DEFAULT_ENABLED=true

What I observe is that on the left we jump straight to the inlined implementation, whereas on the right it still does some function multiversioning magic and goes to (I believe) the OS implementation of acq_release. See below the point where the atomics instructions are executed.

Here too, left is LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT=false and right is LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT=true.

I will ask someone to validate what they see on OSX.

kunalspathak · 2022-07-05T16:11:28Z

@jkotas @janvorli @davidwrighton @dotnet/jit-contrib

CC: @Maoni0

kunalspathak · 2022-07-05T17:34:10Z

src/coreclr/pal/inc/pal.h

+#if defined(LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT)
+
+#define Define_InterlockMethod(RETURN_TYPE, METHOD_DECL, METHOD_INVOC, INTRINSIC_NAME) \
+EXTERN_C PALIMPORT inline RETURN_TYPE PALAPI METHOD_DECL \


Following up on #71026 (comment): to make sure that this does generate LSE, I believe this should also be marked with the "lse" attribute and with noinline.

EgorBo · 2022-07-14T16:45:48Z

Improvements on Linux-arm64: dotnet/perf-autofiling-issues#6770

kunalspathak added 4 commits June 30, 2022 15:34

Define_InterlockMethod macro

b20cf2a

compiler failure

8ed7421

fix build errors

7ead20a

Set g_arm64_atomics_present at common place

eb21958

ghost assigned kunalspathak Jun 30, 2022

This was referenced Jun 30, 2022

Try adding armv8-a+lse #71260

Closed

Do not use full memory barrier for osx/arm64 #71026

Closed

kunalspathak added 2 commits July 1, 2022 07:34

Fix the missing declaration

73ee366

Change TARGET_ARM64 => HOST_ARM64

71be17c

teo-tsirpanis added arch-arm64 area-PAL-coreclr labels Jul 1, 2022

kunalspathak added 3 commits July 1, 2022 15:14

Use LSE for InterlockedCompareExchange

f927cf2

Attempt to fix osx-arm64 build issue

b0de604

Introduce LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT

800aa30

kunalspathak marked this pull request as ready for review July 5, 2022 16:11

kunalspathak commented Jul 5, 2022

Make sure that compiler knows that M1 has lse

fa2b92d

jkotas approved these changes Jul 7, 2022

This was referenced Jul 7, 2022

jit.1 work item failing on mono #67888

Closed

Test failure JIT/Performance/CodeQuality/BenchmarksGame/regex-redux/regex-redux-5/regex-redux-5.sh #66625

Closed

kunalspathak merged commit 10286e9 into dotnet:main Jul 7, 2022

kunalspathak deleted the unix_arm64_atomics branch July 7, 2022 17:14

This was referenced Jul 8, 2022

osx-arm64 optimal code generation #41128

Open

gcenv.interlocked's Interlocked use full memory barriers even with 8.1 Atomics #67824

Closed

jakobbotsch mentioned this pull request Jul 26, 2022

[Perf] Changes at 7/7/2022 11:40:51 PM dotnet/perf-autofiling-issues#6640

Open

ghost locked as resolved and limited conversation to collaborators Aug 13, 2022
