Unix arm64 atomics #71512

Merged
merged 10 commits into from
Jul 7, 2022

kunalspathak
Member

@kunalspathak kunalspathak commented Jun 30, 2022

Use clang's target feature to mark methods as "lse", which lets us use atomic instructions on machines that support that feature and fall back to the existing ldar/stlr on machines that do not.
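A minimal sketch of the idea, not the PR's actual PAL code: one variant is compiled with clang's `target("lse")` attribute, another for the baseline target, and a runtime check picks between them. The names `AddLse`, `AddBaseline`, `HasLse` and `InterlockedAddSketch` are hypothetical, and the `getauxval(AT_HWCAP)` probe is an assumption about how a Linux/arm64 process could detect LSE; the runtime may well detect it differently.

```cpp
#include <cstdint>

// The LSE path is only built with clang on Linux/arm64; elsewhere the sketch
// degrades to the baseline variant so it still compiles and runs.
#if defined(__aarch64__) && defined(__clang__) && defined(__linux__)
#define SKETCH_HAS_LSE_PATH 1
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)  // LSE bit in AT_HWCAP on Linux/arm64
#endif
#endif

// Baseline variant: compiled for the default target, so the compiler emits
// whatever the build's baseline allows (exclusive load/store loop or helper).
static int32_t AddBaseline(volatile int32_t* p, int32_t v) {
    return __atomic_add_fetch(p, v, __ATOMIC_SEQ_CST);
}

#ifdef SKETCH_HAS_LSE_PATH
// LSE variant: the target attribute allows LSE instructions in this function
// only; noinline keeps it from being merged into a non-LSE caller.
__attribute__((target("lse"), noinline))
static int32_t AddLse(volatile int32_t* p, int32_t v) {
    return __atomic_add_fetch(p, v, __ATOMIC_SEQ_CST);
}

// Assumed detection mechanism: ask the kernel whether the CPU reports LSE.
static bool HasLse() {
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
}
#endif

// Hypothetical entry point: use the LSE variant when the machine supports it,
// otherwise fall back to the baseline variant.
int32_t InterlockedAddSketch(volatile int32_t* p, int32_t v) {
#ifdef SKETCH_HAS_LSE_PATH
    static const bool hasLse = HasLse();  // probed once, then cached
    if (hasLse) {
        return AddLse(p, v);
    }
#endif
    return AddBaseline(p, v);
}
```

Calling `InterlockedAddSketch(&counter, 1)` returns the incremented value on either path; only the instructions used to get there differ.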

image

Improvements on benchmarks mentioned in #70921 (comment):

| Threads | main (msec) | PR (msec) | Diff |
| --- | --- | --- | --- |
| 1 | 251.5455 | 237.6332 | -5.53% |
| 2 | 1143.169 | 502.7126 | -56.02% |
| 4 | 1156.102 | 816.6927 | -29.36% |
| 8 | 1196.467 | 1036.283 | -13.39% |
| 16 | 1665.083 | 1149.873 | -30.94% |
| 32 | 1921.479 | 1177.914 | -38.70% |
| 64 | 1868.538 | 1326.892 | -28.99% |
| 80 | 2127.106 | 1571.951 | -26.10% |
| 128 | 2245.763 | 1598.338 | -28.83% |
| 256 | 2253.283 | 1462.568 | -35.09% |
MAIN numbers
main
1. 1 threads took 252.699 msec.
2. 1 threads took 251.5171 msec.
3. 1 threads took 251.0927 msec.
4. 1 threads took 251.1253 msec.
5. 1 threads took 251.2933 msec.
---------------------------
1. 2 threads took 1426.6813 msec.
2. 2 threads took 1379.4319 msec.
3. 2 threads took 722.4333 msec.
4. 2 threads took 769.4293 msec.
5. 2 threads took 1417.8713 msec.
---------------------------
1. 4 threads took 1150.6362 msec.
2. 4 threads took 1155.872 msec.
3. 4 threads took 1192.2462 msec.
4. 4 threads took 1101.9667 msec.
5. 4 threads took 1179.7866 msec.
---------------------------
1. 8 threads took 1141.091 msec.
2. 8 threads took 1156.4886 msec.
3. 8 threads took 1190.6568 msec.
4. 8 threads took 1238.2772 msec.
5. 8 threads took 1255.8233 msec.
---------------------------
1. 16 threads took 1603.1329 msec.
2. 16 threads took 1593.65 msec.
3. 16 threads took 1709.6671 msec.
4. 16 threads took 1720.4286 msec.
5. 16 threads took 1698.5372 msec.
---------------------------
1. 32 threads took 1974.5259 msec.
2. 32 threads took 1980.6739 msec.
3. 32 threads took 1843.1776 msec.
4. 32 threads took 1902.8479 msec.
5. 32 threads took 1906.1697 msec.
---------------------------
1. 64 threads took 1858.92 msec.
2. 64 threads took 1878.6331 msec.
3. 64 threads took 1877.9393 msec.
4. 64 threads took 1882.9364 msec.
5. 64 threads took 1844.2635 msec.
---------------------------
1. 80 threads took 2090.5906 msec.
2. 80 threads took 2191.8835 msec.
3. 80 threads took 2034.5112 msec.
4. 80 threads took 2144.1358 msec.
5. 80 threads took 2174.4106 msec.
---------------------------
1. 128 threads took 2265.2091 msec.
2. 128 threads took 2283.7954 msec.
3. 128 threads took 2261.3203 msec.
4. 128 threads took 2172.7291 msec.
5. 128 threads took 2210.5374 msec.
---------------------------
1. 256 threads took 2275.3843 msec.
2. 256 threads took 2264.395 msec.
3. 256 threads took 2212.0343 msec.
4. 256 threads took 2263.581 msec.
5. 256 threads took 2251.0193 msec.
---------------------------
PR numbers
PR
1. 1 threads took 238.4891 msec.
2. 1 threads took 237.2826 msec.
3. 1 threads took 237.4311 msec.
4. 1 threads took 237.2936 msec.
5. 1 threads took 237.6696 msec.
---------------------------
1. 2 threads took 509.3376 msec.
2. 2 threads took 494.3586 msec.
3. 2 threads took 501.2075 msec.
4. 2 threads took 461.6739 msec.
5. 2 threads took 546.9855 msec.
---------------------------
1. 4 threads took 830.7432 msec.
2. 4 threads took 844.639 msec.
3. 4 threads took 844.3923 msec.
4. 4 threads took 797.2337 msec.
5. 4 threads took 766.4552 msec.
---------------------------
1. 8 threads took 1098.5938 msec.
2. 8 threads took 1049.9092 msec.
3. 8 threads took 1054.3204 msec.
4. 8 threads took 968.63 msec.
5. 8 threads took 1009.9627 msec.
---------------------------
1. 16 threads took 1147.6109 msec.
2. 16 threads took 1151.3432 msec.
3. 16 threads took 1174.0256 msec.
4. 16 threads took 1123.5942 msec.
5. 16 threads took 1152.791 msec.
---------------------------
1. 32 threads took 1184.7472 msec.
2. 32 threads took 1173.7704 msec.
3. 32 threads took 1158.7283 msec.
4. 32 threads took 1154.133 msec.
5. 32 threads took 1218.1895 msec.
---------------------------
1. 64 threads took 1310.8412 msec.
2. 64 threads took 1241.5682 msec.
3. 64 threads took 1362.2923 msec.
4. 64 threads took 1360.6422 msec.
5. 64 threads took 1359.1156 msec.
---------------------------
1. 80 threads took 1631.0378 msec.
2. 80 threads took 1606.5723 msec.
3. 80 threads took 1491.7627 msec.
4. 80 threads took 1611.5671 msec.
5. 80 threads took 1518.8149 msec.
---------------------------
1. 128 threads took 1639.2338 msec.
2. 128 threads took 1662.3255 msec.
3. 128 threads took 1437.6061 msec.
4. 128 threads took 1654.1847 msec.
5. 128 threads took 1417.8441 msec.
---------------------------
1. 256 threads took 1479.4628 msec.
2. 256 threads took 1479.5195 msec.
3. 256 threads took 1433.6269 msec.
4. 256 threads took 1467.0887 msec.
5. 256 threads took 1453.1409 msec.
---------------------------
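The summary figures appear to be the mean of the five runs per thread count, with Diff computed as (PR - main) / main. A small sketch of that arithmetic follows; the helper names `MeanMsec` and `PercentDiff` are hypothetical and are not part of the benchmark itself.

```cpp
#include <numeric>
#include <vector>

// Mean of a set of run times in milliseconds; assumes `runs` is non-empty.
double MeanMsec(const std::vector<double>& runs) {
    return std::accumulate(runs.begin(), runs.end(), 0.0) /
           static_cast<double>(runs.size());
}

// Relative change of the PR time against main, in percent.
// Negative values mean the PR is faster.
double PercentDiff(double mainMsec, double prMsec) {
    return (prMsec - mainMsec) / mainMsec * 100.0;
}
```

For the 1-thread rows, the five main runs average to about 251.5455 msec and the five PR runs to about 237.6332 msec, which gives roughly -5.53%.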

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@kunalspathak
Member Author

kunalspathak commented Jul 4, 2022

I tried to incorporate the feedback about turning this always ON for OSX/Arm64. Since I don't have such a machine, I enabled it for Unix/Arm64 instead, just to see that it compiles and behaves correctly.

Left = ideal behaviour, where we use the Lse_* implementation
Right = what it looks like with LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT=true

image

What I observe is that on the left we jump straight to the inlined implementation, whereas on the right it still goes through some function-multiversioning magic and ends up in (I believe) the OS implementation of acq_release. The point where the atomic instructions are executed is shown below.

Here too, left is LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT=false and right is LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT=true.

image

I will ask someone to validate what they see on OSX.

@kunalspathak kunalspathak marked this pull request as ready for review July 5, 2022 16:11
@kunalspathak
Member Author

@jkotas @janvorli @davidwrighton @dotnet/jit-contrib

CC: @Maoni0

#if defined(LSE_INSTRUCTIONS_ENABLED_BY_DEFAULT)

#define Define_InterlockMethod(RETURN_TYPE, METHOD_DECL, METHOD_INVOC, INTRINSIC_NAME) \
EXTERN_C PALIMPORT inline RETURN_TYPE PALAPI METHOD_DECL \
Member Author

Per #71026 (comment), to make sure this really does generate LSE instructions, I believe this should also be attributed with "lse" and (noinline).

@EgorBo
Member

EgorBo commented Jul 14, 2022

Improvements on Linux-arm64: dotnet/perf-autofiling-issues#6770

@ghost ghost locked as resolved and limited conversation to collaborators Aug 13, 2022