Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Fix LowerHWIntrinsicGetElement for xarch #104987

Merged
merged 5 commits into from
Jul 18, 2024

Conversation

AndyAyersMS
Copy link
Member

Ensure we don't reorder evaluation and change exceptional behavior.

Closes #89565.

Ensure we don't reorder evaluation and change exceptional behavior.

Closes dotnet#89565.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 16, 2024
@AndyAyersMS
Copy link
Member Author

@jakobbotsch PTAL
cc @dotnet/jit-contrib

Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Were the diffs bad if we just skipped the optimization when we cannot move the indir?

@AndyAyersMS
Copy link
Member Author

LGTM otherwise. Were the diffs bad if we just skipped the optimization when we cannot move the indir?

I don't recall -- let me look.

@AndyAyersMS
Copy link
Member Author

From SPMI the diffs are somewhat bigger, eg in ASP.NET:

With Fix

[08:46:54] Top method regressions (bytes):
[08:46:54]            6 (13.33% of base) : 66717.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Instrumented Tier0)
[08:46:54]            6 (13.33% of base) : 10948.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Tier0)
[08:46:54]            6 (13.33% of base) : 20741.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Tier0)
[08:46:54]            6 (13.33% of base) : 64929.dasm - System.Runtime.Intrinsics.Vector128`1[int]:get_Item(int):int:this (Tier0)
[08:46:54]            6 (13.04% of base) : 58073.dasm - System.Runtime.Intrinsics.Vector128`1[long]:get_Item(int):long:this (Tier0)
[08:46:54]            6 (13.04% of base) : 36918.dasm - System.Runtime.Intrinsics.Vector128`1[long]:get_Item(int):long:this (Tier0)
[08:46:54]            6 (13.04% of base) : 64402.dasm - System.Runtime.Intrinsics.Vector256`1[long]:get_Item(int):long:this (Tier0)
[08:46:54]            2 ( 8.70% of base) : 152279.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Tier1)

With Avoid

[11:41:47] Top method regressions (bytes):
[11:41:47]           14 (60.87% of base) : 152279.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Tier1)
[11:41:47]           13 (28.89% of base) : 66717.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Instrumented Tier0)
[11:41:47]           13 (28.89% of base) : 20741.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Tier0)
[11:41:47]           13 (28.89% of base) : 10948.dasm - System.Numerics.Vector`1[int]:get_Item(int):int:this (Tier0)
[11:41:47]           13 (28.26% of base) : 64402.dasm - System.Runtime.Intrinsics.Vector256`1[long]:get_Item(int):long:this (Tier0)
[11:41:47]           10 (22.22% of base) : 64929.dasm - System.Runtime.Intrinsics.Vector128`1[int]:get_Item(int):int:this (Tier0)
[11:41:47]           10 (21.74% of base) : 36918.dasm - System.Runtime.Intrinsics.Vector128`1[long]:get_Item(int):long:this (Tier0)
[11:41:47]           10 (21.74% of base) : 58073.dasm - System.Runtime.Intrinsics.Vector128`1[long]:get_Item(int):long:this (Tier0)

image

There are only a few diffs in non get_Item methods.

@AndyAyersMS
Copy link
Member Author

NAOT failure looks to be related:

  ILC: D:\a\_work\1\s\src\coreclr\jit\lsrabuild.cpp:2601
  ILC: Assertion failed '!"Expected empty defList at end of block"' in 'System.Numerics.Matrix4x4:get_Item(int,int):float:this' during 'Linear scan register alloc' (IL size 14; hash 0x67a5e306; FullOpts)

@AndyAyersMS
Copy link
Member Author

Ah, other failures too.

@AndyAyersMS AndyAyersMS merged commit 2d17105 into dotnet:main Jul 18, 2024
114 checks passed
@tannergooding
Copy link
Member

tannergooding commented Jul 18, 2024

@AndyAyersMS, this seems to have inserted quite a few null checks for the implicit this parameter on structs:

        ; byrRegs +[rcx]
+       cmp      byte  ptr [rcx], cl
        cmp      edx, 4
        jae      SHORT G_M63163_IG04
        vmovsd   xmm0, qword ptr [rcx+8*rdx]
-						;; size=10 bbWeight=1 PerfScore 5.25
+						;; size=12 bbWeight=1 PerfScore 8.25

Is that expected? We normally don't see such null checks (see for example: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACXDKAVzAwYGUGBvGhoIYATCK2AAbGAxgAGANwCho8VJnlF1IQ3oixk6QHEYGAJoAKAJQMAvAD51mgL5A==)

@AndyAyersMS
Copy link
Member Author

That's kind of the point. It adds a handful of nullchecks in places where the JIT doesn't or can't prove the vector address won't be null, since the dereference happens before the element access.

@tannergooding
Copy link
Member

It looks to be generally inconsistent with how we handle it for spans and other cases: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4axAMwNcGKBzAYGshgG8aDKwwAmEDsGkMYdDdWu37jmM/JuPltYA2gCyMBgAFhA2AJLikgAUYZHRcWKSAPJifBBcuCwAchAxXJK8XOUA5gCUALqBVtqeDk4A4uEAGgnVDAC8AHwMXDAA7iZi2FwAPHYtMP0JsABmznTVQXS1/tYNTDqz3gztGACa3X2Dw2OyE9MH0gvLvuubbgC+NJo6ZAxCNBbuaxNbi4bBLHz3HxdWQAKgYYh6A3hcH6xy61W2Vl2wLyYIhXicZ1h8MRgzEKOOZwxNDeQA=

I understand the intent of doing null checks here and think that in a case like p->GetElement(0) it's goodness, but it also appears to be inconsistent with general handling of the implicit this parameter for structs and so it isn't clear to me that all the regressions are "correct".

I think the actual issue is just that the managed fallback is deferring to the static GetElement and so its forcing the entire 16-byte vector to dereference rather than directly accessing the relevant element (hence causing a nullref prior to the bounds check). Instead we should be doing the bounds check up front and then deferring to GetElementUnsafe to ensure the behavior is deterministic.

@tannergooding
Copy link
Member

tannergooding commented Jul 18, 2024

That is when DOTNET_EnableAVX=0 we're doing:

Vector256<float> tmp = this;
if (index < 0 || index >= Count)
    throw ArgumentOutOfRangeException();
return Unsafe.ReadUnaligned(ref Unsafe.Add(ref Unsafe.As<Vector256<float>, float>(ref tmp), index));

While when we are accelerated we're doing:

if (index < 0 || index >= Count)
    throw ArgumentOutOfRangeException();
return Unsafe.ReadUnaligned(ref Unsafe.Add(ref Unsafe.As<Vector256<float>, float>(ref this), index));

and therefore not doing the full dereference first. Not doing the full dereference is the more desirable/better behavior here, so we really just need to fix the managed fallback I think

@jakobbotsch
Copy link
Member

@tannergooding The JIT is internally inconsistent because of this bug -- so this cannot just be fixed on the managed side. We have GetElement tree that looks like this:

N008 ( 11, 13) [000007] ---XG------                         └──▌  HWINTRINSIC float  float GetElement <l:$2c3, c:$2c2>
N002 (  3,  2) [000001] ---XG------                            ├──▌  IND       simd8  <l:$240, c:$241>
N001 (  1,  1) [000000] -----------                            │  └──▌  LCL_VAR   byref  V00 this         u:1 (last use) $80
N007 (  7, 10) [000006] ---X-------                            └──▌  COMMA     int    $280
N005 (  6,  9) [000005] ---X-------                               ├──▌  BOUNDS_CHECK_ArgRng void   $203
N003 (  1,  1) [000004] -----------                               │  ├──▌  LCL_VAR   int    V01 arg1         u:1 $c0
N004 (  1,  1) [000003] -----------                               │  └──▌  CNS_INT   int    2 $44
N006 (  1,  1) [000002] -----------                               └──▌  LCL_VAR   int    V01 arg1         u:1 (last use) $c0

This requires the indirection to be evaluated before the bounds check. The lowering transformation being fixed here is reordering the evaluation order of the operands. Alternatively this IR should not have been created like this in the first place, but that's still a fix somewhere else within the JIT.

#83005 (comment) is the original context.

@tannergooding
Copy link
Member

I think the shape here is wrong in that we've produced COMMA(BOUNDS_CHECK_ArgRng, op2). This should've been COMMA(BOUNDS_CHECK_ArgRng, GetElement(op1, op2))

The bounds check is meant to prevent the intrinsic (and any operand evaluation) from executing. That is, it's meant to be the first thing done before we do anything else.

@tannergooding
Copy link
Member

Although I guess that would still be potentially problematic with regards to non accelerated handling since GetElement takes Vector256<T> by value
and therefore it'd force the indirection to occur before get element happens

so I guess the real issue is we didn't spill op1 when inserting the bounds check during import

@github-actions github-actions bot locked and limited conversation to collaborators Aug 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: LowerHWIntrinsicGetElement can reorder trees unsafely
4 participants