Skip to content

Commit

Permalink
Improve loop cloning, with debugging improvements (#55299)
Browse files Browse the repository at this point in the history
When loop cloning was creating cloning conditions, it was creating unnecessary bounds checks in some multi-dimensional array index cases. When creating a set of cloning conditions, first a null check is done, then an array length check is done, etc. Thus, the array length expression itself won't fault because we've already done a null check. And a subsequent array index expression won't fault (or need a bounds check) because we've already checked the array length (i.e., we've done a manual bounds check). So, stop creating the unnecessary bounds checks, and mark the appropriate instructions as non-faulting by clearing the GTF_EXCEPT bit.
	
Note that I did not turn on the code to clear GTF_EXCEPT for array length checks because it leads to negative downstream effects in CSE. Namely, there end up being array length expressions that are identical except for the exception bit. When CSE sees this, it gives up on creating a CSE, which leads to regressions in some cases where we don't CSE the array length expression.

Also, for multi-dimension jagged arrays, when optimizing the fast path, we were not removing as many bounds checks as we could. In particular, we weren't removing outer bounds checks, only inner ones. Add code to handle all the bounds checks.

There are some runtime improvements (measured via BenchmarkDotNet on the JIT microbenchmarks), but also some regressions, due, as far as I can tell, to the Intel jcc erratum performance impact. In particular, benchmark ludcmp shows up to a 9% regression due to a `jae` instruction in the hot loop now crossing a 32-byte boundary due to code changes earlier in the function affecting instruction alignment. The hot loop itself is exactly the same (module register allocation differences). As there is nothing that can be done (without mitigating the jcc erratum) -- it's "bad luck".

In addition to those functional changes, there are a number of debugging-related improvements:
1. Loop cloning: (a) Improved dumping of cloning conditions and other things, (b) remove an unnecessary member to `LcOptInfo`, (c) convert the `LoopCloneContext` raw arrays to `jitstd::vector` for easier debugging, as clrjit.natvis can be taught to understand them.
2. CSE improvements: (a) Add `getCSEAvailBit` and `getCSEAvailCrossCallBit` functions to avoid multiple hard-codings of these expresions, (b) stop printing all the details of the CSE dataflow to JitDump; just print the result, (c) add `optPrintCSEDataFlowSet` function to print the CSE dataflow set in symbolic form, not just the raw bits, (d) added `FMT_CSE` string to use for formatting CSE candidates, (e) added `optOptimizeCSEs` to the phase structure for JitDump output, (f) remove unused `optCSECandidateTotal` (remnant of Valnum + lexical CSE)
3. Alignment: (a) Moved printing of alignment boundaries from `emitIssue1Instr` to `emitEndCodeGen`, to avoid the possibility of reading an instruction beyond the basic block. Also, improved the Intel jcc erratum criteria calculations, (b) Change `align` instructions of zero size to have a zero PerfScore throughput number (since they don't generate code), (c) Add `COMPlus_JitDasmWithAlignmentBoundaries` to force disasm output to display alignment boundaries.
4. Codegen / Emitter: (a) Added `emitLabelString` function for constructing a string to display for a bound emitter label. Created `emitPrintLabel` to directly print the label, (b) Add `genInsDisplayName` function to create a string for use when outputting an instruction. For xarch, this prepends the "v" for SIMD instructions, as necessary. This is preferable to calling the raw `genInsName` function, (c) For each insGroup, created a debug-only list of basic blocks that contributed code to that insGroup. Display this set of blocks in the JitDump disasm output, with block ID. This is useful for looking at an IG, and finding the blocks in a .dot flow graph visualization that contributed to it, (d) remove unused `instDisp`
5. Clrjit.natvis: (a) add support for `jitstd::vector`, `JitExpandArray<T>`, `JitExpandArrayStack<T>`, `LcOptInfo`.
6. Misc: (a) When compacting an empty loop preheader block with a subsequent block, clear the preheader flag.

## benchmarks.run.windows.x64.checked.mch:

```

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 25504
Total bytes of diff: 25092
Total bytes of delta: -412 (-1.62% of base)
Total relative delta: -0.31
    diff is an improvement.
    relative diff is an improvement.
```
<details>

<summary>Detail diffs</summary>

```


Top file improvements (bytes):
         -92 : 14861.dasm (-2.57% of base)
         -88 : 2430.dasm (-0.77% of base)
         -68 : 12182.dasm (-3.82% of base)
         -48 : 24678.dasm (-1.61% of base)
         -31 : 21598.dasm (-5.13% of base)
         -26 : 21601.dasm (-4.57% of base)
         -21 : 25069.dasm (-7.14% of base)
         -16 : 14859.dasm (-1.38% of base)
         -11 : 14862.dasm (-1.35% of base)
          -6 : 21600.dasm (-1.83% of base)
          -5 : 25065.dasm (-0.58% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 1 unchanged.

Top method improvements (bytes):
         -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
         -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
         -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
          -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this

Top method improvements (percentages):
         -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
          -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
         -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
         -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
          -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this

11 total methods with Code Size differences (11 improved, 0 regressed), 1 unchanged.

```

</details>

--------------------------------------------------------------------------------

```

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 38374.96
Total PerfScoreUnits of diff: 37914.07000000001
Total PerfScoreUnits of delta: -460.89 (-1.20% of base)
Total relative delta: -0.12
    diff is an improvement.
    relative diff is an improvement.
```
<details>

<summary>Detail diffs</summary>

```


Top file improvements (PerfScoreUnits):
     -220.67 : 24678.dasm (-1.74% of base)
      -99.27 : 14861.dasm (-2.09% of base)
      -66.30 : 21598.dasm (-1.41% of base)
      -18.73 : 2430.dasm (-0.28% of base)
      -18.40 : 21601.dasm (-1.37% of base)
       -9.73 : 25065.dasm (-0.56% of base)
       -9.05 : 14859.dasm (-0.77% of base)
       -5.51 : 21600.dasm (-0.77% of base)
       -4.15 : 12182.dasm (-0.17% of base)
       -3.92 : 14860.dasm (-0.32% of base)
       -3.46 : 25069.dasm (-2.31% of base)
       -1.70 : 14862.dasm (-0.20% of base)

12 total files with Perf Score differences (12 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
     -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
      -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
      -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
      -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this
       -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
       -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
       -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
       -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])

Top method improvements (percentages):
       -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
      -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
     -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
      -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this
       -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
      -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
       -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
       -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])

12 total methods with Perf Score differences (12 improved, 0 regressed), 0 unchanged.

```

</details>

--------------------------------------------------------------------------------

## coreclr_tests.pmi.windows.x64.checked.mch:

```

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 25430
Total bytes of diff: 24994
Total bytes of delta: -436 (-1.71% of base)
Total relative delta: -0.42
    diff is an improvement.
    relative diff is an improvement.
```
<details>

<summary>Detail diffs</summary>

```


Top file improvements (bytes):
         -92 : 194668.dasm (-2.57% of base)
         -68 : 194589.dasm (-3.82% of base)
         -48 : 248565.dasm (-1.61% of base)
         -32 : 249053.dasm (-3.58% of base)
         -31 : 251012.dasm (-5.13% of base)
         -26 : 251011.dasm (-4.57% of base)
         -19 : 248561.dasm (-6.76% of base)
         -16 : 194667.dasm (-1.38% of base)
         -15 : 252241.dasm (-0.72% of base)
         -12 : 252242.dasm (-0.81% of base)
         -11 : 194669.dasm (-1.35% of base)
          -9 : 246308.dasm (-1.06% of base)
          -9 : 246307.dasm (-1.06% of base)
          -9 : 246245.dasm (-1.06% of base)
          -9 : 246246.dasm (-1.06% of base)
          -6 : 228622.dasm (-0.77% of base)
          -6 : 251010.dasm (-1.83% of base)
          -5 : 248557.dasm (-0.61% of base)
          -4 : 249054.dasm (-0.50% of base)
          -4 : 249052.dasm (-0.47% of base)

22 total files with Code Size differences (22 improved, 0 regressed), 1 unchanged.

Top method improvements (bytes):
         -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
         -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
         -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
         -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
         -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
          -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
          -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
          -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
          -4 (-0.47% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()

Top method improvements (percentages):
         -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
         -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
          -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
         -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -3 (-1.11% of base) : 249057.dasm - SimpleArray_01.Test:Test2()
          -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
         -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
          -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
         -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
          -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
          -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()

22 total methods with Code Size differences (22 improved, 0 regressed), 1 unchanged.

```

</details>

--------------------------------------------------------------------------------

```

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 161610.68999999997
Total PerfScoreUnits of diff: 160290.10999999996
Total PerfScoreUnits of delta: -1320.58 (-0.82% of base)
Total relative delta: -0.20
    diff is an improvement.
    relative diff is an improvement.
```
<details>

<summary>Detail diffs</summary>

```


Top file improvements (PerfScoreUnits):
     -639.25 : 252241.dasm (-0.97% of base)
     -220.67 : 248565.dasm (-1.74% of base)
     -132.59 : 252242.dasm (-0.26% of base)
      -99.27 : 194668.dasm (-2.09% of base)
      -66.30 : 251012.dasm (-1.41% of base)
      -62.20 : 249053.dasm (-2.74% of base)
      -18.40 : 251011.dasm (-1.37% of base)
       -9.33 : 248557.dasm (-0.54% of base)
       -9.05 : 194667.dasm (-0.77% of base)
       -8.32 : 249054.dasm (-0.42% of base)
       -5.85 : 246308.dasm (-0.52% of base)
       -5.85 : 246307.dasm (-0.52% of base)
       -5.85 : 246245.dasm (-0.52% of base)
       -5.85 : 246246.dasm (-0.52% of base)
       -5.51 : 251010.dasm (-0.77% of base)
       -4.36 : 249052.dasm (-0.22% of base)
       -4.16 : 253363.dasm (-0.21% of base)
       -4.15 : 194589.dasm (-0.17% of base)
       -3.92 : 194666.dasm (-0.32% of base)
       -3.41 : 248561.dasm (-2.29% of base)

23 total files with Perf Score differences (23 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
     -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
     -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
     -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
      -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
      -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
      -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
       -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
       -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()
       -4.16 (-0.21% of base) : 253363.dasm - MatrixMul.Test:MatrixMul()
       -4.15 (-0.17% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
       -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
       -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)

Top method improvements (percentages):
      -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
       -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
      -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
     -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
       -2.70 (-1.71% of base) : 249057.dasm - SimpleArray_01.Test:Test2()
      -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
     -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
       -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
       -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
       -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
     -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
       -1.89 (-0.22% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
       -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()

23 total methods with Perf Score differences (23 improved, 0 regressed), 0 unchanged.

```

</details>

--------------------------------------------------------------------------------

## libraries.crossgen2.windows.x64.checked.mch:

```

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 10828
Total bytes of diff: 10809
Total bytes of delta: -19 (-0.18% of base)
Total relative delta: -0.00
    diff is an improvement.
    relative diff is an improvement.
```
<details>

<summary>Detail diffs</summary>

```


Top file improvements (bytes):
         -19 : 72504.dasm (-0.18% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

Top method improvements (percentages):
         -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

1 total methods with Code Size differences (1 improved, 0 regressed), 0 unchanged.

```

</details>

--------------------------------------------------------------------------------

```

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 6597.12
Total PerfScoreUnits of diff: 6586.31
Total PerfScoreUnits of delta: -10.81 (-0.16% of base)
Total relative delta: -0.00
    diff is an improvement.
    relative diff is an improvement.
```
<details>

<summary>Detail diffs</summary>

```


Top file improvements (PerfScoreUnits):
      -10.81 : 72504.dasm (-0.16% of base)

1 total files with Perf Score differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
      -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

Top method improvements (percentages):
      -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

1 total methods with Perf Score differences (1 improved, 0 regressed), 0 unchanged.

```

</details>

--------------------------------------------------------------------------------



* Increase loop cloning max allowed condition blocks

Allows inner loop of 3-nested loops (e.g., Array2 benchmark)
to be cloned.

* Clear GTF_INX_RNGCHK bit on loop cloning created index nodes

to avoid unnecessary bounds checks.

Revert max cloning condition blocks to 3; allowing more doesn't
seem to improve performance (probably too many conditions before
a not-sufficiently-executed loop, at least for the Array2 benchmark)

* Remove outer index bounds checks

* Convert loop cloning data structures to `vector` for better debugging

* Improve CSE dump output

1. "#if 0" the guts of the CSE dataflow; that's not useful to most people.
2. Add readable CSE number output to the CSE dataflow set output
3. Add FMT_CSE to commonize CSE number output.
4. Add PHASE_OPTIMIZE_VALNUM_CSES to the pre-phase output "allow list"
and stop doing its own blocks/trees output.
5. Remove unused optCSECandidateTotal
6. Add functions `getCSEAvailBit` and `getCSEAvailCrossCallBit` to avoid
hand-coding these bit calculations in multiple places, for the CSE dataflow set bits.

* Mark cloned array indexes as non-faulting

When generating loop cloning conditions, mark array index expressions
as non-faulting, as we have already null- and range-checked the array
before generating an index expression.

I also added similary code to mark array length expressions as non-faulting,
for the same reason. However, that leads to CQ losses because of downstream
CSE effects.

* Don't count zero-sized align instructions in PerfScore

* Add COMPlus_JitDasmWithAlignmentBoundaries

This outputs the alignment boundaries without requiring outputting the actual addresses.
It makes it easier to diff changes.

* Improve bounds check output

* Improve emitter label printing

Create function for printing bound emitter labels.

Also, add debug code to associate a BasicBlock with an insGroup, and
output the block number and ID with the emitter label in JitDump, so it's easier
to find where a group of generated instructions came from.

* Formatting

* Clear BBF_LOOP_PREHEADER bit when compacting empty pre-header block

* Keep track of all basic blocks that contribute code to an insGroup

* Update display of Intel jcc erratum branches in dump

For instructions or instruction sequences which match the Intel jcc
erratum criteria, note that in the alignment boundary dump.

Also, a few fixes:
1. Move the alignment boundary dumping from `emitIssue1Instr` to
`emitEndCodeGen` to avoid the possibility of reading the next instruction in
a group when there is no next instruction.
2. Create `IsJccInstruction` and `IsJmpInstruction` functions for use by the
jcc criteria detection, and fix that detection to fix a few omissions/errors.
3. Change the jcc criteria detection to be hard-coded to 32 byte boundaries
instead of assuming `compJitAlignLoopBoundary` is 32.

An example:
```
    cmp      r11d, dword ptr [rax+8]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ...............................
    jae      G_M42486_IG103
```
In this case, the `cmp` doesn't cross the boundary, it is adjacent (the zero indicates the number of bytes
of the instruction which cross the boundary), followed by the `jae` which starts after the boundary.

Indicating the jcc erratum criteria can help point out potential performance issues due to unlucky
alignment of these instructions in asm diffs.

* Display full instruction name in alignment and other messages

XArch sometimes prepends a "v" to the instructions names from the instruction
table. Add a function `genInsDisplayName` to create the full instruction name
that should be displayed, and use that in most places an instruction name will
be displayed, such as in the alignment messages, and normal disassembly. Use
this instead of the raw `genInsName`.

This could be extended to handle arm32 appending an "s", but I didn't want to
touch arm32 with this change.

* Fix build

* Code review feedback

1. Rename GTF_INX_NONFAULTING to GTF_INX_NOFAULT to increase clarity compared
to existing GTF_IND_NONFAULTING.
2. Minor cleanup in getInsDisplayName.

* Formatting
  • Loading branch information
BruceForstall authored Jul 10, 2021
1 parent 1ff0b50 commit 02ccdac
Show file tree
Hide file tree
Showing 26 changed files with 730 additions and 373 deletions.
5 changes: 4 additions & 1 deletion src/coreclr/jit/block.h
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,10 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#include "jithashtable.h"

/*****************************************************************************/
typedef BitVec EXPSET_TP;
typedef BitVec EXPSET_TP;
typedef BitVec_ValArg_T EXPSET_VALARG_TP;
typedef BitVec_ValRet_T EXPSET_VALRET_TP;

#if LARGE_EXPSET
#define EXPSET_SZ 64
#else
Expand Down
51 changes: 51 additions & 0 deletions src/coreclr/jit/clrjit.natvis
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,13 @@ Licensed to the .NET Foundation under one or more agreements.
The .NET Foundation licenses this file to you under the MIT license.
-->

<!--
Visual Studio debugger visualizers for RyuJIT.
Documentation for VS natvis format: https://docs.microsoft.com/en-us/visualstudio/debugger/create-custom-views-of-native-objects?view=vs-2019
Documentation for VS debugger format specifiers: https://docs.microsoft.com/en-us/visualstudio/debugger/format-specifiers-in-cpp?view=vs-2019
-->

<AutoVisualizer xmlns="http://schemas.microsoft.com/vstudio/debugger/natvis/2010">

Expand Down Expand Up @@ -183,4 +190,48 @@ The .NET Foundation licenses this file to you under the MIT license.
</Expand>
</Type>

<Type Name="jitstd::vector&lt;*&gt;">
<DisplayString Condition="m_nSize > 0">size={m_nSize,d} capacity={m_nCapacity,d}</DisplayString>
<DisplayString Condition="m_nSize == 0">Empty</DisplayString>
<Expand>
<ArrayItems>
<Size>m_nSize</Size>
<ValuePointer>m_pArray</ValuePointer>
</ArrayItems>
</Expand>
</Type>

<Type Name="JitExpandArray&lt;*&gt;">
<DisplayString Condition="m_size > 0">size={m_size,d}</DisplayString>
<DisplayString Condition="m_size == 0">Empty</DisplayString>
<Expand>
<ArrayItems>
<Size>m_size</Size>
<ValuePointer>m_members</ValuePointer>
</ArrayItems>
</Expand>
</Type>

<Type Name="JitExpandArrayStack&lt;*&gt;">
<DisplayString Condition="m_size > 0">size={m_size,d} used={m_used,d}</DisplayString>
<DisplayString Condition="m_size == 0">Empty</DisplayString>
<Expand>
<ArrayItems>
<Size>m_used</Size>
<ValuePointer>m_members</ValuePointer>
</ArrayItems>
</Expand>
</Type>

<!-- Loop cloning -->

<!-- LcOptInfo is really one of its derived types, so figure out which one. Set Inheritable=false to prevent recursion. -->
<Type Name="LcOptInfo" Inheritable="false">
<DisplayString>{optType,en}</DisplayString>
<Expand>
<ExpandedItem Condition="optType == LcOptInfo::OptType::LcJaggedArray">(LcJaggedArrayOptInfo*)this,nd</ExpandedItem>
<ExpandedItem Condition="optType == LcOptInfo::OptType::LcMdArray">(LcMdArrayOptInfo*)this,nd</ExpandedItem>
</Expand>
</Type>

</AutoVisualizer>
5 changes: 1 addition & 4 deletions src/coreclr/jit/codegen.h
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,7 @@ class CodeGen final : public CodeGenInterface
unsigned genCurDispOffset;

static const char* genInsName(instruction ins);
const char* genInsDisplayName(emitter::instrDesc* id);
#endif // DEBUG

//-------------------------------------------------------------------------
Expand Down Expand Up @@ -1503,10 +1504,6 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

void instGen_Store_Reg_Into_Lcl(var_types dstType, regNumber srcReg, int varNum, int offs);

#ifdef DEBUG
void __cdecl instDisp(instruction ins, bool noNL, const char* fmt, ...);
#endif

#ifdef TARGET_XARCH
instruction genMapShiftInsToShiftByConstantIns(instruction ins, int shiftByValue);
#endif // TARGET_XARCH
Expand Down
7 changes: 3 additions & 4 deletions src/coreclr/jit/codegencommon.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1070,7 +1070,7 @@ void CodeGen::genDefineTempLabel(BasicBlock* label)
{
genLogLabel(label);
label->bbEmitCookie = GetEmitter()->emitAddLabel(gcInfo.gcVarPtrSetCur, gcInfo.gcRegGCrefSetCur,
gcInfo.gcRegByrefSetCur, false DEBUG_ARG(label->bbNum));
gcInfo.gcRegByrefSetCur, false DEBUG_ARG(label));
}

// genDefineInlineTempLabel: Define an inline label that does not affect the GC
Expand Down Expand Up @@ -2064,9 +2064,8 @@ void CodeGen::genInsertNopForUnwinder(BasicBlock* block)
// block starts an EH region. If we pointed the existing bbEmitCookie here, then the NOP
// would be executed, which we would prefer not to do.

block->bbUnwindNopEmitCookie =
GetEmitter()->emitAddLabel(gcInfo.gcVarPtrSetCur, gcInfo.gcRegGCrefSetCur, gcInfo.gcRegByrefSetCur,
false DEBUG_ARG(block->bbNum));
block->bbUnwindNopEmitCookie = GetEmitter()->emitAddLabel(gcInfo.gcVarPtrSetCur, gcInfo.gcRegGCrefSetCur,
gcInfo.gcRegByrefSetCur, false DEBUG_ARG(block));

instGen(INS_nop);
}
Expand Down
2 changes: 1 addition & 1 deletion src/coreclr/jit/codegenlinear.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -356,7 +356,7 @@ void CodeGen::genCodeForBBlist()
// Mark a label and update the current set of live GC refs

block->bbEmitCookie = GetEmitter()->emitAddLabel(gcInfo.gcVarPtrSetCur, gcInfo.gcRegGCrefSetCur,
gcInfo.gcRegByrefSetCur, false DEBUG_ARG(block->bbNum));
gcInfo.gcRegByrefSetCur, false DEBUG_ARG(block));
}

if (block == compiler->fgFirstColdBlock)
Expand Down
6 changes: 6 additions & 0 deletions src/coreclr/jit/compiler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2988,6 +2988,7 @@ void Compiler::compInitOptions(JitFlags* jitFlags)
opts.disAsmSpilled = false;
opts.disDiffable = false;
opts.disAddr = false;
opts.disAlignment = false;
opts.dspCode = false;
opts.dspEHTable = false;
opts.dspDebugInfo = false;
Expand Down Expand Up @@ -3136,6 +3137,11 @@ void Compiler::compInitOptions(JitFlags* jitFlags)
opts.disAddr = true;
}

if (JitConfig.JitDasmWithAlignmentBoundaries() != 0)
{
opts.disAlignment = true;
}

if (JitConfig.JitLongAddress() != 0)
{
opts.compLongAddress = true;
Expand Down
65 changes: 50 additions & 15 deletions src/coreclr/jit/compiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -6237,9 +6237,9 @@ class Compiler
public:
void optInit();

GenTree* Compiler::optRemoveRangeCheck(GenTreeBoundsChk* check, GenTree* comma, Statement* stmt);
GenTree* Compiler::optRemoveStandaloneRangeCheck(GenTreeBoundsChk* check, Statement* stmt);
void Compiler::optRemoveCommaBasedRangeCheck(GenTree* comma, Statement* stmt);
GenTree* optRemoveRangeCheck(GenTreeBoundsChk* check, GenTree* comma, Statement* stmt);
GenTree* optRemoveStandaloneRangeCheck(GenTreeBoundsChk* check, Statement* stmt);
void optRemoveCommaBasedRangeCheck(GenTree* comma, Statement* stmt);
bool optIsRangeCheckRemovable(GenTree* tree);

protected:
Expand Down Expand Up @@ -6728,7 +6728,7 @@ class Compiler
// BitVec trait information for computing CSE availability using the CSE_DataFlow algorithm.
// Two bits are allocated per CSE candidate to compute CSE availability
// plus an extra bit to handle the initial unvisited case.
// (See CSE_DataFlow::EndMerge for an explaination of why this is necessary)
// (See CSE_DataFlow::EndMerge for an explanation of why this is necessary.)
//
// The two bits per CSE candidate have the following meanings:
// 11 - The CSE is available, and is also available when considering calls as killing availability.
Expand All @@ -6738,6 +6738,37 @@ class Compiler
//
BitVecTraits* cseLivenessTraits;

//-----------------------------------------------------------------------------------------------------------------
// getCSEnum2bit: Return the normalized index to use in the EXPSET_TP for the CSE with the given CSE index.
// Each GenTree has a `gtCSEnum` field. Zero is reserved to mean this node is not a CSE, positive values indicate
// CSE uses, and negative values indicate CSE defs. The caller must pass a non-zero positive value, as from
// GET_CSE_INDEX().
//
static unsigned genCSEnum2bit(unsigned CSEnum)
{
assert((CSEnum > 0) && (CSEnum <= MAX_CSE_CNT));
return CSEnum - 1;
}

//-----------------------------------------------------------------------------------------------------------------
// getCSEAvailBit: Return the bit used by CSE dataflow sets (bbCseGen, etc.) for the availability bit for a CSE.
//
static unsigned getCSEAvailBit(unsigned CSEnum)
{
return genCSEnum2bit(CSEnum) * 2;
}

//-----------------------------------------------------------------------------------------------------------------
// getCSEAvailCrossCallBit: Return the bit used by CSE dataflow sets (bbCseGen, etc.) for the availability bit
// for a CSE considering calls as killing availability bit (see description above).
//
static unsigned getCSEAvailCrossCallBit(unsigned CSEnum)
{
return getCSEAvailBit(CSEnum) + 1;
}

void optPrintCSEDataFlowSet(EXPSET_VALARG_TP cseDataFlowSet, bool includeBits = true);

EXPSET_TP cseCallKillsMask; // Computed once - A mask that is used to kill available CSEs at callsites

/* Generic list of nodes - used by the CSE logic */
Expand Down Expand Up @@ -6872,26 +6903,28 @@ class Compiler
return (enckey & ~TARGET_SIGN_BIT) << CSE_CONST_SHARED_LOW_BITS;
}

/**************************************************************************
* Value Number based CSEs
*************************************************************************/
/**************************************************************************
* Value Number based CSEs
*************************************************************************/

// String to use for formatting CSE numbers. Note that this is the positive number, e.g., from GET_CSE_INDEX().
#define FMT_CSE "CSE #%02u"

public:
void optOptimizeValnumCSEs();

protected:
void optValnumCSE_Init();
unsigned optValnumCSE_Index(GenTree* tree, Statement* stmt);
unsigned optValnumCSE_Locate();
void optValnumCSE_InitDataFlow();
void optValnumCSE_DataFlow();
void optValnumCSE_Availablity();
void optValnumCSE_Heuristic();
bool optValnumCSE_Locate();
void optValnumCSE_InitDataFlow();
void optValnumCSE_DataFlow();
void optValnumCSE_Availablity();
void optValnumCSE_Heuristic();

bool optDoCSE; // True when we have found a duplicate CSE tree
bool optValnumCSE_phase; // True when we are executing the optValnumCSE_phase
unsigned optCSECandidateTotal; // Grand total of CSE candidates for both Lexical and ValNum
unsigned optCSECandidateCount; // Count of CSE's candidates, reset for Lexical and ValNum CSE's
bool optValnumCSE_phase; // True when we are executing the optOptimizeValnumCSEs() phase
unsigned optCSECandidateCount; // Count of CSE's candidates
unsigned optCSEstart; // The first local variable number that is a CSE
unsigned optCSEcount; // The total count of CSE's introduced.
BasicBlock::weight_t optCSEweight; // The weight of the current block when we are doing PerformCSE
Expand All @@ -6916,6 +6949,7 @@ class Compiler
bool optConfigDisableCSE();
bool optConfigDisableCSE2();
#endif

void optOptimizeCSEs();

struct isVarAssgDsc
Expand Down Expand Up @@ -9337,6 +9371,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
bool disasmWithGC; // Display GC info interleaved with disassembly.
bool disDiffable; // Makes the Disassembly code 'diff-able'
bool disAddr; // Display process address next to each instruction in disassembly code
bool disAlignment; // Display alignment boundaries in disassembly code
bool disAsm2; // Display native code after it is generated using external disassembler
bool dspOrder; // Display names of each of the methods that we ngen/jit
bool dspUnwind; // Display the unwind info output
Expand Down
19 changes: 0 additions & 19 deletions src/coreclr/jit/compiler.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -763,25 +763,6 @@ inline double getR8LittleEndian(const BYTE* ptr)
return *(double*)&val;
}

/*****************************************************************************
*
* Return the normalized index to use in the EXPSET_TP for the CSE with
* the given CSE index.
* Each GenTree has the following field:
* signed char gtCSEnum; // 0 or the CSE index (negated if def)
* So zero is reserved to mean this node is not a CSE
* and postive values indicate CSE uses and negative values indicate CSE defs.
* The caller of this method must pass a non-zero postive value.
* This precondition is checked by the assert on the first line of this method.
*/

inline unsigned int genCSEnum2bit(unsigned index)
{
assert((index > 0) && (index <= EXPSET_SZ));

return (index - 1);
}

#ifdef DEBUG
const char* genES2str(BitVecTraits* traits, EXPSET_TP set);
const char* refCntWtd2str(BasicBlock::weight_t refCntWtd);
Expand Down
Loading

0 comments on commit 02ccdac

Please sign in to comment.