
Add AggressiveInlining to Double.CompareTo() and Single.CompareTo() #56501

Merged
merged 3 commits on Aug 20, 2021

Conversation

rickbrew
Contributor

As per discussion in #56493
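
For context, the change amounts to adding [MethodImpl(MethodImplOptions.AggressiveInlining)] to Double.CompareTo and Single.CompareTo; the method bodies are unchanged. A minimal sketch on a stand-in struct (illustrative only, not the actual corelib diff):

using System;
using System.Runtime.CompilerServices;

// Stand-in type for illustration; the real PR annotates System.Double.CompareTo
// and System.Single.CompareTo with the same attribute.
public readonly struct DoubleLike : IComparable<DoubleLike>
{
    private readonly double m_value;

    public DoubleLike(double value) => m_value = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)] // the attribute this PR adds
    public int CompareTo(DoubleLike other)
    {
        double value = other.m_value;
        if (m_value < value) return -1;
        if (m_value > value) return 1;
        if (m_value == value) return 0;

        // At least one of the values is NaN.
        if (double.IsNaN(m_value))
            return double.IsNaN(value) ? 0 : -1;
        return 1;
    }
}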

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the community-contribution label Jul 28, 2021
@dnfadmin

dnfadmin commented Jul 28, 2021

CLA assistant check
All CLA requirements met.

@stephentoub
Member

cc: @tannergooding, @GrabYourPitchforks

@ghost

ghost commented Jul 29, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

As per discussion in #56493

Author: rickbrew
Assignees: -
Labels:

area-System.Runtime, community-contribution

Milestone: -

@stephentoub added the tenet-performance label Jul 29, 2021
@GrabYourPitchforks
Member

I agree with Tanner that special-casing this in the JIT to emit specialized instructions would result in the best codegen. I don't know how difficult that would be in practice, as you don't want to end up in a situation where we're really effective at setting eax to the appropriate value, but where we still incur the series of jumps after the fact. Ideally you'd want the JIT to be able to generate jumps directly to the applicable targets.

@stephentoub
Member

Sounds like the general consensus is this should be merged as-is but there should be an issue for .NET 7 tracking making CompareTo an intrinsic?

@rickbrew
Contributor Author

> Sounds like the general consensus is this should be merged as-is but there should be an issue for .NET 7 tracking making CompareTo an intrinsic?

Do you want to make the changes you suggested above, splitting NaN handling into its own method?

@stephentoub
Member

I'll defer to Tanner/Levi's judgement on that.

@rickbrew
Contributor Author

If the method is split, as you suggested, I believe performance would remain about the same for datasets with large amounts (>99%, although I'm guessing/estimating here) of NaN values, although codegen size would increase due to inlining the first part of the method. But, sorting a list of all NaNs seems like an uncommon/pathological corner case that doesn't need to be considered w.r.t. optimization? @tannergooding might know better though.
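
To make the variant under discussion concrete, here is a hedged sketch of the split shape (stand-in type; the helper name matches the one that appears later in this review, and the NoInlining placement is an assumption that is itself debated below):

using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in illustrating the "split inlined" idea:
// the common comparison path inlines, while the rare NaN path
// stays behind a method call.
public readonly struct SplitDouble : IComparable<SplitDouble>
{
    private readonly double m_value;

    public SplitDouble(double value) => m_value = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int CompareTo(SplitDouble other)
    {
        double value = other.m_value;
        if (m_value < value) return -1;
        if (m_value > value) return 1;
        if (m_value == value) return 0;

        // Rare path: at least one of the inputs is NaN.
        return CompareToWithAtLeastOneNaN(value);
    }

    [MethodImpl(MethodImplOptions.NoInlining)] // assumption; see the review thread below
    private int CompareToWithAtLeastOneNaN(double value) =>
        double.IsNaN(m_value) ? (double.IsNaN(value) ? 0 : -1) : 1;
}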

@tannergooding
Member

tannergooding commented Jul 30, 2021

I agree that even when you have NaN it's likely to be more of a corner case than not. However, I would like to see more numbers/data of how this impacts sparse data sets or other scenarios first.

Likewise, I think that overall the best thing is to get the JIT to treat this intrinsically. Today we have:

if (m_value < value) return -1;
if (m_value > value) return 1;
if (m_value == value) return 0;

// At least one of the values is NaN.
if (double.IsNaN(m_value))
    return double.IsNaN(value) ? 0 : -1;
else
    return 1;

which generates assembly that is effectively:

    L0000: vzeroupper
    L0003: vmovsd xmm0, [rcx]
    L0007: vucomisd xmm1, xmm0
    L000b: ja short L0029
    L000d: vucomisd xmm0, xmm1
    L0011: ja short L0032
    L0013: vucomisd xmm0, xmm1
    L0017: jp short L001b
    L0019: je short L002f
    L001b: vucomisd xmm0, xmm0
    L001f: jp short L0023
    L0021: je short L0032
    L0023: vucomisd xmm1, xmm1
    L0027: jp short L002f
    L0029: mov eax, 0xffffffff
    L002e: ret
    L002f: xor eax, eax
    L0031: ret
    L0032: mov eax, 1
    L0037: ret

In practice, though, we need no more than 3 ucomisd (1 if neither input is NaN) plus a brief set of jumps, so we could generate something more like:

entry:
  xor      eax,  eax     ; Clear result to zero
  vmovsd   xmm0, [rcx]   ; Load "this" into xmm0
  vucomisd xmm0, xmm1    ; Compare "this" to "value"
  jp       nan           ; Parity flag is set, one input is NaN (unpredicted)
  jc       less_than     ; Carry flag is set, "this" is less than "value"
  jnz      greater_than  ; Zero flag is not set, "this" is greater than "value"
  ret                    ; Return value is already zero (inputs are equal)
greater_than:
  inc      eax           ; Return value should be one
  ret
less_than:
  dec      eax           ; Return value should be negative one
  ret
nan:
  vucomisd xmm0, xmm0    ; Compare "this" to "this"
  jnp      greater_than  ; Parity flag is not set, "this" is not NaN; "value" is a NaN
  vucomisd xmm1, xmm1    ; Compare "value" to "value"
  jnp      less_than     ; Parity flag is not set, "this" is NaN; "value" is not a NaN
  ret                    ; Return value is already zero (inputs are both NaN)

This would result in approximately 36 bytes of assembly, down from 56 bytes. It has significantly fewer comparisons, which saves us a number of cycles (~3-7 cycles per ucomis on modern CPUs), and many fewer branches in the "tight" window, which should improve branch predictor throughput. This could probably be even more efficient and branch-predictor friendly with CMOVcc, but we don't generate those today.

Intrinsically recognizing CompareTo might also be better for scenarios where users are relying on CompareTo for comparison, as we could then correctly optimize something like if (value.CompareTo(other) < 0) to be just value < other and not generate the rest of the code. Although I'm not sure how useful that would be in practice.

Likewise, even without intrinsifying, we should get the JIT to understand cases like:

  vucomisd xmm0, xmm0
  jp short label1
  je short label2
label1:
  vucomisd xmm1, xmm1
  ; more code
label2:
  ; more code

and have it generate:

  vucomisd xmm0, xmm0
  je short label2
  vucomisd xmm1, xmm1
  ; more code
label2:
  ; more code

@danmoseley
Member

@rickbrew it sounds like the next action here is to do some quick perf tests including lots of NaNs.

@tannergooding do we not have any perf tests for comparing numbers? Maybe I'm missing them...
https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Runtime/Perf.Double.cs

@danmoseley
Member

@rickbrew do you expect to be able to gather this data in time to possibly get this into 6.0?

@rickbrew
Contributor Author

@danmoseley Cutoff for .NET 6 is 8/17? No, that won't be possible for me.

I also wasn't sure if that was up to me or @tannergooding, since #56493 was assigned to him. Chalk it up to being a first-timer in the dotnet/runtime repo.

I can still take a look in a bit; I've been very busy finishing, preparing, and stabilizing Paint.NET 4.3 (finally ported to .NET 5).

@danmoseley
Member

Congratulations on the port! Do you notice any differences compared to running it on .NET Framework?

@rickbrew
Contributor Author

rickbrew commented Aug 15, 2021

Oh yes, lots. It's faster, maybe by about 15% across the board, before doing any further optimizations that weren't possible with Framework. The exception is startup performance, which takes about a 30% penalty due to using a lot of C++/CLI code, which can't be crossgen'd. I interop heavily with the likes of D2D, WIC, et al. That will slowly evaporate/improve as I port all of that over to C# (@tannergooding's TerraFX will be put to good use).

SCD is fantastic for simplifying and alleviating a lot of various install issues. Installation and updating is much faster because I crossgen on my build machine instead of NGEN on the user's system. I've also been able to sink a lot of time into SIMD optimizations, both to add newly optimized code paths and to port many others from native C code that I was p/invoking. ARM64 support is great, really glad that landed in 5.0.9 for WinForms and WPF. Being able to load plugins into isolated AssemblyLoadContexts is a good thing and solves a lot of issues.

Plugin compatibility remains a struggle, although most work fine as-is w/o recompilation (including super old .NET 2.0 or even 1.1 DLLs). Some framework DLLs have been removed (System.Windows.Forms.DataVisualization, WCF), some classes have been removed (Menu and ContextMenu), and a few things have bugs in them (XmlSerializer dies when used in a collectible AssemblyLoadContext, but that's fixed in .NET 6 and I have a workaround). I'm steadily working through as many of those as possible with various workarounds, fixes, or even hot patching via Mono.Cecil. Some plugin authors will be putting out new updates, while others have wandered elsewhere in life and so their plugins will need app-side fixes/workarounds/shims.

Download size is also a struggle. SCD grows my installer from ~12MB to >50MB, so hopefully my hosting provider doesn't get stressed out. I'm also actively pursuing ways to trim the size through various creative methods (e.g. #56699).

Having access to all the new stuff in the runtime and framework, more libraries on nuget, and the massively improved JIT/codegen is really great. It finally feels like I'm working with the runtime now, instead of working around it.

@danmoseley
Member

That's a great result so far, thank you for sharing. cc @richlander, although I'm guessing he's aware of your progress.

@danmoseley
Member

@tannergooding I will be out next week. Do you think it would be reasonable to take this change Monday with the evidence we have? It seems we have fairly good confidence it will, on balance, be a benefit. Your call.

@danmoseley
Member

danmoseley commented Aug 15, 2021

One other question, @rickbrew, as it wasn't clear: have you experimented with trimming before crossgen? We did a lot of work to make the shared framework safe to trim. WPF/WinForms are not, as you likely know. Cc @eerhardt

@rickbrew
Contributor Author

rickbrew commented Aug 15, 2021

@danmoseley I haven't tried trimming just yet, partly because I couldn't get ILLink's command-line parameters to work. But also, I suspect PDN's trimmability is low because of the plugin situation: plugins need access to both PDN and runtime/framework DLLs. I may look into it more later, after 4.3's release, maybe after .NET 6's release. I see value in being able to trim libraries like Newtonsoft.Json.dll and will definitely need it when I start using TerraFX.Interop.dll (8MB!), neither of which plugins would be using (and they can have their own private copy now anyway, thanks to AssemblyLoadContext).

Also, the biggest bang-for-buck action I can see right now is trimming out R2R data from large framework DLLs that are not used, or not used enough, at least not on the startup path (e.g. PresentationFramework.dll is 17MB, but only 6MB w/o R2R data, see also #56699). An uncrossgen utility would be really helpful to enable an easy, impactful win.

@rickbrew
Contributor Author

rickbrew commented Aug 15, 2021

I decided to start running benchmarks and gathering the data anyway, as I could use a break from the PDN stuff.

Anyway, the first result I'm seeing is that @stephentoub's "split inlined" idea is actually faster than the fully inlined version, at least on NaN-less data. Still fleshing out the benchmark to get data on various mixtures of NaNs, but I wanted to share that first before anything got merged. Nope, that wasn't necessarily right; I'll wait until I've got enough data to profess any actual conclusions 😂

@rickbrew
Contributor Author

rickbrew commented Aug 15, 2021

I've created some benchmark code in this repository: https://github.com/rickbrew/CompareInliningBenchmarks

VS 2022 17.0.0 Preview 3.0
Ryzen 5950X, PBO enabled, 64GB DDR4-3600 RAM
Because PBO (precision boost overdrive, aka auto-overclocking) is temperature sensitive, I'm running the CPU fans at max to help it run at a consistent clock speed. Otherwise it might run at a slower speed later in the benchmark run. There's also a portable AC unit in the room.

The benchmark consists of creating an array with several million elements, sized so as to fit just within the 32MB L3 cache available to each chiplet on the CPU. The array is filled with values 0 through N-1 (ints cast to float/double). The NaNPercentage determines how many NaNs are evenly spread throughout the array. Then the array is shuffled using Fisher-Yates. The random seed is the same every time, for both floats and doubles.

The implementation of Sort() is Timsort, pulled from my Paint.NET codebase, which has full interfaces+structs+generics+inlining. It is not exactly the same as the framework's Sort(), which I'm told boxes the comparer (which would be very bad for this benchmark).
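
For reference, a minimal sketch of the data setup described above (the method name and the exact NaN-spreading scheme are assumptions; the authoritative code is in the linked repository):

using System;
using System.Linq;

static double[] CreateBenchmarkData(int count, int nanPercentage, int seed)
{
    // Fill with 0..count-1, then spread NaNs (approximately) evenly.
    double[] data = Enumerable.Range(0, count).Select(i => (double)i).ToArray();
    if (nanPercentage > 0)
    {
        int stride = 100 / nanPercentage;
        for (int i = 0; i < count; i += stride)
            data[i] = double.NaN;
    }

    // Fisher-Yates shuffle with a fixed seed, so floats and doubles
    // see the same ordering on every run.
    var rng = new Random(seed);
    for (int i = count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        (data[i], data[j]) = (data[j], data[i]);
    }
    return data;
}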

Here are the results for doubles with 4M elements in the array:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host]     : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT
  Job-JXESFV : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT
| Method           | NaNPercentage |     Mean |   Error |   StdDev |   Median | Ratio |
|------------------|--------------:|---------:|--------:|---------:|---------:|------:|
| SortStandard     |             0 | 296.9 ms | 2.36 ms |  2.20 ms | 297.6 ms |  1.00 |
| SortFullyInlined |             0 | 192.4 ms | 3.58 ms |  3.35 ms | 191.4 ms |  0.65 |
| SortSplitInlined |             0 | 197.1 ms | 2.74 ms |  2.57 ms | 197.2 ms |  0.66 |
| SortStandard     |            10 | 299.1 ms | 2.45 ms |  2.29 ms | 299.6 ms |  1.00 |
| SortFullyInlined |            10 | 193.1 ms | 3.20 ms |  2.99 ms | 192.6 ms |  0.65 |
| SortSplitInlined |            10 | 197.4 ms | 3.03 ms |  2.84 ms | 196.3 ms |  0.66 |
| SortStandard     |            20 | 295.4 ms | 4.62 ms |  4.33 ms | 295.2 ms |  1.00 |
| SortFullyInlined |            20 | 187.6 ms | 3.65 ms |  4.21 ms | 186.8 ms |  0.64 |
| SortSplitInlined |            20 | 192.1 ms | 2.60 ms |  2.43 ms | 192.2 ms |  0.65 |
| SortStandard     |            30 | 293.2 ms | 5.50 ms |  5.40 ms | 291.7 ms |  1.00 |
| SortFullyInlined |            30 | 177.0 ms | 3.54 ms |  7.14 ms | 173.9 ms |  0.63 |
| SortSplitInlined |            30 | 192.1 ms | 3.82 ms |  6.69 ms | 189.9 ms |  0.68 |
| SortStandard     |            40 | 275.6 ms | 2.54 ms |  2.26 ms | 275.7 ms |  1.00 |
| SortFullyInlined |            40 | 166.0 ms | 3.29 ms |  5.94 ms | 163.9 ms |  0.63 |
| SortSplitInlined |            40 | 184.7 ms | 3.62 ms |  4.17 ms | 184.7 ms |  0.68 |
| SortStandard     |            50 | 271.7 ms | 5.40 ms | 10.27 ms | 268.9 ms |  1.00 |
| SortFullyInlined |            50 | 158.7 ms | 3.17 ms |  6.69 ms | 156.7 ms |  0.59 |
| SortSplitInlined |            50 | 180.1 ms | 3.60 ms |  7.18 ms | 177.6 ms |  0.67 |

And for floats with 8M elements in the array. Same amount of memory and bandwidth, but higher compute cost:

| Method           | NaNPercentage |     Mean |   Error |  StdDev | Ratio |
|------------------|--------------:|---------:|--------:|--------:|------:|
| SortStandard     |             0 | 867.2 ms | 1.69 ms | 1.49 ms |  1.00 |
| SortFullyInlined |             0 | 640.1 ms | 2.66 ms | 2.22 ms |  0.74 |
| SortSplitInlined |             0 | 670.8 ms | 3.08 ms | 2.88 ms |  0.77 |
| SortStandard     |            10 | 827.5 ms | 5.48 ms | 5.13 ms |  1.00 |
| SortFullyInlined |            10 | 642.7 ms | 3.07 ms | 2.87 ms |  0.78 |
| SortSplitInlined |            10 | 661.3 ms | 3.64 ms | 3.40 ms |  0.80 |
| SortStandard     |            20 | 808.0 ms | 3.01 ms | 2.82 ms |  1.00 |
| SortFullyInlined |            20 | 593.6 ms | 2.23 ms | 2.08 ms |  0.73 |
| SortSplitInlined |            20 | 643.8 ms | 3.05 ms | 2.85 ms |  0.80 |
| SortStandard     |            30 | 748.6 ms | 2.88 ms | 2.41 ms |  1.00 |
| SortFullyInlined |            30 | 565.9 ms | 2.54 ms | 2.38 ms |  0.76 |
| SortSplitInlined |            30 | 617.4 ms | 3.49 ms | 3.26 ms |  0.82 |
| SortStandard     |            40 | 726.1 ms | 5.10 ms | 4.77 ms |  1.00 |
| SortFullyInlined |            40 | 504.0 ms | 1.86 ms | 1.74 ms |  0.69 |
| SortSplitInlined |            40 | 576.4 ms | 2.98 ms | 2.64 ms |  0.79 |
| SortStandard     |            50 | 671.1 ms | 6.76 ms | 6.32 ms |  1.00 |
| SortFullyInlined |            50 | 456.7 ms | 1.51 ms | 1.34 ms |  0.68 |
| SortSplitInlined |            50 | 548.9 ms | 2.18 ms | 1.82 ms |  0.82 |

I'm not sure that having 50% NaNs (or even 10%!) is a realistic scenario, but it does illustrate some divergence in results between fully- and split-inlined.

I didn't go above 50% NaNs: by the time you reach 100% NaNs the data is fully sorted, and then what's the point of sorting? It didn't seem like a realistic scenario either.

@rickbrew
Contributor Author

So this illustrates that for sorting an array of elements, inlining CompareTo() has great results. It's hard to say what the impact would be in other scenarios that have more/less reliance on CompareTo() for their performance.

@tannergooding
Member

Given the above, and that split inlining is still a lot faster than no inlining, I think it's reasonable as a stopgap.

@danmoseley, do you have any concerns with us merging this for RC1? If the official benchmarks show any regression not covered here or on older hardware, we'd likely need to pull it back out.


private int CompareToWithAtLeastOneNaN(double value)
Member

Does the manual split inlining actually make the code significantly smaller in real-world situations vs. just aggressively inlining the whole thing? The call introduced by the split inlining is going to have secondary effects like extra register spilling, so it is not obvious to me that it is actually profitable.

Member

@tannergooding Aug 17, 2021

The entire inlined assembly is fairly large: #56501 (comment)

NaN checks in particular aren't done very efficiently today, and we should try to improve that in .NET 7. So with the partial inlining, the call will remove 2-4 extra branches from the inlined code path (in favor of a call, which may spill, on a probably unlikely path).

In the link, I gave an example of what we could generate for this entire comparison if we did something "more optimally".

Member

I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic or even simply improve the general handling for these kinds of checks and branches.

Member

@jkotas Aug 17, 2021

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYmpBkIYBvGgzNNdxFAwCyACgCUx0+YC+Nd9S27cGKBzAMBgAVVQwaE2pzJnIkBmAICEkGAElcWWgMGAATAH0RcWxYYIgAMQ5JSQBPFK5JXi4cuzYYbGyAeTqq2TFsLgAebIgOYGkAPgZsBxczSOjogDNoBjsGoN4GAF4GOjUGDf7JlgAZGC4AcwwACwY4BnI93gBqJ+mo+Y/eBZWC3uKyirVWr1RrZOzYADavAAumhJlCGE97tCnIc6E5iAB2BgLbCSXAwDTvD5Yhh+DiEmYMVzRKlUiE2GDXCDZFLiSR2RnM1ns9piPgQLi4FgAQXO51guFwvAUMGBDQa5wc0KpOhicTWBlEfxgJXKlRqdQaTQaDAAIsNRjAGPhcgo8RS4RaRtIGPbJBS3tE5h99t87Lb3RSGIcgzAMdi4A8qdEvitAw7rRMwxH7kTfX743bE1ttimmNjdjHzAB6EsMEVBaR4IKC60Qb7Xa1h3D7VsAOWw7ZYxbMcbsQxdgjSnfbAezHvDXoz5lJg6tLBHXbs+YA/DsGCBbtHifMYPiYL35qSd9FPLTd8w4gkkql0pkcvltUVdRAevUMPLQc1Wh0uj0+kGS1xkmadZiPJYoBWTUNm2XZ9hDI5Tgua5bjTfYXjAmd+1+F8SnfXhPyNb9IRhOFSMRZFUR2VNcQPdNfVJclKV3GlzE0XcGSZK4WTZMQOS5HieX4vkBSFUVxUlaVZS/RVlVVSxYn2HgtUKf4CKIkETS4c1gOtBNJzhedXRTKkfU+f0DODUNE1TKMGIsrMwwYZNbILNMj37KzrU2PM3NJItd2iUlcP+AB1QirkrU4a06GBR3HMM4RTBzqQ46IxCgGVsCydVlKCULXwi65otaXw4oS01nStG0J0dBhjObWyzKPMsKyrMra0aBgGzJK4msnVteA7LseyC8x+0axdcAS7yHCwjM5z06aErXDct3so99wJI9guxU83A8IA=

shows the behavior of the two options in a simple real world case. Notice that the IsSorted_CompareToFullyInlined (code size 0x59) produces smaller code than IsSorted_CompareToSplitInlined (code size 0x6f). The obvious problem is that the slow path is a good inlining candidate and the JIT decides to inline it. Once you try to fix it by marking the slow path with NoInlining:

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYmpBkIYBvGgzNNdxFAwCyACgCUx0+YC+Nd9S27cGKBzAMBgAVVQwaE2pzJnIkBmAICEkGAElcWWgMGAATAH0RcWxYYIgAMQ5JSQBPFK5JXi4cuzYYbGyAeTqq2TFsLgAebIgOYGkAPgZsBxczSOjogDNoBjsGoN4GAF4GOjUGDf7JlgAZGC4AcwwACwY4BnI93gBqJ+mo+Y/eBZWC3uKyirVWr1RrZOzYADavAAumhJlCGE97tCnIc6E5iAB2BgLbCSXAwDTvD5Yhh+DiEmYMVzRKlUiE2GDXCDZFLiSR2RnM1ns9piPgQLi4FgAQXO51guFwvAUMGBDQa5wc0KpOhicTWBlEfxgJXKlRqdQaTQaDAAIsNRjAGPhcgo8RS4RaRtIGPbJBS3tE5h99t87Lb3RSGIcgzAMdi4A8qdEvitAw7rRMwxH7kTfX743bE1ttimmNjdjHzAB6EsMEVBaR4IKC60Qb7Xa1h3D7VsAOWw7ZYxbMcbsQxdgjSnfbAezHvDXoz5lJg6tLBHXbs+YA/DsGCBbtHifMYPiYL35qSd9FPLTd8w4gkkql0pkcvltUVdRAevUMPLQc1Wh0uj0+kGS1xkmadZiPJYoBWTUNm2XZ9hDI5Tgua5bjTfYXjAmd+1+F8SnfXhPyNb9IRhOFSMRZFUR2VNcQPdNfVJclKV3GlzE0XcGSZK4WTZMQOS5HieX4vkBSFUVxUlaVZS/RVlVVSxYn2HgtUKf4CKIkETS4c1gOtBNJzhedXRTKkfU+f0DODUNE1TKMGIsrMwwYZNbILNMj37KzrU2PM3NJItd2iUlcP+AB1QirkrU4a06GBR3HMM4RTBzqQ46IuO5PiBO43jeX5XhBWFdsIFki55N3MQoBlbAsnVZSglC18IuuaLWl8OKEtNZ0rRtCdHQYYzm1ssyjzLCsq3a2tGgYBsySuYbJ1bXgOy7HsgvMfshsXXAEu8hwsIzOc9J2hK1w3Ld7KPfcCSPYLsVPNwPCAA

it gets a bit better, but the code size of the non-split inlined version is still smaller.

> I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic

I agree that it would be nice to teach the JIT to deal with this efficiently. If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlining, as my examples demonstrated.

Member

@tannergooding Aug 17, 2021

> If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlining as my examples demonstrated.

I'm fine with this. I would like to disagree that the examples adequately demonstrate the difference, however, as smaller code isn't necessarily better.

There are many factors that can impact perf here including the likelihood that a given path will be hit (NaNs are likely rare) and considerations such as the number of branches in a "small window" (16-32 aligned bytes of assembly). This was why I was initially hesitant about the change as it could regress certain "hot loops" due to the additional branches (7) that would now be present directly in the loop. Of course, the partial inlining could also impact this in interesting ways with its overall larger codegen but it does reduce the branch count by 3 compared to full inlining.

Member

I agree that you would likely be able to come up with cases where the split inlining is better due to micro-architecture reasons, if you tried hard enough. The data we have so far in this thread is that a simple AggressiveInlining produces faster and smaller code.

Member

To satisfy my own curiosity, I tried it in double.cs in corelib. Here main is what's currently checked in, pr1 is CompareTo aggressively inlined, and pr2 is the split version.

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

// Benchmark bodies as posted; the usings and wrapper class (name assumed)
// were added so the snippet compiles standalone.
public class CompareToBenchmarks
{
    private double[] _doubles = Enumerable.Range(0, 1_000_000).Select(i => i * 1.23).ToArray();
    private double[] _scratch = new double[1_000_000];

    [Benchmark]
    public bool IsSorted()
    {
        ReadOnlySpan<double> a = _doubles;
        for (int i = 0; i < a.Length - 1; i++)
            if (a[i].CompareTo(a[i + 1]) > 0)
                return false;
        return true;
    }

    [Benchmark]
    public void CopyAndSort()
    {
        _doubles.CopyTo(_scratch, 0);
        Array.Sort(_scratch);
    }

    [Benchmark]
    public int Search() => Array.BinarySearch(_doubles, 2_000_000);

    [Benchmark]
    public int CompareSequence() => _doubles.AsSpan().SequenceCompareTo(_doubles);
}
| Method          | Toolchain         |             Mean | Ratio | Code Size |
|-----------------|-------------------|-----------------:|------:|----------:|
| IsSorted        | main\corerun.exe  |  1,921,366.71 ns |  1.00 |     155 B |
| IsSorted        | pr1\corerun.exe   |    739,726.37 ns |  0.39 |      97 B |
| IsSorted        | pr2\corerun.exe   |    970,601.44 ns |  0.50 |     117 B |
| CopyAndSort     | main\corerun.exe  | 11,761,377.29 ns |  1.00 |   1,060 B |
| CopyAndSort     | pr1\corerun.exe   | 11,810,071.67 ns |  1.00 |   1,060 B |
| CopyAndSort     | pr2\corerun.exe   | 11,837,290.42 ns |  1.01 |   1,060 B |
| Search          | main\corerun.exe  |         58.53 ns |  1.00 |     207 B |
| Search          | pr1\corerun.exe   |         35.51 ns |  0.61 |     207 B |
| Search          | pr2\corerun.exe   |         38.65 ns |  0.66 |     207 B |
| CompareSequence | main\corerun.exe  |  1,829,633.22 ns |  1.00 |     330 B |
| CompareSequence | pr1\corerun.exe   |  1,452,953.83 ns |  0.79 |     306 B |
| CompareSequence | pr2\corerun.exe   |  1,523,349.48 ns |  0.83 |     316 B |

I asked whether it would make sense to split it. I'm fine with the answer being "no" 😄

Contributor Author

Alright, I have reverted the commit that split the inlining, so we're back to just regular aggressive inlining.

@danmoseley
Member

> @danmoseley, do you have any concerns with us merging this for RC1? If the official benchmarks show any regression not covered here or on older hardware, we'd likely need to pull it back out.

Nope, it'd be very low risk if we have to pull it out, and it sounds like you've concluded it's a worthwhile change in the real world.

@tannergooding added this to the 7.0.0 milestone Aug 19, 2021
@ghost

ghost commented Aug 19, 2021

Hello @tannergooding!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@stephentoub
Member

@tannergooding, is the plan to backport this?

@jeffhandley
Member

@stephentoub Yep. @tannergooding and I chatted about it earlier. Before we merge into release/6.0 though, I'd like to make sure the latest perf run against main doesn't indicate any unintended regressions (even though the likelihood is low).

@GrabYourPitchforks
Member

> Before we merge into release/6.0 though, I'd like to make sure the latest perf run against main doesn't indicate any unintended regressions

As @rickbrew mentioned earlier, we may see a minor size-on-disk regression as would accompany most "this method is now inlined where previously it wasn't" changes. I assume we're still ok with that?

@jeffhandley
Member

Assuming it's truly minor, yeah.

@ghost locked as resolved and limited conversation to collaborators Nov 3, 2021