Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip FP registers processing if method don't have floating points use #85744

Merged
merged 4 commits into from
May 6, 2023

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented May 4, 2023

Fixes: #83109
Contributes to #83946

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 4, 2023
@ghost ghost assigned kunalspathak May 4, 2023
@ghost
Copy link

ghost commented May 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: kunalspathak
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@@ -733,6 +733,7 @@ class LinearScan : public LinearScanInterface
unsigned int currentSpill[TYP_COUNT];
bool needFloatTmpForFPCall;
bool needDoubleTmpForFPCall;
bool needFloatingRegisters;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A safe change would be to have needFloatingRegisters |= (theRegisterType != IntRegisterType); in newInterval() method but want to see if my current changes achieve the purpose.

@kunalspathak kunalspathak changed the title Skip processing floating point registers if method doesn't have one Skip FP registers processing if method don't have floating points use May 4, 2023
@kunalspathak
Copy link
Member Author

Nice TP Diffs

image

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstress-jitstressregs, runtime-coreclr jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding
Copy link
Member

tannergooding commented May 4, 2023

Nice TP Diffs

Do we know what causes the potential hit to Arm64 minopts? Seems odd that we see such huge improvements elsewhere, but not always in minopts but only on Arm64

I see tests in particular are the "worst offender", is that maybe due to the number of HWIntrinsic tests? If so, do we know why we don't see similar in full-opts?

@kunalspathak
Copy link
Member Author

Jitstress is green.

@kunalspathak
Copy link
Member Author

I see tests in particular are the "worst offender", is that maybe due to the number of HWIntrinsic tests? If so, do we know why we don't see similar in full-opts?

I will try to find it out.

@kunalspathak kunalspathak marked this pull request as ready for review May 4, 2023 05:53
@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib

@kunalspathak
Copy link
Member Author

I see tests in particular are the "worst offender", is that maybe due to the number of HWIntrinsic tests? If so, do we know why we don't see similar in full-opts?

I will try to find it out.

PS E:\spmi\pintools\1.0\windows> .\analyze-pin-trace-diff.ps1 -b E:\features\lsra\floatregs\base_arm64_coreclr.txt E:\features\lsra\floatregs\diff_arm64_coreclr.txt
Base: 431066232654, Diff: 431355053761, +0.0670%

?buildIntervals@LinearScan@@QEAAXXZ                                                                                           : 571725277  : +46.55%  : 27.20% : +0.1326%
?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z                                                               : 349854912  : +27.75%  : 16.64% : +0.0812%
?resolveRegisters@LinearScan@@QEAAXXZ                                                                                         : 270767431  : +3.38%   : 12.88% : +0.0628%
?impImportAndPushBox@Compiler@@IEAAXPEAUCORINFO_RESOLVED_TOKEN@@@Z                                                            : 2230836    : +6.41%   : 0.11%  : +0.0005%
?impPushOnStack@Compiler@@IEAAXPEAUGenTree@@VtypeInfo@@@Z                                                                     : -2256082   : -0.75%   : 0.11%  : -0.0005%
?impImportCall@Compiler@@IEAA?AW4var_types@@W4opcode_t@@PEAUCORINFO_RESOLVED_TOKEN@@1PEAUGenTree@@HPEAUCORINFO_CALL_INFO@@I@Z : -2438804   : -0.20%   : 0.12%  : -0.0006%
??$allocateRegisters@$0A@@LinearScan@@QEAAXXZ                                                                                 : -436643544 : -1.57%   : 20.77% : -0.1013%
?buildPhysRegRecords@LinearScan@@AEAAXXZ                                                                                      : -464797849 : -100.00% : 22.11% : -0.1078%

I investigated the top methods and one thing that stands out is that earlier the for-loop of registers was unrolled and now it no longer does, but iterating it fewer times is still better.
image

The call to buildPhysRegsRecord() is now getting inlined inside buildIntervals().

image

Other then that, the only extra thing I am doing is setting needNonIntegerRegisters = true inside internalFloatRegCandidates().

@kunalspathak
Copy link
Member Author

It also occurred to me that for xarch, we already updated the definition of AVAILABLE_REG_COUNT to a getter, but for arm64, it was a constant. This PR makes arm64 definition to getter as well and that could reflect the regression on arm64 and not on x64.

Comment on lines +1949 to +1950
// For that `compFloatingPointUsed` should be set accurately
// before invoking allocator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this bit in particular is a bit difficult as we need to pre-emptively set compFloatingPointUsed for any SIMD code. But that can't really account for dead code elimination or other factors that might later remove or transforms the nodes to no longer need it.

Copy link
Member

@tannergooding tannergooding May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it instead be possible to have regMaskTP LinearScan::internalFloatRegCandidates() lazily do the below, or is there other factors that require it?

That is, have it do something like:

regMaskTP LinearScan::internalFloatRegCandidates()
{
    if (!needNonIntegerRegisters)
    {
        // lazily initialize fpreg state here

        for (unsigned int i = 0; i < lsraRegOrderFltSize; i++)
        {
            regNumber  reg  = lsraRegOrderFlt[i];
            RegRecord* curr = &physRegs[reg];
            curr->regOrder  = (unsigned char)i;
        }

        needNonIntegerRegisters = true;
    }

    if (compiler->compFloatingPointUsed)
    {
        return availableFloatRegs;
    }
    else
    {
        return RBM_FLT_CALLEE_TRASH;
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we need the physRegRecord early right after identifying candidates.

Copy link
Member Author

@kunalspathak kunalspathak May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that can't really account for dead code elimination or other factors that might later remove or transforms the nodes to no longer need it.

True, but this is the best we can do I suppose, unless you are proposing to just start from scratch in LSRA and set it to true during BuildNode() if tree->GetType() != TYP_INT.

Edit: Doing the check in BuildNode() might prove costly though because we will be doing that for every single GenTree* node and not sure how many cases we will optimize for the scenarios where we didn't account for dead code elimination.

@kunalspathak kunalspathak merged commit 8854509 into dotnet:main May 6, 2023
@kunalspathak kunalspathak deleted the lsra-throughput branch May 6, 2023 03:37
@ghost ghost locked as resolved and limited conversation to collaborators Jun 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segregate the register iterations based on intType/floatType
3 participants