
JIT: Port loop alignment to new loop representation #95836

Merged: 5 commits into dotnet:main on Dec 11, 2023

Conversation

@jakobbotsch (Member) commented Dec 10, 2023

  • Revive Bruce's old Simplify alignment block marking #62940: do alignment candidate identification during the main alignment phase, but now using the new loop structures.
  • To avoid having to maintain loop information into the backend, we reidentify loops during the phase, so there is some TP cost. It is minimized by the fact that we can conservatively know whether there might be any loops at all. Also, the DFS we need here is likely to be reused in the future for block layout.
  • Introduce BBF_OLD_LOOP_HEADER_QUIRK; it is a simpler version of the m_newToOldLoop array that we currently have. The next work item will be to switch all the loop-related quirks onto this flag instead of using m_newToOldLoop.
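To make the reidentification idea concrete, here is a minimal, hypothetical sketch of the underlying pattern: a DFS over the flow graph whose back edges (edges targeting a block still on the DFS stack) mark candidate loop headers. The `Block` and `FindLoopHeaders` names are illustrative only, not RyuJIT's actual API, and real natural-loop finding additionally checks dominance.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Block
{
    std::vector<int> succs; // successor block indices
};

// Mark every block that is the target of a DFS back edge reachable from
// 'entry'; these are the candidate loop headers.
std::vector<bool> FindLoopHeaders(const std::vector<Block>& blocks, int entry)
{
    enum State { White, Gray, Black };
    std::vector<State> state(blocks.size(), White);
    std::vector<bool> isHeader(blocks.size(), false);

    // Iterative DFS: each stack entry is (block, next successor index).
    std::vector<std::pair<int, size_t>> stack;
    stack.push_back({entry, 0});
    state[entry] = Gray;

    while (!stack.empty())
    {
        int    n = stack.back().first;
        size_t i = stack.back().second;
        if (i < blocks[n].succs.size())
        {
            stack.back().second++;
            int succ = blocks[n].succs[i];
            if (state[succ] == White)
            {
                state[succ] = Gray;
                stack.push_back({succ, 0});
            }
            else if (state[succ] == Gray)
            {
                // 'succ' is still on the DFS stack, so n -> succ is a back edge.
                isHeader[succ] = true;
            }
        }
        else
        {
            state[n] = Black;
            stack.pop_back();
        }
    }
    return isHeader;
}
```

A cheap conservative pre-check ("could this method possibly contain a loop?") can skip this walk entirely, which is where most of the TP cost is saved.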

A few alignment-related diffs are expected -- sometimes from different insertion of alignment (because the old BasicBlock::bbNatLoopNum was not up to date), other times from more alignment because of the weight change (see #62940 for the details). Also, we now allow compaction of some loop tops that previously would be rejected from being marked with the "align" flag, so that also causes some diffs.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 10, 2023
@ghost ghost assigned jakobbotsch Dec 10, 2023
@ghost commented Dec 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch (Member, Author) commented Dec 10, 2023

Detailed throughput (TP) diff for benchmarks.run_pgo win-x64:

Base: 115920087166, Diff: 116185166330, +0.2287%

88008437  : +21.66%  : 23.81% : +0.0759% : `Compiler::fgRunDfs<`Compiler::fgComputeDfs'::`2'::<lambda_1>,`Compiler::fgComputeDfs'::`2'::<lambda_2> >'::`2'::<lambda_1>::operator()                                                         
44706430  : +65.61%  : 12.09% : +0.0386% : public: static class FlowGraphNaturalLoops * __cdecl FlowGraphNaturalLoops::Find(class FlowGraphDfsTree const *)                                                                                
31255833  : +70.36%  : 8.46%  : +0.0270% : private: static bool __cdecl FlowGraphNaturalLoops::FindNaturalLoopBlocks(class FlowGraphNaturalLoop *, class jitstd::list<struct BasicBlock *, class jitstd::allocator<struct BasicBlock *>> &)
28438047  : +21.39%  : 7.69%  : +0.0245% : BasicBlock::VisitAllSuccs<`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >                                                                                                   
17122462  : +62.60%  : 4.63%  : +0.0148% : public: bool __cdecl FlowGraphNaturalLoop::ContainsBlock(struct BasicBlock *)                                                                                                                   
16611283  : +73.05%  : 4.49%  : +0.0143% : BasicBlock::VisitRegularSuccs<``FlowGraphNaturalLoops::Find'::`4'::<lambda_1>::operator()'::`2'::<lambda_1> >                                                                                   
16530668  : NA       : 4.47%  : +0.0143% : public: void __cdecl Compiler::optIdentifyLoopsForAlignment(class FlowGraphNaturalLoops *, class BlockToNaturalLoopMap *)                                                                       
12342223  : +0.48%   : 3.34%  : +0.0106% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                         
12281199  : +1.06%   : 3.32%  : +0.0106% : public: bool __cdecl Compiler::fgUpdateFlowGraph(bool, bool)                                                                                                                                    
11971741  : +21.53%  : 3.24%  : +0.0103% : VisitEHSuccs<0,`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >                                                                                                              
9970991   : +68.07%  : 2.70%  : +0.0086% : public: static class BlockToNaturalLoopMap * __cdecl BlockToNaturalLoopMap::Build(class FlowGraphNaturalLoops *)                                                                                
9760342   : +24.01%  : 2.64%  : +0.0084% : public: void __cdecl jitstd::vector<struct FlowEdge *, class jitstd::allocator<struct FlowEdge *>>::push_back(struct FlowEdge *const &)                                                         
5158149   : NA       : 1.40%  : +0.0044% : `FlowGraphNaturalLoop::VisitLoopBlocksReversePostOrder<`Compiler::optIdentifyLoopsForAlignment'::`16'::<lambda_1> >'::`2'::<lambda_1>::operator()                                               
4852081   : +11.53%  : 1.31%  : +0.0042% : public: struct FlowEdge * __cdecl Compiler::BlockPredsWithEH(struct BasicBlock *)                                                                                                               
1649730   : +0.98%   : 0.45%  : +0.0014% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                         
1351647   : +12.52%  : 0.37%  : +0.0012% : Compiler::fgRunDfs<`Compiler::fgComputeDfs'::`2'::<lambda_1>,`Compiler::fgComputeDfs'::`2'::<lambda_2> >                                                                                        
872195    : +45.38%  : 0.24%  : +0.0008% : public: enum PhaseStatus __cdecl Compiler::placeLoopAlignInstructions(void)                                                                                                                     
743280    : +11.59%  : 0.20%  : +0.0006% : public: void __cdecl ArrayStack<class AllSuccessorEnumerator>::Emplace<class Compiler *const, struct BasicBlock *&>(class Compiler *const &&, struct BasicBlock *&)                             
617223    : +1.16%   : 0.17%  : +0.0005% : private: void __cdecl Compiler::optComputeLoopSideEffectsOfBlock(struct BasicBlock *, class FlowGraphNaturalLoop *)                                                                             
557460    : +12.13%  : 0.15%  : +0.0005% : public: class FlowGraphDfsTree * __cdecl Compiler::fgComputeDfs(void)                                                                                                                           
551023    : +25.44%  : 0.15%  : +0.0005% : BasicBlock::VisitEHEnclosedHandlerSecondPassSuccs<`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >                                                                           
-484770   : -0.11%   : 0.13%  : -0.0004% : __security_check_cookie                                                                                                                                                                         
-628164   : -2.06%   : 0.17%  : -0.0005% : public: void __cdecl Compiler::optFindLoops(void)                                                                                                                                               
-690205   : -4.59%   : 0.19%  : -0.0006% : public: void __cdecl Compiler::fgRemoveBlock(struct BasicBlock *, bool)                                                                                                                         
-824970   : -0.46%   : 0.22%  : -0.0007% : protected: void __cdecl Compiler::compInitOptions(class JitFlags *)                                                                                                                             
-865299   : -99.38%  : 0.23%  : -0.0007% : HasOldChildLoop                                                                                                                                                                                 
-3163170  : -0.35%   : 0.86%  : -0.0027% : memset                                                                                                                                                                                          
-7689284  : -100.00% : 2.08%  : -0.0066% : protected: void __cdecl Compiler::AddContainsCallAllContainingLoops(class FlowGraphNaturalLoop *)                                                                                               
-8272891  : -4.74%   : 2.24%  : -0.0071% : public: void __cdecl Compiler::fgCompactBlocks(struct BasicBlock *, struct BasicBlock *)                                                                                                        
-29121896 : -8.18%   : 7.88%  : -0.0251% : public: bool __cdecl Compiler::fgCanCompactBlocks(struct BasicBlock *, struct BasicBlock *)                                                                                                     

A couple of small things that could probably regain a bit:

  • FlowGraphNaturalLoops::Find always computes entry/exit edges, which we don't need here
  • optIdentifyLoopsForAlignment could be combined with the loop in placeLoopAlignInstructions to avoid an iteration over the blocks/loops (I've left a TODO-Cleanup about this)
  • The worklist used by FlowGraphNaturalLoops::FindNaturalLoopBlocks should be an ArrayStack to avoid a lot of allocations

Don't think I'll look at these as part of this PR, though.

@jakobbotsch (Member, Author) commented Dec 10, 2023

Diffs

There seem to be some cases where we no longer align because a loop that was natural back during loop finding is no longer natural after the optimization phases have run. For example, in Lowering.InstructionReplacement:TESTtoBT we have a natural loop in the flow graph after "Find loops"; however, after "Optimize layout" we end up with a flow graph that no longer has any natural loops. The opposite is also true -- sometimes we have unnatural loops that become natural after optimizations, so those get aligned now.

I'd suggest we wait and see what perf triage finds, and if we hit any cases we can either see if we should tone back the flow graph optimizations that break naturality, or generalize loop alignment to try to handle more general SCCs.

@jakobbotsch jakobbotsch marked this pull request as ready for review December 11, 2023 07:57
@jakobbotsch (Member, Author) commented Dec 11, 2023

cc @dotnet/jit-contrib PTAL @kunalspathak @BruceForstall

Diffs with alignment enabled. See above for reasons why. Also, I have some ideas for some clean ups that should get some TP back above, but would like to consider them separately.

@kunalspathak (Member) left a comment:

Looks much cleaner. Added some questions/comments.

}

#if FEATURE_EH_CALLFINALLY_THUNKS
if (!top->IsFirst() && top->Prev()->KindIs(BBJ_CALLFINALLY))
Member:

This and below condition can be combined under same !top->IsFirst() check?

Member Author:

Makes sense, will do

}
else if ((loopTop->bbNatLoopNum != BasicBlock::NOT_IN_LOOP) && visitedLoopNum[loopTop->bbNatLoopNum])
Member:

How are we making sure that we won't end up with a situation where a loop block appears before the top of the loop? Or is that not possible in the new scheme? Is there a way to put an assert for such cases?

Member Author:

It's not possible -- optIdentifyLoopsForAlignment will only add BBF_LOOP_ALIGN to the lexical top-most block of candidates for loop alignment. Asserting it is not so simple, but I expect with my follow-up cleanup it will look more obvious (we will have a bit set like before tracking loops we have seen)
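The "bit set tracking loops we have seen" idea can be sketched as follows; this is a hypothetical illustration, not JIT source. Walking blocks in lexical order and recording the first block seen for each loop number guarantees the top block is handled before any other block of its loop.

```cpp
#include <cassert>
#include <vector>

const unsigned NOT_IN_LOOP = ~0u;

// For each loop number, record the lexically first block carrying it; any
// later block of the same loop finds its bit already set.
std::vector<unsigned> FindLexicalLoopTops(const std::vector<unsigned>& blockLoopNum,
                                          unsigned loopCount)
{
    std::vector<bool>     visitedLoopNum(loopCount, false);
    std::vector<unsigned> loopTops;
    for (unsigned i = 0; i < blockLoopNum.size(); i++)
    {
        unsigned loopNum = blockLoopNum[i];
        if ((loopNum != NOT_IN_LOOP) && !visitedLoopNum[loopNum])
        {
            visitedLoopNum[loopNum] = true; // first block of this loop
            loopTops.push_back(i);
        }
    }
    return loopTops;
}
```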

@@ -7072,7 +7078,7 @@ class Compiler
bool optLoopTableValid; // info in loop table should be valid
bool optLoopsRequirePreHeaders; // Do we require that all loops (in the loop table) have pre-headers?
unsigned char optLoopCount; // number of tracked loops
unsigned char loopAlignCandidates; // number of loops identified for alignment
unsigned loopAlignCandidates; // number of loops identified for alignment
Member:

nit: formatting of the comment

Member Author:

Hmm, odd that jit-format didn't fix this.

@@ -10146,7 +10150,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
// based on experimenting with various benchmarks.

// Default minimum loop block weight required to enable loop alignment.
#define DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT 4
#define DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT 3
Member:

what is this change about?

Member Author:

It's from #62940:

To avoid too many diffs, DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT is
reduced from 4 to 3. This is because many loops (without profile data)
end up with a weight of 4, and if those loops are cloned, the weight
of the hot path gets reduced to 3.96 (so the cold path gets at least
some non-zero weight). We still want these hot path cloned loops to
be aligned. However, decreasing this does cause some more ASP.NET
loops with profile data and weights between 3 and 4 to be aligned.
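The arithmetic behind the threshold change can be spelled out; this is an illustrative sketch where the helper names are hypothetical and only the 0.99 scale factor and the 4 to 3 threshold change come from the discussion above.

```cpp
#include <cassert>

// Weight the fast (hot) path of a cloned loop keeps, per loop cloning's
// 0.99 scale factor.
double FastPathWeightAfterCloning(double loopWeight)
{
    const double fastPathWeightScaleFactor = 0.99;
    return loopWeight * fastPathWeightScaleFactor;
}

// A block is an alignment candidate when its weight meets the minimum.
bool QualifiesForAlignment(double blockWeight, double minBlockWeight)
{
    return blockWeight >= minBlockWeight;
}
```

A loop without profile data typically weighs 4; after cloning, the hot path drops to 3.96, which the old threshold of 4 rejects but the new threshold of 3 accepts.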

// If the original loop is aligned, do not align the cloned loop because cloned loop will be executed in
// rare scenario. Additionally, having to align cloned loop will force us to disable some VEX prefix encoding
// and adding compensation for over-estimated instructions.
if (newBlk->isLoopAlign())
Member:

Are we planning to fix this? There was a significant cost to including the VEX prefix encoding if we continue to align the cloned loops as well.

@jakobbotsch (Member, Author) Dec 11, 2023:

I think we don't align cloned loops because they are given a very low bbWeight by loop cloning, so optIdentifyLoopsForAlignment knows that it isn't profitable.

Member:

can we verify that somehow?

Member Author:

I did an ad hoc verification like this:

diff --git a/src/coreclr/jit/block.h b/src/coreclr/jit/block.h
index ae480701066..b854226a03a 100644
--- a/src/coreclr/jit/block.h
+++ b/src/coreclr/jit/block.h
@@ -431,6 +431,7 @@ enum BasicBlockFlags : unsigned __int64
                                                           // This is just to reduce diffs from removing BBJ_NONE.
                                                           // (TODO: Remove this quirk after refactoring Compiler::fgFindInsertPoint)
     BBF_OLD_LOOP_HEADER_QUIRK          = MAKE_BBFLAG(42), // Block was the header ('entry') of a loop recognized by old loop finding
+    BBF_CLONED_LOOP_HEADER             = MAKE_BBFLAG(43),
 
     // The following are sets of flags.
 
diff --git a/src/coreclr/jit/compiler.cpp b/src/coreclr/jit/compiler.cpp
index 853aa203037..b137e4348e8 100644
--- a/src/coreclr/jit/compiler.cpp
+++ b/src/coreclr/jit/compiler.cpp
@@ -5366,6 +5366,7 @@ void Compiler::optIdentifyLoopsForAlignment(FlowGraphNaturalLoops* loops, BlockT
             loopAlignCandidates++;
             assert(!top->isLoopAlign());
             top->SetFlags(BBF_LOOP_ALIGN);
+            assert(!loop->GetHeader()->HasFlag(BBF_CLONED_LOOP_HEADER));
             JITDUMP("Aligning " FMT_LP " that starts at " FMT_BB ", weight=" FMT_WT " >= " FMT_WT ".\n",
                     loop->GetIndex(), top->bbNum, topWeight, compareWeight);
         }
diff --git a/src/coreclr/jit/loopcloning.cpp b/src/coreclr/jit/loopcloning.cpp
index 36ba5609cb7..5002f69acd5 100644
--- a/src/coreclr/jit/loopcloning.cpp
+++ b/src/coreclr/jit/loopcloning.cpp
@@ -2197,6 +2197,11 @@ void Compiler::optCloneLoop(FlowGraphNaturalLoop* loop, LoopCloneContext* contex
         return BasicBlockVisit::Continue;
     });
 
+    BasicBlock* newHeader = nullptr;
+    bool result = blockMap->Lookup(loop->GetHeader(), &newHeader);
+    assert(result);
+    newHeader->SetFlags(BBF_CLONED_LOOP_HEADER);
+
     // Perform the static optimizations on the fast path.
     optPerformStaticOptimizations(loop, context DEBUGARG(true));
 
diff --git a/src/coreclr/scripts/superpmi.py b/src/coreclr/scripts/superpmi.py
index a06461be609..04b41614562 100644
--- a/src/coreclr/scripts/superpmi.py
+++ b/src/coreclr/scripts/superpmi.py
@@ -1869,7 +1869,7 @@ class SuperPMIReplayAsmDiffs:
 
         # These vars are force overridden in the SPMI runs for both the base and diff, always.
         replay_vars = {
-            "DOTNET_JitAlignLoops": "0", # disable loop alignment to filter noise
+            #"DOTNET_JitAlignLoops": "0", # disable loop alignment to filter noise
             "DOTNET_JitEnableNoWayAssert": "1",
             "DOTNET_JitNoForceFallback": "1",
         }

I hit the assert 2 times in coreclr_tests, but nowhere else.
I think the most reasonable "fix" would be to adjust this code in loop cloning:

// We assume that the fast path will run 99% of the time, and thus should get 99% of the block weights.
// The slow path will, correspondingly, get only 1% of the block weights. It could be argued that we should
// mark the slow path as "run rarely", since it really shouldn't execute (given the currently optimized loop
// conditions) except under exceptional circumstances.
const weight_t fastPathWeightScaleFactor = 0.99;
const weight_t slowPathWeightScaleFactor = 1.0 - fastPathWeightScaleFactor;

It should perhaps be even rarer than 1/100, or mark the cold loop as "run rarely" like the comment says... but I think we can consider it as a follow-up (or leave it as is, since 2 cases in coreclr_tests probably doesn't matter)

Member:

did you verify on all the collections? Can you commit this change and see how superpmi-replay run goes?

Member Author:

I'll check the rest of the collections locally. I'd prefer not to have to go through two rounds of CI again for this... I do not think loop alignment should concern itself with how loop cloning created loops -- it should be loop cloning's responsibility to properly mark them as cold.

Member:

I'll check the rest of the collections locally

sure, let me know if you find out other failures.

it should be loop cloning's responsibility to properly mark them as cold.

yes, that's how it should be ideally, but don't think we do it today and the fact that you hit 2 asserts confirms it.

Member Author:

yes, that's how it should be ideally, but don't think we do it today and the fact that you hit 2 asserts confirms it.

Sure -- but I don't think there is a requirement that 100% of cloned loops aren't aligned. I assume the problem in the past was that loop cloning would copy bbFlags directly, which already had BBF_LOOP_ALIGN set in it. Then we would end up aligning for all cold loops if the hot loop counterpart was aligned. But with the new approach there is a separate profitability calculation for both loops. If we think that the weighing in loop cloning is not right then sure, we should fix it, but I don't think it's a problem we need to fix as part of this PR.

I'll let you know the replay results once I have them.

Member Author:

Results: https://gist.github.com/jakobbotsch/03209c3e80fc9c6715a8a790fe8c697a

Looks like the same for all collections except win-x86 has 4 coreclr_tests contexts that hit the assert (it looks like the two extra ones are just duplicate contexts of the other two, not sure why they're only duplicated in the win-x86 collection)

Member:

But with the new approach there is a separate profitability calculation for both loops

That makes sense and thanks for running the replay.

@BruceForstall (Member) left a comment:

LGTM

@kunalspathak (Member) left a comment:

LGTM

@amanasifkhalid (Member):

@jakobbotsch I'm guessing #95773 introduced merge conflicts here -- sorry about that!

@jakobbotsch (Member, Author):

No worries at all!

@jakobbotsch jakobbotsch merged commit d4bfbf1 into dotnet:main Dec 11, 2023
123 of 129 checks passed
@jakobbotsch jakobbotsch deleted the align-new-representation branch December 11, 2023 22:28
BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Dec 15, 2023
Change dotnet#95836 removed a bit too much code, namely, a loop that set
`BBF_COLD` on newly determined cold blocks. Restore that loop.

Add some additional IG jump dumping.

Fixes dotnet#95946
BruceForstall added a commit that referenced this pull request Dec 16, 2023

@github-actions github-actions bot locked and limited conversation to collaborators Jan 19, 2024