JIT: Port loop alignment to new loop representation #95836
Conversation
* Revive Bruce's old dotnet#62940: do alignment candidate identification during the main alignment phase, using the new loop structures instead.
* To avoid maintaining loop information into the backend we re-identify loops during the phase, so there is some TP cost. It is minimized by the fact that we can conservatively know whether there might be any loops. Also, the DFS we need to do is likely to be used in the future for block layout.
* Introduce `BBF_OLD_LOOP_HEADER_QUIRK`; it is a simpler version of the `m_newToOldLoop` array that we currently have. The next work item will be to switch all the loop-related quirks onto this flag instead of using `m_newToOldLoop`.

A few alignment-related diffs are expected -- sometimes from different insertion of alignment (because the old `BasicBlock::bbNatLoopNum` was not up to date), other times from more alignment because of the weight change (see dotnet#62940 for the details on that). Also, we now allow compaction of some loop headers that previously would be rejected from being marked with the "align" flag, so that also causes some diffs.
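As a rough illustration of the re-identification the phase now does: a natural loop's body can be recovered from a back edge by walking predecessors from the latch back to the header, which is roughly what `FlowGraphNaturalLoops::FindNaturalLoopBlocks` does. Below is a minimal standalone sketch on a toy CFG -- the block/edge representation here is invented for illustration and is not the JIT's actual data structures:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy CFG: blocks are ints, preds[b] lists the predecessors of block b.
// Given a back edge (latch -> header), the natural loop body is every
// block that can reach the latch without passing through the header.
static std::set<int> NaturalLoopBlocks(const std::vector<std::vector<int>>& preds,
                                       int header, int latch)
{
    std::set<int> body{header, latch};
    std::vector<int> worklist{latch};
    while (!worklist.empty())
    {
        int b = worklist.back();
        worklist.pop_back();
        if (b == header)
            continue; // don't walk backwards out of the loop
        for (int p : preds[b])
        {
            if (body.insert(p).second)
            {
                worklist.push_back(p);
            }
        }
    }
    return body;
}
```

For a CFG `0 -> 1 -> 2 -> 3 -> 4` with back edge `3 -> 1`, this recovers the body `{1, 2, 3}`.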
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Detailed tp diff for benchmarks.run_pgo win-x64: Base: 115920087166, Diff: 116185166330, +0.2287%
88008437 : +21.66% : 23.81% : +0.0759% : `Compiler::fgRunDfs<`Compiler::fgComputeDfs'::`2'::<lambda_1>,`Compiler::fgComputeDfs'::`2'::<lambda_2> >'::`2'::<lambda_1>::operator()
44706430 : +65.61% : 12.09% : +0.0386% : public: static class FlowGraphNaturalLoops * __cdecl FlowGraphNaturalLoops::Find(class FlowGraphDfsTree const *)
31255833 : +70.36% : 8.46% : +0.0270% : private: static bool __cdecl FlowGraphNaturalLoops::FindNaturalLoopBlocks(class FlowGraphNaturalLoop *, class jitstd::list<struct BasicBlock *, class jitstd::allocator<struct BasicBlock *>> &)
28438047 : +21.39% : 7.69% : +0.0245% : BasicBlock::VisitAllSuccs<`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >
17122462 : +62.60% : 4.63% : +0.0148% : public: bool __cdecl FlowGraphNaturalLoop::ContainsBlock(struct BasicBlock *)
16611283 : +73.05% : 4.49% : +0.0143% : BasicBlock::VisitRegularSuccs<``FlowGraphNaturalLoops::Find'::`4'::<lambda_1>::operator()'::`2'::<lambda_1> >
16530668 : NA : 4.47% : +0.0143% : public: void __cdecl Compiler::optIdentifyLoopsForAlignment(class FlowGraphNaturalLoops *, class BlockToNaturalLoopMap *)
12342223 : +0.48% : 3.34% : +0.0106% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
12281199 : +1.06% : 3.32% : +0.0106% : public: bool __cdecl Compiler::fgUpdateFlowGraph(bool, bool)
11971741 : +21.53% : 3.24% : +0.0103% : VisitEHSuccs<0,`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >
9970991 : +68.07% : 2.70% : +0.0086% : public: static class BlockToNaturalLoopMap * __cdecl BlockToNaturalLoopMap::Build(class FlowGraphNaturalLoops *)
9760342 : +24.01% : 2.64% : +0.0084% : public: void __cdecl jitstd::vector<struct FlowEdge *, class jitstd::allocator<struct FlowEdge *>>::push_back(struct FlowEdge *const &)
5158149 : NA : 1.40% : +0.0044% : `FlowGraphNaturalLoop::VisitLoopBlocksReversePostOrder<`Compiler::optIdentifyLoopsForAlignment'::`16'::<lambda_1> >'::`2'::<lambda_1>::operator()
4852081 : +11.53% : 1.31% : +0.0042% : public: struct FlowEdge * __cdecl Compiler::BlockPredsWithEH(struct BasicBlock *)
1649730 : +0.98% : 0.45% : +0.0014% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
1351647 : +12.52% : 0.37% : +0.0012% : Compiler::fgRunDfs<`Compiler::fgComputeDfs'::`2'::<lambda_1>,`Compiler::fgComputeDfs'::`2'::<lambda_2> >
872195 : +45.38% : 0.24% : +0.0008% : public: enum PhaseStatus __cdecl Compiler::placeLoopAlignInstructions(void)
743280 : +11.59% : 0.20% : +0.0006% : public: void __cdecl ArrayStack<class AllSuccessorEnumerator>::Emplace<class Compiler *const, struct BasicBlock *&>(class Compiler *const &&, struct BasicBlock *&)
617223 : +1.16% : 0.17% : +0.0005% : private: void __cdecl Compiler::optComputeLoopSideEffectsOfBlock(struct BasicBlock *, class FlowGraphNaturalLoop *)
557460 : +12.13% : 0.15% : +0.0005% : public: class FlowGraphDfsTree * __cdecl Compiler::fgComputeDfs(void)
551023 : +25.44% : 0.15% : +0.0005% : BasicBlock::VisitEHEnclosedHandlerSecondPassSuccs<`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >
-484770 : -0.11% : 0.13% : -0.0004% : __security_check_cookie
-628164 : -2.06% : 0.17% : -0.0005% : public: void __cdecl Compiler::optFindLoops(void)
-690205 : -4.59% : 0.19% : -0.0006% : public: void __cdecl Compiler::fgRemoveBlock(struct BasicBlock *, bool)
-824970 : -0.46% : 0.22% : -0.0007% : protected: void __cdecl Compiler::compInitOptions(class JitFlags *)
-865299 : -99.38% : 0.23% : -0.0007% : HasOldChildLoop
-3163170 : -0.35% : 0.86% : -0.0027% : memset
-7689284 : -100.00% : 2.08% : -0.0066% : protected: void __cdecl Compiler::AddContainsCallAllContainingLoops(class FlowGraphNaturalLoop *)
-8272891 : -4.74% : 2.24% : -0.0071% : public: void __cdecl Compiler::fgCompactBlocks(struct BasicBlock *, struct BasicBlock *)
-29121896 : -8.18% : 7.88% : -0.0251% : public: bool __cdecl Compiler::fgCanCompactBlocks(struct BasicBlock *, struct BasicBlock *)

A couple of small things that could probably regain a bit:
Don't think I'll look at these as part of this PR, though.
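For reference, the headline +0.2287% follows directly from the base/diff instruction counts quoted above; a trivial standalone check (not JIT code):

```cpp
#include <cassert>
#include <cmath>

// (diff - base) / base, expressed as a percentage.
static double TpDiffPercent(long long baseCount, long long diffCount)
{
    return 100.0 * (double)(diffCount - baseCount) / (double)baseCount;
}
```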
There seem to be some cases where we no longer align because a loop that was natural back in loop finding is no longer natural after the optimization phases have run. I'd suggest we wait and see what perf triage finds; if we hit any cases we can either see if we should tone back the flow graph optimizations that break naturality, or generalize loop alignment to try to handle more general SCCs.
cc @dotnet/jit-contrib PTAL @kunalspathak @BruceForstall. Diffs with alignment enabled; see above for the reasons why. Also, I have some ideas for some cleanups that should regain some of the TP above, but I would like to consider them separately.
Looks much cleaner. Added some questions/comments.
src/coreclr/jit/compiler.cpp
Outdated
}

#if FEATURE_EH_CALLFINALLY_THUNKS
        if (!top->IsFirst() && top->Prev()->KindIs(BBJ_CALLFINALLY))
This and the below condition can be combined under the same !top->IsFirst() check?
Makes sense, will do
}
else if ((loopTop->bbNatLoopNum != BasicBlock::NOT_IN_LOOP) && visitedLoopNum[loopTop->bbNatLoopNum])
How are we making sure that we won't end up with the situation where a loop block appears before the top of the loop? Or is that not possible in the new scheme? Is there a way to put an assert in for such cases?
It's not possible -- optIdentifyLoopsForAlignment will only add BBF_LOOP_ALIGN to the lexically top-most block of candidates for loop alignment. Asserting it is not so simple, but I expect that with my follow-up cleanup it will look more obvious (we will have a bit set like before, tracking loops we have seen).
src/coreclr/jit/compiler.h
Outdated
@@ -7072,7 +7078,7 @@ class Compiler
     bool optLoopTableValid;         // info in loop table should be valid
     bool optLoopsRequirePreHeaders; // Do we require that all loops (in the loop table) have pre-headers?
     unsigned char optLoopCount;     // number of tracked loops
-    unsigned char loopAlignCandidates; // number of loops identified for alignment
+    unsigned loopAlignCandidates; // number of loops identified for alignment
nit: formatting of the comment
Hmm, odd that jit-format didn't fix this.
@@ -10146,7 +10150,7 @@
 // based on experimenting with various benchmarks.

 // Default minimum loop block weight required to enable loop alignment.
-#define DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT 4
+#define DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT 3
what is this change about?
It's from #62940:
To avoid too many diffs, DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT is
reduced from 4 to 3. This is because many loops (without profile data)
end up with a weight of 4, and if those loops are cloned, the weight
of the hot path gets reduced to 3.96 (so the cold path gets at least
some non-zero weight). We still want these hot path cloned loops to
be aligned. However, decreasing this does cause some more ASP.NET
loops with profile data and weights between 3 and 4 to be aligned.
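The arithmetic in that quote is easy to check: scaling a weight-4 loop's hot path by the 0.99 cloning factor gives 3.96, which fails the old threshold of 4 but passes the new threshold of 3. A standalone sketch (the function name here is invented; the real check lives in optIdentifyLoopsForAlignment):

```cpp
#include <cassert>

// Minimum-block-weight check for loop alignment, as described in the quote.
static bool MeetsAlignmentWeight(double blockWeight, double minBlockWeight)
{
    return blockWeight >= minBlockWeight;
}
```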
// If the original loop is aligned, do not align the cloned loop because cloned loop will be executed in
// rare scenario. Additionally, having to align cloned loop will force us to disable some VEX prefix encoding
// and adding compensation for over-estimated instructions.
if (newBlk->isLoopAlign())
Are we planning to fix this, given that there was a significant cost around VEX prefix encoding if we continue to align the cloned loops as well?
I think we don't align cloned loops because they are given a very low bbWeight by loop cloning, so optIdentifyLoopsForAlignment knows that it isn't profitable.
can we verify that somehow?
I did an ad hoc verification like this:
diff --git a/src/coreclr/jit/block.h b/src/coreclr/jit/block.h
index ae480701066..b854226a03a 100644
--- a/src/coreclr/jit/block.h
+++ b/src/coreclr/jit/block.h
@@ -431,6 +431,7 @@ enum BasicBlockFlags : unsigned __int64
// This is just to reduce diffs from removing BBJ_NONE.
// (TODO: Remove this quirk after refactoring Compiler::fgFindInsertPoint)
BBF_OLD_LOOP_HEADER_QUIRK = MAKE_BBFLAG(42), // Block was the header ('entry') of a loop recognized by old loop finding
+ BBF_CLONED_LOOP_HEADER = MAKE_BBFLAG(43),
// The following are sets of flags.
diff --git a/src/coreclr/jit/compiler.cpp b/src/coreclr/jit/compiler.cpp
index 853aa203037..b137e4348e8 100644
--- a/src/coreclr/jit/compiler.cpp
+++ b/src/coreclr/jit/compiler.cpp
@@ -5366,6 +5366,7 @@ void Compiler::optIdentifyLoopsForAlignment(FlowGraphNaturalLoops* loops, BlockT
loopAlignCandidates++;
assert(!top->isLoopAlign());
top->SetFlags(BBF_LOOP_ALIGN);
+ assert(!loop->GetHeader()->HasFlag(BBF_CLONED_LOOP_HEADER));
JITDUMP("Aligning " FMT_LP " that starts at " FMT_BB ", weight=" FMT_WT " >= " FMT_WT ".\n",
loop->GetIndex(), top->bbNum, topWeight, compareWeight);
}
diff --git a/src/coreclr/jit/loopcloning.cpp b/src/coreclr/jit/loopcloning.cpp
index 36ba5609cb7..5002f69acd5 100644
--- a/src/coreclr/jit/loopcloning.cpp
+++ b/src/coreclr/jit/loopcloning.cpp
@@ -2197,6 +2197,11 @@ void Compiler::optCloneLoop(FlowGraphNaturalLoop* loop, LoopCloneContext* contex
return BasicBlockVisit::Continue;
});
+ BasicBlock* newHeader = nullptr;
+ bool result = blockMap->Lookup(loop->GetHeader(), &newHeader);
+ assert(result);
+ newHeader->SetFlags(BBF_CLONED_LOOP_HEADER);
+
// Perform the static optimizations on the fast path.
optPerformStaticOptimizations(loop, context DEBUGARG(true));
diff --git a/src/coreclr/scripts/superpmi.py b/src/coreclr/scripts/superpmi.py
index a06461be609..04b41614562 100644
--- a/src/coreclr/scripts/superpmi.py
+++ b/src/coreclr/scripts/superpmi.py
@@ -1869,7 +1869,7 @@ class SuperPMIReplayAsmDiffs:
# These vars are force overridden in the SPMI runs for both the base and diff, always.
replay_vars = {
- "DOTNET_JitAlignLoops": "0", # disable loop alignment to filter noise
+ #"DOTNET_JitAlignLoops": "0", # disable loop alignment to filter noise
"DOTNET_JitEnableNoWayAssert": "1",
"DOTNET_JitNoForceFallback": "1",
}
I hit the assert 2 times in coreclr_tests, but nowhere else.
I think the most reasonable "fix" would be to adjust this code in loop cloning:
runtime/src/coreclr/jit/loopcloning.cpp
Lines 2021 to 2026 in 939f28a
// We assume that the fast path will run 99% of the time, and thus should get 99% of the block weights.
// The slow path will, correspondingly, get only 1% of the block weights. It could be argued that we should
// mark the slow path as "run rarely", since it really shouldn't execute (given the currently optimized loop
// conditions) except under exceptional circumstances.
const weight_t fastPathWeightScaleFactor = 0.99;
const weight_t slowPathWeightScaleFactor = 1.0 - fastPathWeightScaleFactor;
It should perhaps be even rarer than 1/100, or we should mark the cold loop as "run rarely" like the comment says... but I think we can consider that as a follow-up (or leave it as is, since 2 cases in coreclr_tests probably don't matter).
Did you verify on all the collections? Can you commit this change and see how the superpmi-replay run goes?
I'll check the rest of the collections locally. I'd prefer not to have to go through two rounds of CI again for this... I do not think loop alignment should concern itself with how loop cloning created loops -- it should be loop cloning's responsibility to properly mark them as cold.
I'll check the rest of the collections locally
sure, let me know if you find out other failures.
it should be loop cloning's responsibility to properly mark them as cold.
Yes, that's how it should be ideally, but I don't think we do it today, and the fact that you hit 2 asserts confirms it.
yes, that's how it should be ideally, but don't think we do it today and the fact that you hit 2 asserts confirms it.
Sure -- but I don't think there is a requirement that 100% of cloned loops aren't aligned. I assume the problem in the past was that loop cloning would copy bbFlags directly, which already had BBF_LOOP_ALIGN set in it. Then we would end up aligning for all cold loops if the hot loop counterpart was aligned. But with the new approach there is a separate profitability calculation for both loops. If we think that the weighting in loop cloning is not right then sure, we should fix it, but I don't think it's a problem we need to fix as part of this PR.
I'll let you know the replay results once I have them.
Results: https://gist.github.com/jakobbotsch/03209c3e80fc9c6715a8a790fe8c697a
Looks like the same for all collections, except win-x86 has 4 coreclr_tests contexts that hit the assert (it looks like the two extra ones are just duplicate contexts of the other two; not sure why they're only duplicated in the win-x86 collection).
But with the new approach there is a separate profitability calculation for both loops
That makes sense and thanks for running the replay.
LGTM
LGTM
@jakobbotsch I'm guessing #95773 introduced merge conflicts here -- sorry about that!
No worries at all!
Change dotnet#95836 removed a bit too much code, namely, a loop that set `BBF_COLD` on newly determined cold blocks. Restore that loop. Add some additional IG jump dumping. Fixes dotnet#95946