
Less codegen parallelism than expected with -C codegen-units=16 #64913

Open
SimonSapin opened this issue Sep 30, 2019 · 17 comments
Labels
A-codegen (Area: Code generation)
I-compiletime (Issue: Problems and improvements with respect to compile times)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@SimonSapin
Contributor

SimonSapin commented Sep 30, 2019

This is the output of cargo build -Z timings for Servo, on a 14 cores / 28 threads machine, filtered with "Min unit time" at 10 seconds.

Edit: with Rust nightly-2019-09-28

a4fd0099519ccbda2e31718d46f1ff0e546886d6

For most of codegen for the script crate only 3 to 4 threads seem to be CPU-bound, leaving other cores idle, despite codegen-units=16 being the default. (Results are similar if I specify it explicitly.)

Shouldn’t CPU usage be much closer to min(ncpu, codegen-units)?
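For reference, a sketch of the invocation behind such a graph (the "Min unit time" filter is a control inside the HTML report that cargo generates, not a command-line flag):

```sh
# -Z flags require a nightly toolchain
cargo +nightly build -Z timings
# Open the generated cargo-timing.html in a browser and set
# "Min unit time" to 10 to hide short compilation units.
```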

@csmoe csmoe added the A-codegen Area: Code generation label Sep 30, 2019
@jonas-schievink jonas-schievink added I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Sep 30, 2019
@andjo403
Contributor

The problem is the single-threaded MIR-to-LLVM-IR codegen, which cannot provide enough work for all the threads doing the parallel LLVM-IR-to-machine-code codegen. As this is a debug build there is not that much for LLVM to do, resulting in low CPU usage; it may look better for a release build.

[image: -Z self-profile trace, debug build]

Thread 0 shows the MIR-to-LLVM-IR translation; the other rows show parallel work items (not real threads) doing the LLVM-IR-to-machine-code codegen.

The image is the result of using the -Z self-profile option to rustc and then filtering out a lot of traces from the output of crox --compact_thread, because Chrome was not able to open the 1.4 GB JSON file on my computer :)
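For anyone wanting to reproduce this, a rough sketch of the workflow described above (crox lives in the rust-lang/measureme repository; the profile file prefix below is illustrative):

```sh
# Build with self-profiling enabled (nightly only); rustc writes
# per-crate profile data files into the working directory.
RUSTFLAGS="-Z self-profile" cargo +nightly build

# Convert the profile into a Chrome trace; --compact_thread merges the
# worker rows so the resulting JSON stays small enough to open.
crox --compact_thread script-12345
# Load the generated chrome_profiler.json in chrome://tracing.
```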

@SimonSapin
Contributor Author

Thanks for looking into it @andjo403! Below is -Z timings for release mode. It’s notable that linking is almost 3× as fast. As for codegen of the script crate, we do see good CPU utilization, but only for some of that time. I’ll try -Z self-profile at some point.

[Screenshot from 2019-09-30 23-34-00: cargo build -Z timings, release mode]

@alexcrichton
Member

This is actually something you may want to investigate with the script crate more deeply, but it's unfortunately difficult to do so. This may also be indicative of a compiler bug!

Translation to LLVM IR is currently single-threaded, so in debug mode we're firing off LLVM modules to get codegen'd in the background, but they codegen so quickly that CPU usage never gets off the ground (most of the time is spent translating to LLVM IR). The release mode case is similar, but has one more synchronization point. Instead of going directly to codegen, the phases look like:

  • One thread translates to LLVM IR
  • As each LLVM module is produced, it is optimized in the background, in parallel
  • When all modules are optimized, a brief moment of serial work happens to perform ThinLTO analysis
  • Then, in parallel and all at once, all modules are optimized again with ThinLTO

From the start of script's purple bar in the release mode case you see humps of large CPU usage; that's where the "translate to LLVM IR" thread produced a bunch of modules that are all being optimized on worker threads. You then see a long consistent spike of optimization, and most likely this is after the ThinLTO passes (the brief moment of single-core usage being the ThinLTO analysis pass). The script crate still has a long plateau of single-core usage at the end, though. This likely means that there's one CGU in particular that takes a huge amount of time to optimize, making it excessively slower than all the others.

This looks very similar to a case I diagnosed a long time ago, and I explained how I tracked it down a while back. We don't have -Z trans-time-graph any more, but you may be able to manually inspect object files to look for huge functions or something like that. (-Z self-profile can surely help here too.)

@andjo403
Contributor

andjo403 commented Oct 1, 2019

Did a run with -Z self-profile for the release build, and @alexcrichton is correct: there is one CGU that takes extra long. I do not know what that CGU contains, because I do not know how to build only the script crate so as to save the LLVM IR files.
[image: -Z self-profile trace, release build]
Edit: the log is filtered so that all events shorter than 1000 µs are removed: crox --collapse-threads --minimum-duration=1000

@alexcrichton
Member

Man those graphs are so cool! It's so visually apparent what's happening there. You can easily see the waterfall of codegen units being spawned for optimization, you can see the brief serial period, and then everything getting optimized at once. Neato!

But yeah so from that we can clearly see that the ThinLTO passes are taking forever on one particular codegen unit, and the normal optimization passes are taking ~2x longer than all other CGUs. All that needs to be done is to go find the name of that CGU in the rlib and dig it out. The best way to do that is probably to:

  • Use ar x to extract the contents of the rlib
  • Look for an object file *.o of a matching name, as well as a *.z file.
  • The *.z file follows a strange bytecode encoding which you'd need custom code to extract and look at. We should probably change that to *.gz at some point.
  • The *.o file you should be able to poke around in, perhaps just looking for any huge functions. If something looks abnormally massive, that may mean it's taking a while.

In general though I don't really know great ways to dig into this. I'm not sure how to take a slow CGU in LLVM and then diagnose why it's slow. Step one is getting the LLVM IR out, but that's difficult enough to do with rustc.

One thing that may also work is to compile with RUSTFLAGS=-Csave-temps, which should spew artifacts all over the place; there should be at least one intermediate bytecode file with the matching filename somewhere. (See the sketch below.)
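A sketch of those steps, with assumed file names (the crate hash and CGU suffix will differ per build):

```sh
# An rlib is an ar archive: unpack its object files and *.z bytecode members.
ar x libscript-<hash>.rlib
ls *.o *.z

# Look for abnormally large functions in the suspect CGU's object file.
nm --print-size --size-sort --demangle script-<hash>.<cgu>.o | tail -n 20

# Alternatively, keep all intermediate artifacts (including LLVM bitcode):
RUSTFLAGS="-C save-temps" cargo +nightly build --release
```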

@andjo403
Contributor

andjo403 commented Oct 4, 2019

Made a little POC adding the clang time-trace feature to the self-profiling, and with that I can see that 34% of the LLVM time for the long CGU is spent compiling <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once. I do not know how the LLVM passes are executed, but there are 16565 occurrences of attempts to optimize this function.

@jdm
Contributor

jdm commented Oct 4, 2019

That likely comes from huge numbers of generated functions from this code generator calling this panic catching helper.

@SimonSapin
Contributor Author

If a single generic function has many instantiations, do they all end up in the same CGU?

@andjo403
Contributor

andjo403 commented Oct 5, 2019

I think that all generic functions instantiated in the same module end up in the same CGU, if I understand the description here: https://github.com/rust-lang/rust/blob/master/src/librustc_mir/monomorphize/partitioning.rs#L44

@michaelwoerister do you have some idea of how to split the CGU more?

rustc also has some generic functions that make one CGU larger than the rest.
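For digging into the partitioning decisions themselves, rustc's internal logging can help (the module path matches the 2019 source layout linked above, and debug-level output may require a compiler built with debug logging enabled, so treat this as an assumption):

```sh
# Dump the CGU partitioning/merging decisions to a log file.
RUSTC_LOG=rustc_mir::monomorphize::partitioning=debug \
    cargo +nightly build 2> partitioning.log
```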

@andjo403
Contributor

andjo403 commented Oct 6, 2019

I have been looking in more detail at what happens in the partitioning for script; here are the stats:

| CGU | Not merged | Post-merge estimated size | Post-inlining estimated size | Optimize time (s) | LTO time (s) | Total time (s) |
|-----|------------|---------------------------|------------------------------|-------------------|--------------|----------------|
| 0   | yes        | 848956                    | 1467083                      | 41.66             | 69.97        | 111.63         |
| 1   | yes        | 524765                    | 585184                       | 25.65             | 18.41        | 44.06          |
| 2   | yes        | 400296                    | 819963                       | 85.67             | 70.48        | 156.15         |
| 3   |            | 321669                    | 564648                       | 37.86             | 71.88        | 109.74         |
| 4   |            | 298598                    | 413214                       | 22.13             | 42.82        | 64.95          |
| 5   |            | 273580                    | 385766                       | 20.98             | 39.33        | 60.31          |
| 6   |            | 257079                    | 567954                       | 27.82             | 48.04        | 75.86          |
| 7   |            | 244523                    | 390903                       | 23.22             | 47.90        | 71.12          |
| 8   |            | 234306                    | 484797                       | 29.75             | 50.10        | 79.85          |
| 9   |            | 226011                    | 614406                       | 47.01             | 52.94        | 99.95          |
| 10  | yes        | 220040                    | 220122                       | 12.98             | 30.26        | 43.24          |
| 11  |            | 219207                    | 380161                       | 20.67             | 30.72        | 51.39          |
| 12  |            | 208098                    | 355017                       | 20.58             | 38.35        | 58.93          |
| 13  |            | 194715                    | 323289                       | 20.27             | 39.03        | 59.30          |
| 14  |            | 188629                    | 439558                       | 28.28             | 40.81        | 69.09          |
| 15  |            | 356672                    | 1221227                      | 89.49             | 118.34       | 207.83         |

So, some conclusions from the data:

  • The merging needs to look at the inlining data to get a better size estimate.
    In this case CGU 15, which after inlining is second largest according to the estimates (and takes the longest time to compile), has been merged.
  • Four CGUs are so large from the beginning that they are not merged with any other CGU, and after merging three are still larger than the rest, so we need to be able to split large CGUs to get more even sizes.
  • The time estimate is sometimes way off, e.g. CGU 0 is estimated to be the largest but takes about half the time of CGU 15.

@SimonSapin
Contributor Author

What is CGU merging?

@andjo403
Contributor

andjo403 commented Oct 6, 2019

It is part of the partitioning algorithm, used when either -Ccodegen-units=<val> has been given or incremental is off; the default is 16 CGUs.
For incremental builds there are sometimes thousands of CGUs, and as the partitioning algorithm is the same, the CGUs need to be merged down to the requested number of CGUs.

@SimonSapin
Contributor Author

So, does "unmerged" mean that one unit for the purpose of codegen only contains one unit for the purpose of incremental?

@andjo403
Contributor

andjo403 commented Oct 7, 2019

yes "unmerged" was an CGU that have not been merged during the merge phase will rename it to "not merged"

Centril added a commit to Centril/rust that referenced this issue Oct 8, 2019
add more info in debug traces for gcu merging

to help in investigation of CGU partitioning problems e.g rust-lang#64913
@andjo403
Contributor

I have made #65281 to handle the first of my points from above.
But I also started to think: for this case, which is using LTO, should the inlined functions even be added to all CGUs?

@andjo403
Contributor

Hmm, seems like someone tested the LTO idea already: #65052

@andjo403
Contributor

I wonder if it is possible to look not only at the size of the CGUs but also at which inlined items they have in common, to get more items with internal linkage and less duplication between CGUs.

bors added a commit that referenced this issue Oct 17, 2019
for a more even partitioning inline before merge

consider the size of the inlined items for a more even partitioning

for me this change takes the compile time for script-servo-opt from 306s to 249s

edit: the times are for a 32-thread CPU.

cc #64913