
rustc_codegen_ssa: tune codegen scheduling to reduce memory usage #81736

Merged
merged 1 commit into rust-lang:master on Feb 5, 2021

Conversation

tgnottingham
Contributor

For better throughput during parallel processing by LLVM, we used to sort
CGUs largest to smallest. This would lead to better thread utilization
by, for example, preventing a large CGU from being processed last and
having only one LLVM thread working while the rest remained idle.

However, this strategy would lead to high memory usage, as it meant the
LLVM-IR for all of the largest CGUs would be resident in memory at once.

Instead, we can compromise by ordering CGUs such that the largest and
smallest are first, second largest and smallest are next, etc. If there
are large size variations, this can reduce memory usage significantly.

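A minimal sketch of that ordering (hypothetical code for illustration only; the real implementation in rustc_codegen_ssa operates on CodegenUnits and their size estimates, not bare integers):

// Hypothetical sketch: order sizes largest, smallest, second largest,
// second smallest, ..., so the biggest CGUs are spread out over the run
// rather than all resident in memory at once.
fn interleave_largest_smallest(mut sizes: Vec<usize>) -> Vec<usize> {
    sizes.sort_unstable_by(|a, b| b.cmp(a)); // descending: largest first
    let mut ordered = Vec::with_capacity(sizes.len());
    let (mut lo, mut hi) = (0, sizes.len());
    while lo < hi {
        ordered.push(sizes[lo]); // take the next largest
        lo += 1;
        if lo < hi {
            hi -= 1;
            ordered.push(sizes[hi]); // then the next smallest
        }
    }
    ordered
}

fn main() {
    assert_eq!(
        interleave_largest_smallest(vec![1, 9, 3, 7, 5]),
        vec![9, 1, 7, 3, 5]
    );
}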
@rust-highfive
Collaborator

r? @davidtwco

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 4, 2021
@tgnottingham
Contributor Author

@rustbot label T-compiler I-compilemem

@rustbot rustbot added I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 4, 2021
@tgnottingham
Contributor Author

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 4, 2021
@bors
Contributor

bors commented Feb 4, 2021

⌛ Trying commit 29711d8 with merge 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b...

@bors
Contributor

bors commented Feb 4, 2021

☀️ Try build successful - checks-actions
Build commit: 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b

@rust-timer
Collaborator

Queued 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b with parent e708cbd, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit (0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 4, 2021
@tgnottingham
Contributor Author

The effect of this change depends entirely on the CGU size distribution, so not seeing an improvement across the board is expected. Also, ignore the keccak-debug and keccak-opt RSS stats; they vary ±15% from run to run.

Results are more consistent for non-incremental full builds. This change benefits crates with large variations in CGU sizes, and my guess is that full builds produce larger variations.

It seems reasonable that they would, because full builds do more CGU merging than incremental builds. Supposing all CGUs start out at roughly the same size, then unless the number of CGUs is just right, merging leaves us with one set of CGUs of size N and another of size 2N. Even if they don't start out at the same size, the merging process can bring the CGUs to roughly the same size, and from there we end up in the same place, as the toy model below illustrates.
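A toy model of that dynamic (a hypothetical sketch; rustc's actual merge policy is more involved), in which repeatedly merging the two smallest of 24 equal-size CGUs down to 16 yields a mix of sizes N and 2N:

// Toy model: 24 CGUs of size N = 1, merged down to 16 CGUs by repeatedly
// combining the two smallest (roughly the dynamic described above).
fn merge_to(mut sizes: Vec<u64>, target: usize) -> Vec<u64> {
    while sizes.len() > target {
        sizes.sort_unstable_by(|a, b| b.cmp(a)); // descending; smallest at end
        let a = sizes.pop().unwrap();
        let b = sizes.pop().unwrap();
        sizes.push(a + b); // merge the two smallest CGUs
    }
    sizes
}

fn main() {
    let merged = merge_to(vec![1; 24], 16);
    // 8 merges consume 16 of the size-N CGUs, leaving 8 of size 2N and 8 of size N.
    assert_eq!(merged.iter().filter(|&&s| s == 2).count(), 8);
    assert_eq!(merged.iter().filter(|&&s| s == 1).count(), 8);
}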

On my system, this change reduces peak memory usage while compiling rustc_middle by a whopping 500MB, for both incremental and non-incremental. That crate has some very outsized CGUs, so the change is particularly helpful there.

@nagisa
Member

nagisa commented Feb 4, 2021

@bors r+

Pretty nice improvements! I don't see any egregious comptime regressions; in particular, it's reassuring that the bootstrap wall time is -0.1% overall.

@bors
Contributor

bors commented Feb 4, 2021

📌 Commit 29711d8 has been approved by nagisa

@bors
Contributor

bors commented Feb 4, 2021

🌲 The tree is currently closed for pull requests below priority 1000. This pull request will be tested once the tree is reopened.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 4, 2021
@tgnottingham
Contributor Author

tgnottingham commented Feb 4, 2021

Thanks @nagisa!

One thing I didn't address with this change is the fact that as we add compiled CGUs to the optimization queue, LLVM threads pick the larger CGUs to work on first. I suspect this doesn't affect memory usage much, because even if we processed them in codegen order, the LLVM threads would finish with the small CGUs quickly, and we'd end up with the largest CGUs being processed concurrently anyway. But it's something to experiment with.

Edit: I think there's more room for improvement than I initially thought. After all, this change doesn't directly address the issue mentioned above (chewing through small CGUs quickly so that we still end up with the largest CGUs in memory). It can help by delaying introduction of large CGUs for a bit, which sometimes means that a large CGU being processed gets finished with and dropped before another large CGU is codegen'd. But I'm sure we can do better than that in many cases.

@SunnyWar

SunnyWar commented Feb 5, 2021

This is fascinating. I did some work years ago on optimal scheduling, and bin packing is an easy and often-used solution. I'm intrigued by the premise of this PR. Consequently, I wrote a little program to explore various scheduling schemes to see how well they might work while also evaluating memory pressure. But to match my model to this work, I need some data.

  1. How many threads is work dispatched to, typically?
  2. How large are the CGUs, and what is the typical range of values?
  3. How many CGUs in total are usually scheduled?

@bors
Contributor

bors commented Feb 5, 2021

⌛ Testing commit 29711d8 with merge 730d6df...

@tgnottingham
Contributor Author

tgnottingham commented Feb 5, 2021

@SunnyWar, thanks for the bin-packing reference. I assumed there had to be some well-studied formalization of the problem or related problems. My hope was that the simple approach in this PR would improve the situation until I or someone could put the time into a better approach.

1. How many threads is work dispatched to, typically?

The number of CPUs or hyperthreads on the system. No consideration of the amount of system memory is made, so as you can imagine, this has potential to cause problems on high CPU count systems without a lot of memory.

2. How large are the CGUs, and what is the typical range of values?

We have two size estimates that we base scheduling decisions on. The first is the number of statements in the MIR of the CGU. The second is the time it takes to codegen the MIR into the initial, unoptimized LLVM-IR. We have the first estimate for all CGUs before we start scheduling; the second only becomes available as we codegen each CGU to LLVM-IR.

I don't know what's typical. It can vary wildly from crate to crate and depend on compilation mode.

3. How many CGUs in total are usually scheduled?

Depends on the mode of compilation: typically 16 for non-incremental builds and up to 256 for incremental builds. By default, that means 16 for release builds and up to 256 for debug builds.

Also, things are slightly more complex than this: there are different kinds of work items that can be in the queue, different phases in the process, and so on. I'm just getting familiar with the area, TBH.

@bors
Contributor

bors commented Feb 5, 2021

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 730d6df to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 5, 2021
@bors bors merged commit 730d6df into rust-lang:master Feb 5, 2021
@rustbot rustbot added this to the 1.51.0 milestone Feb 5, 2021
@SunnyWar

SunnyWar commented Feb 5, 2021

@tgnottingham I did a quick check assuming only 8 threads with 256 tasks. The task "sizes" were evenly distributed over 1-256. The model also assumes the runtime of a task is directly proportional to its size.
Using your method, the result is only a 0.2% reduction in memory pressure with a 5.7% increase in runtime. This makes sense, as the smaller tasks finish quickly, and very soon you have all the largest tasks filling all the threads. However, I also modeled (as a control) a random order. This resulted in a 15.5% reduction in memory pressure with a 5.8% increase in runtime.
These are all theoretical numbers based on a possibly flawed simulation with completely made-up data. But it makes me wonder whether you have considered randomizing the order to see how it does. This also has me thinking of an algorithm to more deterministically "flatten" the memory pressure, which I'll probably fiddle with this weekend.
Thanks for bringing this fascinating problem to my attention.

@tgnottingham
Contributor Author

@SunnyWar, awesome! If you'd like to make your simulation more accurate, feel free to ask more questions, or you can go straight to the source in compiler/rustc_codegen_ssa/src/base.rs and compiler/rustc_codegen_ssa/src/back/write.rs.

One thing I neglected to mention is that we actually have more codegen'd CGUs in memory than there are LLVM threads running, so that when an LLVM thread finishes optimizing one CGU, there's already another codegen'd CGU ready for it to work on in the queue. Long story short, we basically ramp up to keeping 1.5 x ncpus codegen'd CGUs in memory at once during the bulk of processing. This is another thing I'm sure we'd want to make more intelligent, but that's the state of things.
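To put a number on that (an illustrative sketch only, not the actual code; the real logic lives in compiler/rustc_codegen_ssa/src/back/write.rs and is more nuanced):

// Illustrative only: the approximate number of codegen'd CGUs kept in
// memory so that LLVM threads always have queued work available.
fn codegen_ahead_target(ncpus: usize) -> usize {
    ncpus * 3 / 2 // "1.5 x ncpus"
}

fn main() {
    assert_eq!(codegen_ahead_target(8), 12); // 8 CPUs -> ~12 resident CGUs
}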

By the way, I have a change up for review (#81538) that will show the CGU cost estimates in -Z time-passes output. For example, 502 is the cost estimate for the CGU corresponding to this time-passes entry:

time:   0.013; rss:  178MB ->  201MB (  +23MB)  codegen_module(1wpvv7qlzclv3mr5, 502)

If that lands, it will be easy for you to see real cost estimate distributions by compiling crates with the nightly compiler (cargo +nightly rustc -- -Z time-passes).

@tgnottingham
Contributor Author

By the way, here's a contrived example that shows how profitable work in this area could be.

Suppose we have 2 CPUs, 2 jobs of size 8N, 8 jobs of size N, and we ignore a ton of details.

We used to schedule the work something like this:

      Time -->
CPU0: |------8N------||-N||-N||-N||-N|
CPU1: |------8N------||-N||-N||-N||-N|

This utilizes CPUs ideally, and so minimizes runtime (supposing the high memory usage doesn't have detrimental effects). But it maximizes memory usage by working on the largest CGUs concurrently. We'll call the peak memory cost for this schedule 16N.

We could have kept both CPUs busy while minimizing memory usage with this schedule:

      Time -->
CPU0: |------8N------||------8N------|
CPU1: |-N||-N||-N||-N||-N||-N||-N||-N|

The peak memory cost here is 9N, a 44% reduction in memory usage versus the first approach.

Of course, we can't always minimize memory usage and maximize throughput. Any solution to this scheduling problem will need to make good tradeoffs. E.g. if we can reduce memory usage by 30% at a 1% cost to runtime, it's probably the right tradeoff. This is especially true when it enables us to parallelize more (both within rustc and across multiple rustcs spawned by cargo) without risking swapping or thrashing, because then we get a runtime improvement too.
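For what it's worth, a minimal event-driven simulation of this contrived example (assumptions: runtime equals size, a job's memory is resident exactly while it runs, and each job starts on whichever CPU frees up first) reproduces the 16N vs 9N peaks:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Returns the peak total size of concurrently running jobs when jobs are
// started in the given order on `ncpus` CPUs.
fn peak_memory(jobs: &[u64], ncpus: usize) -> u64 {
    // Min-heap of (finish time, size) for the jobs currently running.
    let mut running: BinaryHeap<Reverse<(u64, u64)>> = BinaryHeap::new();
    let (mut time, mut resident, mut peak) = (0u64, 0u64, 0u64);
    for &size in jobs {
        if running.len() == ncpus {
            // All CPUs busy: advance time to the next finish and retire
            // everything that has completed by then.
            let Reverse((finish, done)) = running.pop().unwrap();
            time = finish;
            resident -= done;
            while let Some(&Reverse((f, s))) = running.peek() {
                if f > time {
                    break;
                }
                running.pop();
                resident -= s;
            }
        }
        running.push(Reverse((time + size, size)));
        resident += size;
        peak = peak.max(resident);
    }
    peak
}

fn main() {
    let n = 1u64;
    // Old order: both 8N jobs start together -> peak 16N.
    assert_eq!(peak_memory(&[8 * n, 8 * n, n, n, n, n, n, n, n, n], 2), 16);
    // Large jobs one after another, smalls in between -> peak 9N.
    assert_eq!(peak_memory(&[8 * n, n, n, n, n, n, n, n, n, 8 * n], 2), 9);
}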
