
rustc_codegen_ssa: tune codegen scheduling to reduce memory usage #81736

Merged
merged 1 commit into rust-lang:master on Feb 5, 2021

Conversation

tgnottingham
Contributor

For better throughput during parallel processing by LLVM, we used to sort
CGUs largest to smallest. This would lead to better thread utilization
by, for example, preventing a large CGU from being processed last and
having only one LLVM thread working while the rest remained idle.

However, this strategy would lead to high memory usage, as it meant the
LLVM-IR for all of the largest CGUs would be resident in memory at once.

Instead, we can compromise by ordering CGUs such that the largest and
smallest are first, second largest and smallest are next, etc. If there
are large size variations, this can reduce memory usage significantly.

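A minimal sketch of that ordering (hypothetical code for illustration only; the real implementation in rustc_codegen_ssa operates on CodegenUnits and their size estimates, not bare integers):

// Hypothetical sketch: order sizes largest, smallest, second largest,
// second smallest, ..., so the biggest CGUs are spread out over the run
// rather than all resident in memory at once.
fn interleave_largest_smallest(mut sizes: Vec<usize>) -> Vec<usize> {
    sizes.sort_unstable_by(|a, b| b.cmp(a)); // descending: largest first
    let mut ordered = Vec::with_capacity(sizes.len());
    let (mut lo, mut hi) = (0, sizes.len());
    while lo < hi {
        ordered.push(sizes[lo]); // take the next largest
        lo += 1;
        if lo < hi {
            hi -= 1;
            ordered.push(sizes[hi]); // then the next smallest
        }
    }
    ordered
}

fn main() {
    assert_eq!(
        interleave_largest_smallest(vec![1, 9, 3, 7, 5]),
        vec![9, 1, 7, 3, 5]
    );
}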
@rust-highfive
Collaborator

r? @davidtwco

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 4, 2021
@tgnottingham
Contributor Author

@rustbot label T-compiler I-compilemem

@rustbot rustbot added I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 4, 2021
@tgnottingham
Contributor Author

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 4, 2021
@bors
Contributor

bors commented Feb 4, 2021

⌛ Trying commit 29711d8 with merge 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b...

@bors
Contributor

bors commented Feb 4, 2021

☀️ Try build successful - checks-actions
Build commit: 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b

@rust-timer
Collaborator

Queued 0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b with parent e708cbd, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit (0d4c73fdf92ce3daf07991bde0444eb0b5d8ae9b): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 4, 2021
@tgnottingham
Contributor Author

The effect of this change depends entirely on the CGU size distribution, so not seeing an improvement across the board is expected. Also, ignore the keccak-debug and keccak-opt RSS stats; they vary ±15% from run to run.

Results are more consistent for non-incremental full builds. This change benefits crates with large variations in CGU sizes, and my guess is that full builds produce larger variations.

It seems reasonable that they would, because full builds do more CGU merging than incremental builds. Supposing all CGUs start out at roughly the same size, then unless the number of CGUs is just right, merging leaves us with one set of CGUs of size N and another of size 2N. Even if they don't start out at the same size, the merging process can bring the CGUs to roughly the same size, and from there we end up in the same place, as the toy model below illustrates.
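A toy model of that dynamic (a hypothetical sketch; rustc's actual merge policy is more involved), in which repeatedly merging the two smallest of 24 equal-size CGUs down to 16 yields a mix of sizes N and 2N:

// Toy model: 24 CGUs of size N = 1, merged down to 16 CGUs by repeatedly
// combining the two smallest (roughly the dynamic described above).
fn merge_to(mut sizes: Vec<u64>, target: usize) -> Vec<u64> {
    while sizes.len() > target {
        sizes.sort_unstable_by(|a, b| b.cmp(a)); // descending; smallest at end
        let a = sizes.pop().unwrap();
        let b = sizes.pop().unwrap();
        sizes.push(a + b); // merge the two smallest CGUs
    }
    sizes
}

fn main() {
    let merged = merge_to(vec![1; 24], 16);
    // 8 merges consume 16 of the size-N CGUs, leaving 8 of size 2N and 8 of size N.
    assert_eq!(merged.iter().filter(|&&s| s == 2).count(), 8);
    assert_eq!(merged.iter().filter(|&&s| s == 1).count(), 8);
}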

On my system, this change reduces peak memory usage while compiling rustc_middle by a whopping 500MB, for both incremental and non-incremental. That crate has some very outsized CGUs, so the change is particularly helpful there.

@nagisa
Member

nagisa commented Feb 4, 2021

@bors r+

Pretty nice improvements! I don't see any egregious comptime regressions; in particular, it's reassuring that the bootstrap wall time is -0.1% overall.

@bors
Contributor

bors commented Feb 4, 2021

📌 Commit 29711d8 has been approved by nagisa

@bors
Contributor

bors commented Feb 4, 2021

🌲 The tree is currently closed for pull requests below priority 1000. This pull request will be tested once the tree is reopened.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 4, 2021
@tgnottingham
Contributor Author

tgnottingham commented Feb 4, 2021

Thanks @nagisa!

One thing I didn't address with this change is the fact that as we add compiled CGUs to the optimization queue, LLVM threads pick the larger CGUs to work on first. I suspect this doesn't affect memory usage much, because even if we processed them in codegen order, the LLVM threads would finish with the small CGUs quickly, and we'd end up with the largest CGUs being processed concurrently anyway. But it's something to experiment with.

Edit: I think there's more room for improvement than I initially thought. After all, this change doesn't directly address the issue mentioned above (chewing through small CGUs quickly so that we still end up with the largest CGUs in memory). It can help by delaying introduction of large CGUs for a bit, which sometimes means that a large CGU being processed gets finished with and dropped before another large CGU is codegen'd. But I'm sure we can do better than that in many cases.

@SunnyWar

SunnyWar commented Feb 5, 2021

This is fascinating. I did some work years ago on optimal scheduling, and bin packing is an easy and often-used solution. I'm intrigued by the premise of this PR. Consequently, I wrote a little program to explore various scheduling schemes to see how well they might work while also evaluating memory pressure. But to match my model to this work, I need some data.

  1. How many threads is work dispatched to, typically?
  2. How large are the CGUs, and what is the typical range of values?
  3. How many CGUs in total are usually scheduled?

@bors
Contributor

bors commented Feb 5, 2021

⌛ Testing commit 29711d8 with merge 730d6df...

@tgnottingham
Contributor Author

tgnottingham commented Feb 5, 2021

@SunnyWar, thanks for the bin-packing reference. I assumed there had to be some well-studied formalization of the problem or related problems. My hope was that the simple approach in this PR would improve the situation until I or someone could put the time into a better approach.

1. How many threads is work dispatched to, typically?

The number of CPUs or hyperthreads on the system. No consideration of the amount of system memory is made, so as you can imagine, this has potential to cause problems on high CPU count systems without a lot of memory.

2. How large are the CGUs, and what is the typical range of values?

We have two size estimates that we base scheduling decisions on. The first is the number of statements in the MIR of the CGU. The second is the time it takes to codegen the MIR into the initial, unoptimized LLVM-IR. We have the first estimate for all CGUs before we start scheduling; the second only becomes available as we codegen each CGU to LLVM-IR.

I don't know what's typical. It can vary wildly from crate to crate and depend on compilation mode.

3. How many CGUs in total are usually scheduled?

Depends on the mode of compilation: typically 16 for non-incremental builds and up to 256 for incremental builds. By default, that means 16 for release builds and up to 256 for debug builds.

Also, things are slightly more complex than this: there are different kinds of work items that can be in the queue, different phases in the process, and so on. I'm just getting familiar with the area, TBH.

@bors
Contributor

bors commented Feb 5, 2021

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 730d6df to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 5, 2021
@bors bors merged commit 730d6df into rust-lang:master Feb 5, 2021
@rustbot rustbot added this to the 1.51.0 milestone Feb 5, 2021
@SunnyWar

SunnyWar commented Feb 5, 2021

@tgnottingham I did a quick check assuming only 8 threads with 256 tasks. The task "sizes" were evenly distributed over 1-256. The model also assumes the runtime of a task is directly proportional to its size.
Using your method, the result is only a 0.2% reduction in memory pressure with a 5.7% increase in runtime. This makes sense, as the smaller tasks finish quickly, and very soon you have all the largest tasks filling all the threads. However, I also modeled (as a control) a random order. This resulted in a 15.5% reduction in memory pressure with a 5.8% increase in runtime.
These are all theoretical numbers based on a possibly flawed simulation with completely made-up data. But it makes me wonder whether you have considered randomizing the order to see how it does. This also has me thinking of an algorithm to more deterministically "flatten" the memory pressure, which I'll probably fiddle with this weekend.
Thanks for bringing this fascinating problem to my attention.

@tgnottingham
Contributor Author

@SunnyWar, awesome! If you'd like to make your simulation more accurate, feel free to ask more questions, or you can go straight to the source in compiler/rustc_codegen_ssa/src/base.rs and compiler/rustc_codegen_ssa/src/back/write.rs.

One thing I neglected to mention is that we actually have more codegen'd CGUs in memory than there are LLVM threads running, so that when an LLVM thread finishes optimizing one CGU, there's already another codegen'd CGU ready for it to work on in the queue. Long story short, we basically ramp up to keeping 1.5 x ncpus codegen'd CGUs in memory at once during the bulk of processing. This is another thing I'm sure we'd want to make more intelligent, but that's the state of things.
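To put a number on that (an illustrative sketch only, not the actual code; the real logic lives in compiler/rustc_codegen_ssa/src/back/write.rs and is more nuanced):

// Illustrative only: the approximate number of codegen'd CGUs kept in
// memory so that LLVM threads always have queued work available.
fn codegen_ahead_target(ncpus: usize) -> usize {
    ncpus * 3 / 2 // "1.5 x ncpus"
}

fn main() {
    assert_eq!(codegen_ahead_target(8), 12); // 8 CPUs -> ~12 resident CGUs
}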

By the way, I have a change up for review (#81538) that will show the CGU cost estimates in -Z time-passes output. For example, 502 is the cost estimate for the CGU corresponding to this time-passes entry:

time:   0.013; rss:  178MB ->  201MB (  +23MB)  codegen_module(1wpvv7qlzclv3mr5, 502)

If that lands, it will be easy for you to see real cost estimate distributions by compiling crates with the nightly compiler (cargo +nightly rustc -- -Z time-passes).

@tgnottingham
Contributor Author

By the way, here's a contrived example that shows how profitable work in this area could be.

Suppose we have 2 CPUs, 2 jobs of size 8N, 8 jobs of size N, and we ignore a ton of details.

We used to schedule the work something like this:

      Time -->
CPU0: |------8N------||-N||-N||-N||-N|
CPU1: |------8N------||-N||-N||-N||-N|

This utilizes CPUs ideally, and so minimizes runtime (supposing the high memory usage doesn't have detrimental effects). But it maximizes memory usage by working on the largest CGUs concurrently. We'll call the peak memory cost for this schedule 16N.

We could have kept both CPUs busy while minimizing memory usage with this schedule:

      Time -->
CPU0: |------8N------||------8N------|
CPU1: |-N||-N||-N||-N||-N||-N||-N||-N|

The peak memory cost here is 9N, a 44% reduction in memory usage versus the first approach.

Of course, we can't always minimize memory usage and maximize throughput. Any solution to this scheduling problem will need to make good tradeoffs. E.g. if we can reduce memory usage by 30% at a 1% cost to runtime, it's probably the right tradeoff. This is especially true when it enables us to parallelize more (both within rustc and across multiple rustcs spawned by cargo) without risking swapping or thrashing, because then we get a runtime improvement too.
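For what it's worth, a minimal event-driven simulation of this contrived example (assumptions: runtime equals size, a job's memory is resident exactly while it runs, and each job starts on whichever CPU frees up first) reproduces the 16N vs 9N peaks:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Returns the peak total size of concurrently running jobs when jobs are
// started in the given order on `ncpus` CPUs.
fn peak_memory(jobs: &[u64], ncpus: usize) -> u64 {
    // Min-heap of (finish time, size) for the jobs currently running.
    let mut running: BinaryHeap<Reverse<(u64, u64)>> = BinaryHeap::new();
    let (mut time, mut resident, mut peak) = (0u64, 0u64, 0u64);
    for &size in jobs {
        if running.len() == ncpus {
            // All CPUs busy: advance time to the next finish and retire
            // everything that has completed by then.
            let Reverse((finish, done)) = running.pop().unwrap();
            time = finish;
            resident -= done;
            while let Some(&Reverse((f, s))) = running.peek() {
                if f > time {
                    break;
                }
                running.pop();
                resident -= s;
            }
        }
        running.push(Reverse((time + size, size)));
        resident += size;
        peak = peak.max(resident);
    }
    peak
}

fn main() {
    let n = 1u64;
    // Old order: both 8N jobs start together -> peak 16N.
    assert_eq!(peak_memory(&[8 * n, 8 * n, n, n, n, n, n, n, n, n], 2), 16);
    // Large jobs one after another, smalls in between -> peak 9N.
    assert_eq!(peak_memory(&[8 * n, n, n, n, n, n, n, n, n, 8 * n], 2), 9);
}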
