
Support PGO in the compiler's build system #79562

Closed
michaelwoerister opened this issue Nov 30, 2020 · 25 comments
Labels: C-discussion, T-bootstrap, T-compiler, T-infra, WG-compiler-performance

Comments

@michaelwoerister (Member)

One of the first steps in bringing PGO to rustc is adding support for it to the compiler's build system. Concretely, this means:

  • support building an instrumented compiler (-Cprofile-generate phase), and
  • support building a compiler that makes use of profile data (-Cprofile-use phase)

Optionally, the build system could also

  • support profile data collection (by running the instrumented compiler through a standard set of workloads)

In rust-lang/cargo#7618 (comment), @luser suggests that the three phases should be operable separately. I think that makes a lot of sense since on CI we'll probably want to either only run the "profile-use" phase with downloaded profile data, or only the "profile-generate" phase that uploads the profile data.

I suggest adding the following new config.toml settings:

  • llvm.profile-generate = ( "<path>" | "default-path" )
  • llvm.profile-use = ( "<path>" | "default-path" | "download-profdata" )
  • rust.profile-generate = ( "<path>" | "default-path" )
  • rust.profile-use = ( "<path>" | "default-path" | "download-profdata" )
  • rust.profile-collect = (true | false)

The paths specify the argument to the -Cprofile-generate and -Cprofile-use flags. default-path means some known location within the build directory. download-profdata means that the build system will try to find and download the most recent profile data available for the given commit. If rust.profile-collect is true, the build system will run a standard data-collection scenario with the instrumented compiler (or will error if neither of the profile-generate flags is set).
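
To make this concrete, here is a sketch of what the "generate" side might look like (the profile-* options are the ones proposed above and do not exist yet; placing them under the [llvm]/[rust] tables is an assumption):

# Sketch only: the profile-* options below are the proposal from this issue,
# not settings that bootstrap understands today.
cat > config.toml <<'EOF'
[llvm]
profile-generate = "default-path"   # build an instrumented LLVM

[rust]
profile-generate = "default-path"   # build an instrumented rustc
profile-collect = true              # run the standard workloads to produce profile data
EOF

./x.py build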

Note that this setup has the following implications:

  • One can turn PGO on or off independently for LLVM and Rust.
  • There is no single command that runs all three phases at once. If one wants to build a PGOed compiler completely locally, one has to run something like:
    ./configure \
      --llvm.profile-generate=default-path \
      --rust.profile-generate=default-path \
      --rust.profile-collect=true
    
    ./x.py build
    
    ./x.py clean # including LLVM?
    
    ./configure \
      --llvm.profile-use=default-path \
      --rust.profile-use=default-path
    
    ./x.py build
  • The "default" scenario where "generate" and "use" phases are not run on the same machine can be done with a single ./configure && ./x.py build though.

I'd like to hear if somebody can come up with a different, better setup. Maybe something that allows the entire three-phase build with a single ./x.py pgo-build command?

@michaelwoerister added the T-bootstrap, C-discussion, T-compiler, T-release, and T-infra labels on Nov 30, 2020
@jyn514 added the WG-compiler-performance label on Nov 30, 2020
@Mark-Simulacrum (Member)

I don't think we need to optimize for full collection + use; I'm not even sure how many people would use it if it were available -- it's just not a common thing, I expect.

One question is that for the Rust part of the PGO, I imagine we don't want to collect or use the PGO data for the first round of compilation, right? Since that'll have differing codegen from the second round (potentially even using a different LLVM version). That's fine -- rustbuild can handle that for us -- but something to keep in mind.

If we don't support simultaneously using and generating PGO data, we should probably check that at startup, but I'm otherwise not all that worried.

@Mark-Simulacrum removed the T-release label on Nov 30, 2020
@michaelwoerister (Member, Author)

One question is that for the Rust part of the PGO, I imagine we don't want to collect or use the PGO data for the first round of compilation, right? Since that'll have differing codegen from the second round (potentially even using a different LLVM version). That's fine -- rustbuild can handle that for us -- but something to keep in mind.

Hm, I wonder if it would be possible to make stage 1 instrumented, use it to collect profile data, and then make stage 2 use that data. The profdata format looks rather robust, so that might even work for different LLVM versions. But I would not want to rely on it.

Are our release artifacts stage 2 builds? If so, the instrumented compiler should also be a stage 2 build.

@Mark-Simulacrum (Member)

Right, stage 2 is basically the artifacts we'd be using for this, though they're built as "stage 1 artifacts" -- the second rustc binary that gets produced. I'm not too worried about getting the language precise here; I think we have agreement on what we need to profile :)

I expect that we can build stage 0 artifacts as normal -- no need for special PGO handling, though eventually the beta they're built with will be PGO'd so that should give a nice speed up. Then the second round of compilation (stage 1 artifacts) is built, instrumented, we run our workloads to gather data, and then rebuild the second round of compilation using that data. This gives us a single compiler rebuild in terms of overhead.

Thinking along those lines, perhaps the right thing is actually for rustbuild to always do this -- i.e., the pgo-build option you suggest is not actually necessary. Specifically, rustbuild would collect data if the profile-generate options are given; with just profile-use it would only consume existing data; and with both set it would do the full generate-collect-use cycle.
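
Spelled out as commands, that behaviour might look roughly like this (a sketch; the flag spelling mirrors the configure sketch in the OP, and none of it is implemented yet):

# Both profile-generate and profile-use set: a single build does the full cycle,
# i.e. build stage0 normally, build an instrumented rustc, run the standard
# workloads to collect profile data, then rebuild the dist rustc with that data.
./configure \
  --rust.profile-generate=default-path \
  --rust.profile-use=default-path
./x.py build

# Only profile-use set: skip instrumentation and collection, and just consume
# existing (e.g. downloaded) profile data.
./configure --rust.profile-use=download-profdata
./x.py build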

@luser (Contributor) commented Nov 30, 2020

Hm, I wonder if it would be possible to make stage 1 instrumented, use it to collect profile data, and then make stage 2 use that data. The profdata format looks rather robust, so that might even work for different LLVM versions. But I would not want to rely on it.

If this works on a technical level with the profile data format, then this sounds like a pretty great idea. You've already got a multi-phase build, so reusing one of those phases to generate the profile data would be a clever trick! Per my comment elsewhere, this would impact the reproducibility of the builds. To preserve that, it would be nice to publish the profile data as a build artifact and provide rustbuild support for producing an optimized build given an existing set of profile data.

@michaelwoerister (Member, Author)

I expect that we can build stage 0 artifacts as normal -- no need for special PGO handling, though eventually the beta they're built with will be PGO'd so that should give a nice speed up. Then the second round of compilation (stage 1 artifacts) is built, instrumented, we run our workloads to gather data, and then rebuild the second round of compilation using that data. This gives us a single compiler rebuild in terms of overhead.

Let me see if I understand this correctly: Right now we have:

beta --> stage0 --> stage1
                    ^^^^^^
                     dist

that would become:

beta --> stage0 --> stage1 --> stage2
                    ^^^^^^     ^^^^^^
                instrumented    PGO-ed dist

So, PGO would mean one additional build, right?

@michaelwoerister (Member, Author) commented Dec 3, 2020

I'm wondering if that isn't too complicated, since we'd never actually use that build in CI. What I suggested originally would look like:

# case 1: generate profile data

beta --> stage0 --> stage1 =run=> profdata (for uploading)
                    ^^^^^^
               dist (instrumented)


# case 2: use profile data (downloaded, from some previous build)

beta --> stage0 --> stage1
                    ^^^^^^
               dist (profile-use)

Cases (1) and (2) would always run in parallel, and case (2) would use profile data from a previous run of case (1). The advantage is that we don't have to worry about building LLVM twice.
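
As shell commands, the two CI cases might look roughly like this (a sketch; the flag spelling follows the configure sketch above, and how the profile data gets uploaded and downloaded is left open):

# Case 1: dist an instrumented compiler and publish the profile data it produces.
./configure \
  --llvm.profile-generate=default-path \
  --rust.profile-generate=default-path \
  --rust.profile-collect=true
./x.py dist
# upload the collected profile data as a build artifact for later builds

# Case 2: dist an optimized compiler, using profile data from a previous run of case 1.
./configure \
  --llvm.profile-use=download-profdata \
  --rust.profile-use=download-profdata
./x.py dist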

@michaelwoerister (Member, Author)

Another wrinkle here that I just realized is that LLVM is built only once -- so for the profile-generate case we would already be working with an instrumented LLVM when building the instrumented rustc. That should not be much of a problem though, since instrumentation can be disabled at runtime via an environment variable, if I remember correctly. Or we just collect profile data for LLVM twice. The corresponding .profraw files are automatically updated in place by the instrumentation runtime.
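
For what it's worth, my understanding is that the instrumentation runtime's output location is controlled by the LLVM_PROFILE_FILE environment variable, so even if the instrumented code keeps running, its profile output can be redirected or discarded per step. A sketch:

# Discard profile output from a step that shouldn't contribute data
# (the instrumented code still runs; only the .profraw output is dropped):
LLVM_PROFILE_FILE=/dev/null ./x.py build

# During actual collection, %p (process id) and %m (module signature) keep
# concurrent processes from clobbering each other's .profraw files:
LLVM_PROFILE_FILE="/tmp/llvm-pgo/rustc-%p-%m.profraw" ./x.py build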

@Mark-Simulacrum (Member)

I think your layout makes sense for the case where someone sets the profile-use config but not profile-generate. If both are set, I think it makes sense to do the layout in #79562 (comment). If sccache does not properly handle PGO flags for caching, then we'll need to leave LLVM out of scope for now; otherwise I expect there should be no problem with building LLVM twice for the immediate-reuse case. I am still a bit skeptical about using old profile data with respect to perf.rlo, but I think we can tackle that problem when we get to it.

@luser (Contributor) commented Dec 14, 2020

sccache should handle clang's PGO flags fine, since Firefox builds use them. However, it won't cache the optimized -fprofile-instr-use compile, only the -fprofile-instr-generate one, so that may not help you as much as you'd like. It's possible that we could add support for caching those compiles to sccache by including a hash of the profile data in the cache key. We never considered that previously because the Firefox build always runs a profiling run for every PGO build, so the profile data would never be the same for two different builds anyway.
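
The idea, roughly (a sketch, not existing sccache behaviour):

# Make the cache key for a -fprofile-instr-use compile depend on the profile
# data itself, so the cached object is only reused when the same profdata is
# supplied again:
sha256sum merged.profdata   # fold this hash into the sccache cache key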

Interestingly, reading the clang documentation reminded me that LLVM added support for using the output of a sampling profiler as the profiling input for PGO. This is likely out of scope as it's not interoperable with the instrumentation-based profile data and thus wouldn't fit neatly into what's already been implemented in rustc, but it's an interesting thing to consider since it doesn't require building instrumented binaries.

@michaelwoerister (Member, Author)

I am still a bit skeptical about using old profile data with respect to perf.rlo, but I think we can tackle that problem when we get to it.

I don't think we should use the PGOed compiler for perf.rlo. It would just be confusing otherwise.

@michaelwoerister (Member, Author)

However, it won't cache the optimized -fprofile-instr-use compile

Oh, good to know! Yes, support for that would have to be added to sccache. The -fprofile-instr-use case would be the more common one, I think.

@michaelwoerister (Member, Author)

I think your layout makes sense for the case where someone sets the profile-use config but not profile-generate. If both are set, I think it makes sense to do the layout in #79562 (comment).

I think I'd prefer if we just didn't support doing a PGOed build with a single ./x.py invocation. It's a rare use case and it should be easy enough to provide a helper script somewhere that executes the steps described in the OP, right?

@Mark-Simulacrum (Member) commented Dec 15, 2020

While working on #80033, I ran into "LLVM Profile Warning: Unable to track new values: Running out of static counters. Consider using option -mllvm -vp-counters-per-site=<n> to allocate more value profile counters at compile time." I ended up using 4 for the value, which seems to have fixed this for now, but it would be good to know what exactly we're specifying here (and what the tradeoffs are); ideally I'd want that warning to be an error, I imagine?

@luser (Contributor) commented Dec 15, 2020

I don't think we should use the PGOed compiler for perf.rlo. It would just be confusing otherwise.

I owe you a longer-form writeup, but for Firefox we had perf testing on both PGO and non-PGO builds, and it was possible to compare the results between them. This was really useful because the PGO builds were what we were actually shipping to users, but being able to compare between the two was good and also allowed for quicker iteration when the PGO perf difference didn't matter. (IIRC enabling PGO builds was opt-in for Firefox try pushes.)

@michaelwoerister (Member, Author)

@luser Yes, it would be great if we could benchmark both PGO and non-PGO builds. If we can afford that (which is something @Mark-Simulacrum would know better than me) then all the better. If we had to choose one of the two, I'd opt for benchmarking just non-PGO builds since they would not contain the "random" effects of partially out-dated profiling data.

@michaelwoerister (Member, Author) commented Dec 17, 2020

While working on #80033, I ran into "LLVM Profile Warning: Unable to track new values: Running out of static counters. Consider using option -mllvm -vp-counters-per-site=<n> to allocate more value profile counters at compile time." I ended up using 4 for the value, which seems to have fixed this for now, but it would be good to know what exactly we're specifying here (and what the tradeoffs are); ideally I'd want that warning to be an error, I imagine?

This is what I've been able to find out so far: "value profiling" allows recording some of the actual values that a variable takes during execution. To do that, the profiling runtime has to reserve some space in the profdata. Apparently one does not need many of these counters in the common case:

https://github.com/llvm/llvm-project/blob/37f99a56065209627020428cecb78f55dfa90580/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp#L81-L84

It looks like an overflow here can safely be ignored. It might just lead to less accurate profile data:

https://github.com/llvm/llvm-project/blob/e7a3c4c11e84ba99c3682ae6cf20c398f16cf3f5/compiler-rt/lib/profile/InstrProfilingValue.c#L210-L212

But it also looks like increasing the number of counters via -vp-counters-per-site should not have much of an adverse effect either, except for a little more memory being used by the instrumented binary.
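
For reference, since -vp-counters-per-site is an LLVM-internal option, it has to be passed through whichever compiler drives the instrumented build; roughly like this (foo.c and foo.rs are just hypothetical inputs, and the exact plumbing in rustbuild may differ):

# Via clang (e.g. for the instrumented LLVM build), as an -mllvm option:
clang -O2 -fprofile-generate -mllvm -vp-counters-per-site=4 -c foo.c -o foo.o

# Via rustc (for the instrumented rustc build), through -Cllvm-args:
rustc -O -Cprofile-generate=/tmp/pgo-data -Cllvm-args=-vp-counters-per-site=4 foo.rs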

@Mark-Simulacrum (Member)

Ok, useful to know. I don't feel like it's critical that we avoid the warnings, but there seems to be no significant disadvantage to bumping slightly, so I'll leave that in the code for now.

@michaelwoerister (Member, Author)

@luser What's the best way of getting support for -fprofile-use into sccache? Should I open an issue? Or try to implement it myself and make a PR?

@luser (Contributor) commented Jan 25, 2021

Opening an issue would be a good start, but the project is a bit under-maintained right now, so it's likely that doing the work yourself would be the quickest path to success. I'd be happy to give you pointers on how to implement it. I'm on the Rust Discord as tedmielczarek; feel free to drop me a line.

@michaelwoerister (Member, Author)

FYI: I've made a PR to sccache that implements the missing functionality. I haven't heard back from the maintainers yet.

@FilipAndersson245

Any update on this issue?

@michaelwoerister (Member, Author)

The required sccache changes have not been merged yet.

@Mark-Simulacrum (Member)

We ended up forgoing the caching for the time being; LLVM and rustc are both PGO'd on x86_64-unknown-linux-gnu as of #88069.

The sccache changes would still be helpful, though.

@jyn514 (Member) commented Jul 11, 2022

The sccache changes have been merged.

@Mark-Simulacrum (Member)

I think at this point this issue can be closed; we've long had PGO for x86_64-unknown-linux-gnu, and are in the process of adding it to Windows (#96978).

#79442 is a better issue here regardless, with a much more complete task list.
