Use sccache for builds #1724

Closed
jrmuizel opened this issue Sep 19, 2017 · 21 comments

Comments

@jrmuizel
Collaborator

mozilla/sccache#179 has some instructions.

When I last looked at these I believe I couldn't figure out how exactly to communicate the encrypted AWS key to sccache. The details are fuzzy though.

@jrmuizel
Collaborator Author

In talking with kats I realized that we might be better off just trying to get sccache to work with taskcluster directly instead of getting it working with travis first.

@jrmuizel
Collaborator Author

Also:
jrmuizel: you can set env vars for sccache for AWS creds (assuming your CI won't leak them in logs), or you could have a little script to rewrite the JSON secret into ~/.aws/credentials format
ted: https://github.com/mozilla/sccache/blob/master/src/simples3/credential.rs is the code that sccache uses to find AWS creds (forked from rusoto)

and https://dxr.mozilla.org/mozilla-central/source/taskcluster/scripts/builder/build-linux.sh#55
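For reference, the "little script" idea could look roughly like the sketch below. It is only an illustration: the JSON key names (`aws_access_key_id`, `aws_secret_access_key`) are an assumption about how the secret is stored, and the function name is made up. Only the `[default]` profile layout of `~/.aws/credentials` is the standard one.

```python
import json


def secret_to_credentials(secret_json, profile="default"):
    """Render a JSON secret as the text of an ~/.aws/credentials file.

    Assumes (hypothetically) that the secret is a JSON object with the keys
    "aws_access_key_id" and "aws_secret_access_key".
    """
    secret = json.loads(secret_json)
    lines = [
        f"[{profile}]",
        f"aws_access_key_id = {secret['aws_access_key_id']}",
        f"aws_secret_access_key = {secret['aws_secret_access_key']}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; a real job would read the secret from the CI
    # secrets store and write the result to ~/.aws/credentials.
    example = json.dumps(
        {"aws_access_key_id": "EXAMPLEKEY", "aws_secret_access_key": "EXAMPLESECRET"}
    )
    print(secret_to_credentials(example), end="")
```

The script never echoes the secret unless it is run directly with the placeholder values, which matters given the caveat above about CI logs.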

@glennw
Member

glennw commented Sep 20, 2017

@jrmuizel @metajack What is the likelihood of getting someone at Mozilla who works on this stuff to look into this officially (running CI on taskcluster)? I certainly don't know enough about that stuff to look into it.

We haven't been able to merge anything for ~3 days with the latest travis issues, which is following another fortnight of similar issues...

@jrmuizel
Collaborator Author

@staktrace is looking at this a bit right now and we can probably get @luser to help before he goes on PTO next week.

@staktrace
Contributor

At the moment I only have basic linux64 taskcluster integration working. There are two steps involved:

  1. Install the integration github app on the repo at https://github.com/apps/taskcluster (I have this app installed for my fork, staktrace/webrender)
  2. Apply the change from my taskcluster-ci branch - this just adds the .taskcluster.yml file and tweaks one of the reftest fuzz numbers.

After doing this, each PR update or push will trigger the taskcluster job. You can see a sample one (from my taskcluster-ci commit) at https://tools.taskcluster.net/groups/GNsTmjKaQyeF5v623NM6eQ - it runs the debug and release commands on whatever the current stable rust version is. @jrmuizel said that for now we can just leave the "nightly" rust commands on travis.

If we want to pin to a specific rust version, we can update the docker image to have that rust version preinstalled, and remove the rustup commands from the .taskcluster.yml script.

Next steps are figuring out how to hook up sccache and getting OS X jobs running. I think OS X is probably more important at this point, since with that we can start using taskcluster "in production", and then work on getting it faster with sccache.

I'd like to merge my taskcluster-ci branch as soon as possible but I'm not sure what effect that will have on bors/homu and the regular workflow. We should probably set up a "maintenance window" or something where we can do the merge, and ensure things are working or roll back if they aren't. Or if there is a test repo somewhere with bors/homu we can use that to try this out.

@metajack

sccache works fine, so I'm not sure what in particular is causing problems, or if your request is actually related to sccache builds or just builds being messed up in general. It's pretty easy for us to move a particular repo over to our buildbot instances if that is what is needed.

@staktrace
Contributor

I talked to taskcluster folks and I have steps on setting up OS X worker machines for taskcluster, so that we can run our own CI farm. It's fairly straightforward and I have it running using my laptop as a test. I think we can rustle up some OS X machines in the Toronto office or hosted remotely somewhere and use them as dedicated CI machines for webrender.

As a bonus, since the OS X setup doesn't use docker, it doesn't reset the machine state after each job is run. This means even if we just use a local sccache we should get a good speedup.

@jrmuizel
Collaborator Author

The easiest thing to do is probably just get a dedicated mac mini or two from macstadium and expense it.

@staktrace
Contributor

I set up the worker on the mac mini that jrmuizel rented from macstadium. It seems to be working ok.

Next step to move this along is to try it on servo/webrender instead of just my clone of the repo. Whoever owns servo/webrender needs to install the github-taskcluster integration tool from https://github.com/apps/taskcluster.

Then we need to get :jonasfj to add the necessary scopes to this repo, so that it can spawn the "kats-webrender-ci-osx" type worker via the "localprovisioner" TC provisioner.

And then after that I can make a PR from my branch with the .taskcluster.yml file and see how bors/homu deal with it.

@glennw
Member

glennw commented Sep 21, 2017

@metajack @larsbergstrom Is ^ something you can help with (enabling the TC tool on the WR repo)?

@metajack

Should be done.

@staktrace
Contributor

Thanks. I got :jonasfj to add the scopes as well so we should be good to try the PR. I'll submit that shortly.

While I was waiting I installed sccache on the OS X worker but I ran into a rustc internal compiler error when building WR with it. I'll investigate that more but for now let's do this without sccache.

@luser
Contributor

luser commented Sep 22, 2017

We've seen that error before elsewhere:

thread 'rustc' panicked at 'failed to acquire jobserver token: Error { repr: Os { code: 35, message: "Resource temporarily unavailable" } }', src/libcore/result.rs:860:4

I think this is because cargo creates a make-style jobserver now, and it will pass it down to rustc (for use when you use codegen-units=N). There's some weird interaction here with how the jobserver fd gets passed down and I don't quite understand it.
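To make the error message a bit more concrete, here is a small sketch of how a make-style jobserver behaves. It is not cargo's or rustc's actual code, and it is only a guess at the failure mode: the jobserver is a pipe preloaded with one-byte tokens, and if the read end is in non-blocking mode, asking for a token when none is left fails with EAGAIN ("Resource temporarily unavailable", errno 35 on macOS) instead of waiting. All function names here are made up for the illustration.

```python
import os


def make_jobserver(tokens):
    """Create a pipe preloaded with `tokens` one-byte job tokens."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"+" * tokens)
    return read_fd, write_fd


def try_acquire(read_fd):
    """Take one token, or return None if the read would block (EAGAIN)."""
    try:
        return os.read(read_fd, 1)
    except BlockingIOError:
        # This is the "Resource temporarily unavailable" case.
        return None


def release(write_fd, token):
    """Hand a token back so another job can start."""
    os.write(write_fd, token)


def demo():
    read_fd, write_fd = make_jobserver(tokens=2)
    # Assumption for the sketch: something has put the read end into
    # non-blocking mode, so an empty pipe errors out instead of waiting.
    os.set_blocking(read_fd, False)
    first = try_acquire(read_fd)
    second = try_acquire(read_fd)
    third = try_acquire(read_fd)   # no tokens left
    release(write_fd, first)
    fourth = try_acquire(read_fd)  # succeeds again after the release
    os.close(read_fd)
    os.close(write_fd)
    return first, second, third, fourth
```

With a blocking read end the third acquire would simply wait for a token to be released, which is the behaviour a jobserver client normally relies on.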

@staktrace
Contributor

Yeah, I just commented in rust-lang/rust#42867 which appears to be tracking this problem.

@staktrace
Contributor

Quick update: I made PR #1746 to get the .taskcluster.yml file merged into the webrender repo. By default this will run the CI jobs via taskcluster for PRs by "collaborators" and for pushes. (We need to set allowPullRequests: public to make it run on PRs by anybody, see documentation). I did this intentionally since until we get everything hooked up it's not too useful to run the jobs on every random PR.

I looked at the bors and homu code/docs to figure out exactly what it is they do and what integration we need there. It seems like when we run CI with travis, travis reports the result to the bots via webhooks. AFAICT taskcluster-github doesn't have webhook capability yet so we can either request that and wait for it, or just make the CI command itself call out to a webhook and report the success/failure.
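The second option (the CI command reporting its own result) could be sketched roughly as follows. The payload shape, the function names and the endpoint URL are all hypothetical; this is not the format homu or travis actually use, just an illustration of wrapping the build command and POSTing a success/failure status.

```python
import json
import subprocess
import sys
import urllib.request


def run_and_summarize(command, commit):
    """Run the CI command and describe its outcome as a dict.

    The keys used here are a made-up payload shape, not a real bot API.
    """
    result = subprocess.run(command)
    state = "success" if result.returncode == 0 else "failure"
    return {"commit": commit, "state": state, "command": " ".join(command)}


def make_request(url, payload):
    """Build (but do not send) a JSON POST request carrying the payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical usage: ci_report.py <webhook-url> <commit> <command...>
    url, commit, command = sys.argv[1], sys.argv[2], sys.argv[3:]
    summary = run_and_summarize(command, commit)
    urllib.request.urlopen(make_request(url, summary))
    sys.exit(0 if summary["state"] == "success" else 1)
```

One drawback of this approach is that a job that is killed or times out never gets to report anything, so the receiving side would still need some notion of a timeout.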

The other important thing is that homu right now runs tests on the merge commit from the PR and latest master. So that means we need some way of triggering the taskcluster run from homu, the same way it triggers travis/appveyor runs. I haven't looked into whether this is possible yet; it might be a feature that we need to request of the taskcluster-github integration tool. Having an API to do this would also allow us to make things like retry requests work. Right now retrying has to be done manually via the taskcluster task page, and even then it won't update the final status of the build (I filed bug 1402136 for this).

And finally, one more thing that would be nice is if taskcluster canceled obsolete jobs if e.g. somebody pushes new commits into a PR. It doesn't do this yet and it's an optimization but one that would be good to have. I filed bug 1402884 for this.

bors-servo pushed a commit that referenced this issue Sep 25, 2017
Add a .taskcluster.yml file to run CI using taskcluster

This is a test PR to see (a) whether taskcluster correctly picks up the PR and schedules the CI jobs and (b) how bors/homu deal with this extra CI job.

This is related to #1724

@glennw
Member

glennw commented Sep 26, 2017

Wow, nice work. The TC builds are so fast compared to how long the normal builds take!

@staktrace
Contributor

> We need to set allowPullRequests: public to make it run on PRs by anybody, see documentation.

This is done now, in #1789.

> The other important thing is that homu right now runs tests on the merge commit from the PR and latest master.

I realized that there's no special magic needed to make this work. It looks like the merge head is pushed as the auto branch on the repo, and since it's a branch update the taskcluster CI runs on it automatically. So that's one less thing that needs to be done.

I think really the next thing we want to do here is set up a webhook equivalent for taskcluster, so that it can notify the bots on success/failure. And then have the bots accept travis || taskcluster as success conditions for landing the merge.

@staktrace
Contributor

#1871 adds "routes" to the .taskcluster.yml file which will allow us to listen for task-completion notifications. We would need code running somewhere that listens for the four tasks for a particular PR to complete successfully and uses that as a success condition for landing the PR. This can either be added to homu directly or run as a separate service that simulates a travis webhook, or something. With the mozillapulse python library doing most of the work it shouldn't be too hard to glue things together.
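The bookkeeping part of such a listener could look something like the sketch below. It deliberately leaves out the pulse connection itself (no mozillapulse calls are shown), and the four task names are hypothetical placeholders. It only shows the idea of collecting per-task results for one commit and turning them into a single success condition.

```python
from dataclasses import dataclass, field

# Hypothetical task names; the real ones come from .taskcluster.yml.
EXPECTED_TASKS = frozenset(
    {"linux-debug", "linux-release", "osx-debug", "osx-release"}
)


@dataclass
class CommitStatus:
    """Tracks task-completion notifications for a single commit."""

    expected: frozenset = EXPECTED_TASKS
    results: dict = field(default_factory=dict)

    def record(self, task_name, succeeded):
        """Store the outcome of one completed task."""
        if task_name not in self.expected:
            raise ValueError(f"unexpected task: {task_name}")
        self.results[task_name] = bool(succeeded)

    def verdict(self):
        """Return "failure", "success" or "pending" for the commit."""
        if any(not ok for ok in self.results.values()):
            return "failure"
        if set(self.results) == set(self.expected):
            return "success"
        return "pending"
```

A listener would call `record` from its message callback and report to the bot once `verdict` stops returning "pending".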

@glennw
Member

glennw commented Jan 16, 2018

We are running CI on TaskCluster now. Do we still need this open @staktrace @jrmuizel ?

@staktrace
Contributor

I think we can close it. Until rust-lang/rust#42867 is solved we probably won't get sccache to work on the OS X builder anyway and it might not be worth the effort unless we start building up a backlog again.

@glennw glennw closed this as completed Jan 16, 2018