High rate of spurious CI failures on macOS machines #14459

Closed
fmeum opened this issue Dec 21, 2021 · 10 comments

@fmeum
Collaborator

fmeum commented Dec 21, 2021

Description of the problem / feature request:

I run daily CI checks in my rulesets' GitHub Actions pipeline. The macOS pipelines, running on macos-latest, fail every few days with two kinds of spurious failures that I have never been able to reproduce locally:

Issue 1:

Starting local Bazel server and connecting to it...
... still trying to connect to local Bazel server after 10 seconds ...
... still trying to connect to local Bazel server after 20 seconds ...
... still trying to connect to local Bazel server after 30 seconds ...
... still trying to connect to local Bazel server after 40 seconds ...
... still trying to connect to local Bazel server after 50 seconds ...
... still trying to connect to local Bazel server after 60 seconds ...
... still trying to connect to local Bazel server after 70 seconds ...
... still trying to connect to local Bazel server after 80 seconds ...
... still trying to connect to local Bazel server after 90 seconds ...
... still trying to connect to local Bazel server after 100 seconds ...
... still trying to connect to local Bazel server after 110 seconds ...
FATAL: couldn't connect to server (1753) after 120 seconds.
Error: Process completed with exit code 37.

Issue 2:

ERROR: /Users/runner/work/rules_jni/rules_jni/tests/libjvm_stub/BUILD.bazel:116:12: Target '//libjvm_stub:HelloFromJava' depends on toolchain '@local_config_cc//:cc-compiler-darwin', which cannot be found: error loading package '@local_config_cc//': cannot load '@local_config_cc_toolchains//:osx_archs.bzl': no such file'
ERROR: Analysis of target '//libjvm_stub:HelloFromJava' failed; build aborted: Analysis failed

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I have no way to consistently reproduce this issue, but it happens every few days on rules_jni's CI schedule.

What operating system are you running Bazel on?

macOS 10.15

What's the output of bazel info release?

Over time, I have hit the issues on 4.2.2, 5.0.0rc3 and various last_green builds.

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

I can make arbitrary changes to the CI config if that helps to gather more information on the cause of these issues.
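
One low-effort way to gather more information is to tally which of the two failure kinds each failed run hit. The following is a minimal sketch, assuming the CI logs have been downloaded as plain text. The function name `classify_log` and the labels `server_connect_timeout` and `missing_osx_archs` are made up for illustration; only the matched strings come from the logs quoted above.

```python
import re
from collections import Counter

# Signatures taken from the two failure logs quoted in this issue.
# The dictionary keys are hypothetical labels, not Bazel terminology.
SIGNATURES = {
    "server_connect_timeout": re.compile(
        r"FATAL: couldn't connect to server \(\d+\) after \d+ seconds"
    ),
    "missing_osx_archs": re.compile(
        r"cannot load '@local_config_cc_toolchains//:osx_archs\.bzl'"
    ),
}


def classify_log(text):
    """Count how many lines of a CI log match each known failure signature."""
    counts = Counter()
    for line in text.splitlines():
        for kind, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[kind] += 1
    return counts
```

Running `classify_log` over the logs of every failed scheduled run would show whether the two issues tend to occur on the same runs or on separate ones, which might hint at whether they share a cause.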

@thii
Member

thii commented Dec 21, 2021

I believe it was waiting for xcode-locator, which can take a very long time if the machine has many Xcode versions installed.

Try this hack: https://www.smileykeith.com/2021/03/08/locking-xcode-in-bazel/

@fmeum
Collaborator Author

fmeum commented Dec 21, 2021

Thanks, I will try that. Do you think that this could also be related to Issue 2?

@thii
Member

thii commented Dec 21, 2021

It could be, since that file is only created if Xcode is successfully detected.

repository_ctx.symlink(paths["@bazel_tools//tools/osx/crosstool:osx_archs.bzl"], "osx_archs.bzl")

@sventiffe
Contributor

@fmeum did @thii's suggestion fix the issue?

@fmeum
Collaborator Author

fmeum commented Jan 14, 2022

I just checked again and found no new failures in the two weeks since I started using a checked-in xcode_version_config. I will close this now as the workaround seems to have worked and open a new issue if it should show up again.

fmeum closed this as completed Jan 14, 2022
@fmeum
Collaborator Author

fmeum commented Jan 15, 2022

@sventiffe Reopening since the exact same issue appeared again today over at rules_jni.

I started pinning with --xcode_version_config in fmeum/rules_jni@b502180. Today, the runs on macos-latest with Bazel 4.0.0 and Bazel 5.0.0rc3 failed. I attached the CI logs; the failing step is here, with the following message (in the case of Bazel 5.0.0rc3 with bzlmod):

2022/01/15 00:35:14 Downloading https://releases.bazel.build/4.0.0/release/bazel-4.0.0-darwin-x86_64...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 0e7ba86d-1a19-4b3a-a770-f0c3d58d166f
Loading: 
Loading: 0 packages loaded
Loading: 11 packages loaded
    currently loading: analysis/cc_jni_library ... (2 packages)
Loading: 11 packages loaded
    currently loading: analysis/cc_jni_library ... (2 packages)
Analyzing: 21 targets (13 packages loaded, 0 targets configured)
Analyzing: 21 targets (15 packages loaded, 1 target configured)
Analyzing: 21 targets (18 packages loaded, 8 targets configured)
Analyzing: 21 targets (18 packages loaded, 8 targets configured)
Analyzing: 21 targets (19 packages loaded, 8 targets configured)
Analyzing: 21 targets (21 packages loaded, 9 targets configured)
Analyzing: 21 targets (21 packages loaded, 9 targets configured)
Analyzing: 21 targets (21 packages loaded, 9 targets configured)
Analyzing: 21 targets (21 packages loaded, 9 targets configured)
Analyzing: 21 targets (21 packages loaded, 9 targets configured)
Analyzing: 21 targets (38 packages loaded, 206 targets configured)
Analyzing: 21 targets (38 packages loaded, 206 targets configured)
Analyzing: 21 targets (38 packages loaded, 206 targets configured)
ERROR: /Users/runner/work/rules_jni/rules_jni/tests/native_loader/src/test/java/com/example/math/BUILD.bazel:3:17: Target '//native_loader/src/test/java/com/example/math:math_remove_this_part_' depends on toolchain '@local_config_cc//:cc-compiler-darwin', which cannot be found: error loading package '@local_config_cc//': cannot load '@local_config_cc_toolchains//:osx_archs.bzl': no such file'
ERROR: Analysis of target '//native_loader/src/test/java/com/example/math:math' failed; build aborted: Analysis failed
INFO: Elapsed time: 216.940s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (39 packages loaded, 206 targets configured)
ERROR: Couldn't start the build. Unable to run tests
FAILED: Build did NOT complete successfully (39 packages loaded, 206 targets configured)
Error: Process completed with exit code 1.

@thii Are there any additional diagnostics that I could enable that would help diagnose the underlying issue?

@fmeum
Collaborator Author

fmeum commented Feb 10, 2022

@keith I just encountered another failure and attached the logs. They include the content of the BUILD file, which turns out not to be empty.

@keith
Member

keith commented Feb 10, 2022

That file actually looks pretty good to me. Yours didn't fail on the osx_archs import; it failed because that file doesn't contain darwin: https://github.com/bazelbuild/bazel/blob/master/tools/osx/crosstool/osx_archs.bzl

I'm not sure where that's coming from, but it should probably be darwin_x86_64 instead.

@fmeum
Collaborator Author

fmeum commented Feb 11, 2022

Nice catch. I think I found the root cause and submitted a fix as #14796.

@keith
Member

keith commented Feb 11, 2022

Nice, I'll keep an eye on that. My assumption for why this is related to the flaky CI case is that the toolchain only falls back to this codepath on macOS when Xcode cannot be found, which could happen if the discovery logic times out. This would likely work around that issue if folks need full Xcode: https://www.smileykeith.com/2021/03/08/locking-xcode-in-bazel/

ckolli5 added a commit that referenced this issue Jun 22, 2022
Previously, if the xcode_locator failed and cc_autoconf_toolchain used
the non-Xcode C++ toolchain as a fallback, its reference to
`@local_config_cc//:cc-compiler-darwin`, where darwin is the legacy cpu
value for x86_64 macOS, would be invalid.

Fixes #14459

Closes #14796.

PiperOrigin-RevId: 451860477
Change-Id: Iec115f600ebb7ac0786b2169276d25e3ff5d54bf

Co-authored-by: Fabian Meumertzheim <[email protected]>