
bazel 5.0.0rc3 crashes with local and remote actions being done #14433

Closed
philsc opened this issue Dec 15, 2021 · 40 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@philsc
Contributor

philsc commented Dec 15, 2021

Description of the problem / feature request:

I'm trying out 5.0.0rc3 in our CI environment and saw the following crash:

(11:09:44) INFO: Caught InterruptedException from ExecException for remote branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o, which may cause a crash.
--
  | (11:10:01) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
  | at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.waitBranches(DynamicSpawnStrategy.java:345)
  | at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.exec(DynamicSpawnStrategy.java:733)
  | at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
  | at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:68)
  | at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1430)
  | at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
  | at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
  | at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
  | at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
  | at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
  | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  | at java.base/java.lang.Thread.run(Unknown Source)
  | (11:10:03) Failed with return code 37.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This is just setting .bazelversion to 5.0.0rc3 and running it on our entire build. This involves a remote buildfarm cluster.
I don't know of a "simple" way to reproduce this.

What operating system are you running Bazel on?

Everything's on x86_64 Linux.

What's the output of bazel info release?

$ bazel info release
Starting local Bazel server and connecting to it...
release 5.0.0rc3

Have you found anything relevant by searching the web?

I couldn't find anything pertinent.

Any other information, logs, or outputs that you want to share?

Not at this time.

@philsc philsc changed the title bazel 5.0.0rc3 crashes with local and remote actions being cancelled bazel 5.0.0rc3 crashes with local and remote actions being done Dec 15, 2021
@philsc
Contributor Author

philsc commented Dec 15, 2021

/cc @Wyverald

@coeuvre
Member

coeuvre commented Dec 15, 2021

Is this caused by manual interruptions during the build?

@coeuvre
Member

coeuvre commented Dec 15, 2021

/cc @larsrc-google

@Wyverald Wyverald added this to the Bazel 5.0 Release Blockers milestone Dec 15, 2021
@Wyverald Wyverald added under investigation team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Dec 15, 2021
@philsc
Contributor Author

philsc commented Dec 15, 2021

Is this caused by manual interruptions during the build?

I don't see anything in our logs that would indicate a manual interruption during the build.

EDIT: It's certainly possible that this is a fluke (or maybe a bug in our CI system), but I figured it's worth posting in case it isn't. I've never encountered this error before in our CI system.

@jlaxson
Contributor

jlaxson commented Dec 15, 2021

I'm seeing this also on darwin, definitely no user interruption.

@larsrc-google
Contributor

What's the most recent version you're not seeing this issue with?

@larsrc-google
Contributor

I submitted a fix for this kind of problem in fe1e25b, and the further changes to src/main/java/com/google/devtools/build/lib/dynamic/DynamicSpawnStrategy.java should not be able to cause this. However, 04754ef can increase the number of actions getting executed without a delay, which could make this kind of problem more common. Before that, 1962a59 made significant changes, but that was cherry-picked into 4.2.0.

If you want to help reproduce it, you can increase the chance of hitting this case by setting --experimental_local_execution_delay=0 and reducing the number of jobs to ~#CPUs, which forces a lot of races. If you're able to reproduce this, could you check the commits above?
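For example, a stress run along these lines should force a lot of races (the target pattern and --jobs value here are only placeholders; adjust them to your repo and machine):

$ bazel build //... --experimental_spawn_scheduler --experimental_local_execution_delay=0 --jobs=8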

@jlaxson Are you seeing that on a CI system or with local builds?
@philsc I'm slightly surprised that your CI uses dynamic execution. I would have expected purely remote execution.

@jlaxson
Contributor

jlaxson commented Dec 16, 2021

I'm seeing it locally on 5.0.0rc3; our CI is all remote with --remote_download_minimal. Anecdotally, I don't remember seeing it on rc2, but I'm not sure if I was using dynamic execution for much of that time.

But yesterday, I think I hit it about ten times in a row with maybe 1 dirty cpp file and a linking step. I'll dev on 5.0.0-pre.20210708.4 today and see what happens.

@coeuvre
Member

coeuvre commented Dec 22, 2021

@jlaxson Did you reproduce this error with 5.0.0rc2 or earlier versions?
@philsc Do you still see this error on your CI?

@philsc
Contributor Author

philsc commented Dec 22, 2021

@philsc Do you still see this error on your CI?

I'm unfortunately on vacation until the new year so I haven't been running this RC as frequently as I normally would, apologies. I won't have any more data points until I get back from vacation.

@coeuvre
Member

coeuvre commented Dec 22, 2021

No problem. Enjoy your vacation!

@brentleyjones
Contributor

This is marked as the last release blocker. @philsc were you able to get more data points?

@philsc
Contributor Author

philsc commented Jan 5, 2022

Apologies, I've been working on #14505 to see if there's anything on our side causing it. I'll see if I can still reproduce this error. Though if it's a race condition (for example) I may not see it again.

@philsc
Contributor Author

philsc commented Jan 5, 2022

Since it came out, I have run 5.0.0rc3 a total of 8 times so far on our repo (8 different days) and I encountered the crash 4 times. 2 of those times happened on the same build for different configurations.

In other words, it seems fairly prevalent, at least in our setup. Is there some more debugging information I can try gathering?

@larsrc-google
Contributor

Running with --experimental_debug_spawn_scheduler will output some more debugging information. There's also a bunch of information in the log; with that option there will be a lot. If you're able to post it here or mail it to me, I can certainly take a look.

Would you be able to run it with one of the earlier RCs?

Does the error message always say that the local is cancelled and the remote is done, or does it vary? Are there earlier warnings about null results? The only way I can think of this happening is if the remote branch returns null, which it's not supposed to.

@philsc
Contributor Author

philsc commented Jan 6, 2022

These are the 4 crashes. I trimmed other text because it was just regular build progress output.

(15:23:07) INFO: Caught InterruptedException from ExecException for remote branch of simulation/validation/_objs/safety_headway_measurement_validation_params/safety_headway_measurement_validation_params.pic.o, which may cause a crash.
--
  | (15:23:08) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of simulation/validation/_objs/safety_headway_measurement_validation_params/safety_headway_measurement_validation_params.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
(10:40:35) INFO: Caught InterruptedException from ExecException for remote branch of common/logging/python/log_uri_python.so, which may cause a crash.
--
  | (10:40:38) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of common/logging/python/log_uri_python.so completed. Local was cancelled and done and remote was not cancelled and done.
(11:33:00) INFO: Caught InterruptedException from ExecException for remote branch of planning/ml/libtf_scope_join.so, which may cause a crash.
--
  | (11:33:09) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of planning/ml/libtf_scope_join.so completed. Local was cancelled and done and remote was not cancelled and done.
(11:09:44) INFO: Caught InterruptedException from ExecException for remote branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o, which may cause a crash.
--
  | (11:10:01) FATAL: bazel crashed due to an internal error. Printing stack trace:
  | java.lang.AssertionError: Neither branch of sensors/tools/linear_range/_objs/LinearRangeControllerModule_static/linear_range_controller_module.pic.o completed. Local was cancelled and done and remote was not cancelled and done.

Would you be able to run it with one of the earlier RCs?

I cannot, because they immediately trigger analysis errors. RC3 is the first one I was able to run because of #14289 (i.e. it was fixed in RC3).

I will trigger some runs with --experimental_debug_spawn_scheduler.

@philsc
Contributor Author

philsc commented Jan 6, 2022

I had one run with --experimental_debug_spawn_scheduler that resulted in 3 crashes. Let me check with legal if I'm allowed to post the whole thing or e-mail it to you, @larsrc-google.

@jlaxson
Contributor

jlaxson commented Jan 6, 2022

I've been on release 5.0.0-pre.20210708.4 for a few weeks and have noticed nothing. I forget why I chose that one...

@larsrc-google
Contributor

Those are interesting errors. The message is actually misleading - we caught the InterruptedException from an ExecutionException, not an ExecException. I think this means the branch thread (remote execution) threw InterruptedException in some case other than being cancelled. That would put the underlying problem within the remote executor, which @coeuvre knows more about than me.

While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities.

I'm also curious why your CI is using dynamic execution, as Chi asked. I would have expected the CI to run all builds remotely.

@philsc
Contributor Author

philsc commented Jan 7, 2022

While waiting for legal to get back to you, just the outputs from those crashes might be useful in eliminating some possibilities.

Can you clarify what you mean by outputs? The stack traces are all the same, but let me double-check.
EDIT: Yeah, all the stack traces are identical to what I posted in the original report.

I'm also curious why your CI is using dynamic execution as Chi asked. I would have expected the CI to run all builds remotely.

I think @80degreeswest might be a better person to answer that. I haven't been involved with that aspect of our CI a whole lot.

@80degreeswest

We use --experimental_spawn_scheduler to account for times when small targets seem to run very slowly remotely but execute extremely fast locally (not any specific targets, and not 100% of the time).

@larsrc-google
Contributor

@philsc I mean other outputs when run with --experimental_debug_spawn_scheduler. A log, or at least the part of the log leading up to the crash, would of course be better. My two theories are 1) something inside the remote execution system interrupts something and the exception from that escapes, and 2) there's some corner case where the dynamic execution system manages to interrupt the remote thread without it being marked as cancelled. I'm leaning strongly towards the former, though.

@coeuvre
Member

coeuvre commented Jan 11, 2022

Still trying to reproduce this locally. I have a setup that builds Bazel itself with dynamic execution and ran it for 15+ hours, but no luck.

@coeuvre
Member

coeuvre commented Jan 11, 2022

@philsc Can you try with the flag --experimental_remote_grpc_log and share the gRPC log? I am curious about what the remote branch was doing when the error occurred.

@larsrc-google
Contributor

@jlaxson @philsc Are you two working on the same code base?

@Wyverald
Member

@philsc gentle ping. This is one of the last release blockers of 5.0 and I'd like to push out a final RC as soon as possible.

@philsc
Contributor Author

philsc commented Jan 12, 2022

@Wyverald , apologies. I'm back from traveling.
I'm not trying to hold up the release. I'm happy to continue investigating this if you decide to release 5.0.
I need to catch up on e-mail and figure out what legal is letting me do.

EDIT: I submitted an anonymized log for their review. If accepted, I will post it here.

@philsc
Contributor Author

philsc commented Jan 12, 2022

@jlaxson @philsc Are you two working on the same code base?

@larsrc-google , no we are not.

@Wyverald
Member

@philsc Thank you! Given how elusive this issue is and its relatively low prevalence, I talked to a few people and decided to go ahead and release an RC4 without resolving this issue. If we can get this debugged and resolved soon-ish (say, within a week from now), then we can cherry-pick the fix; otherwise we'll just ship 5.0 and aim for a 5.0.1 point release to eventually fix this.

@larsrc-google
Contributor

I now have a Bazel setup where I can do dynamic execution, but I don't have a target I can reproduce this with. If anyone sees this happening with publicly available srcs, please let me know. Until then, I shall see what philsc can provide.

@philsc
Contributor Author

philsc commented Jan 14, 2022

Here's an anonymized build log of one of the crashes: bazel-issue-14433.txt
I had to anonymize it in order to be allowed to post it, apologies. It should still be useful though. If not, let me know.
Next week I will hopefully be able to reproduce it with --experimental_remote_grpc_log.

@larsrc-google
Contributor

Hm, not much to learn from that, alas. The local branch was cancelled, and then the remote branch got interrupted in a non-cancellation way.

I don't think we can get this debugged quickly. Don't let it hold up the release.

@Wyverald
Member

Wyverald commented Feb 3, 2022

@bazel-io fork 5.1

@brentleyjones
Contributor

This is marked as a release blocker for 5.1. Has any progress been made on it?

@philsc
Contributor Author

philsc commented Feb 18, 2022

Apologies, I haven't had the time to collect more data, specifically to reproduce the crash with --experimental_remote_grpc_log. I suspect I will get around to it next week.

@tomrenn

tomrenn commented Feb 22, 2022

I'm able to reproduce this as well. I've shared logs captured with --experimental_remote_grpc_log and --experimental_debug_spawn_scheduler with @coeuvre.

@philsc
Contributor Author

philsc commented Mar 1, 2022

I'm having some trouble extracting these artifacts from our CI pipelines. I should be able to get this resolved, create an anonymized version of the artifacts, and upload them here. They appear to be ~300 MiB in size, though.

EDIT: I guess the file size is related to how far into the build the crash occurs.

@philsc
Contributor Author

philsc commented Mar 4, 2022

I found one crash that died very early in the run. Here's the anonymized output generated with --experimental_remote_grpc_log:
grpc-events-build.bin.5e2ba1cb-21ee-465b-88f6-7cf5520fc0e2.anonymized.gz

That log corresponds to this crash:

(22:38:46) INFO: Analyzed 69157 targets (0 packages loaded, 0 targets configured).
(22:38:46)
(22:38:46) INFO: Found 69157 targets...
(22:38:47) [0 / 280] [Prepa] Creating source manifest for //:file000000 ... (8 actions, 0 running)
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000001.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000002.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000003.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000004.pic.o, which may cause a crash.
(22:38:48) INFO: Caught InterruptedException from ExecException for remote branch of file000005.pic.o, which may cause a crash.
(22:38:48) FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.AssertionError: Neither branch of file000001.pic.o completed. Local was cancelled and done and remote was not cancelled and done.
        at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.waitBranches(DynamicSpawnStrategy.java:345)
        at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.exec(DynamicSpawnStrategy.java:733)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:68)
        at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1430)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

Note: I did not make the filenames match between the text output and the gRPC log. If that's important, let me know and I can rework it.

@larsrc-google
Contributor

This looks similar to something we've been investigating internally. Our current hypothesis is that something in the remote branch uses interrupts incorrectly. It doesn't look like these interrupts happen because the remote branch gets interrupted by the dynamic execution system.

There's a general problem that if one of the branches uses interrupts internally, it's impossible to tell whether an interrupt came from the dynamic execution cancellation or from inside the branch. But that's not what we're seeing here.
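To illustrate that ambiguity, here is a small standalone Java toy (not Bazel code): the branch catches the same InterruptedException whether the scheduler cancelled it or something inside the branch raised the interrupt itself.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InterruptAmbiguity {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<?> branch = pool.submit(() -> {
      try {
        Thread.sleep(60_000); // stand-in for the remote branch's work
      } catch (InterruptedException e) {
        // The exception carries no provenance: a dynamic-execution
        // cancellation and an internal interrupt look exactly the same here.
        System.out.println("interrupted, but by whom?");
      }
    });
    Thread.sleep(100);   // let the branch start running
    branch.cancel(true); // scheduler-style cancellation of the losing branch
    pool.shutdown();
  }
}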

@coeuvre
Member

coeuvre commented Mar 8, 2022

Found the root cause! Working on the fix.

@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) type: bug and removed under investigation labels Mar 8, 2022
@coeuvre coeuvre self-assigned this Mar 8, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Mar 21, 2022
… enabled.

Fixes bazelbuild#14433.

The root cause is that, inside `RemoteExecutionCache`, the result of `FindMissingDigests` is shared with other threads without considering error handling. For example, if two or more threads are uploading the same input and one thread is interrupted while waiting for the result of the `FindMissingDigests` call, the call is cancelled, and the other threads still waiting for the upload receive an upload error due to the cancellation, which is wrong.

This PR fixes this by effectively applying a reference count to the result of the `FindMissingDigests` call, so that if one thread is interrupted, the call won't be cancelled as long as other threads still depend on the result, and the upload can continue.

Closes bazelbuild#15001.

PiperOrigin-RevId: 436180205
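The reference-counting idea roughly amounts to the following sketch (illustrative only, not the actual RemoteExecutionCache code; the class and method names are made up):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Several threads wait on one shared in-flight call (e.g. FindMissingDigests);
// the call is only cancelled once the last interested waiter has gone away.
final class SharedCall<T> {
  private final Future<T> call;
  private final AtomicInteger waiters = new AtomicInteger();

  SharedCall(Future<T> call) {
    this.call = call;
  }

  T await() throws InterruptedException, ExecutionException {
    waiters.incrementAndGet();
    try {
      return call.get();
    } finally {
      // Cancel only if this was the last waiter and the call hasn't finished:
      // an interrupt delivered to one thread no longer breaks the upload for
      // the threads still waiting on the same result.
      if (waiters.decrementAndGet() == 0 && !call.isDone()) {
        call.cancel(true);
      }
    }
  }
}

With this, an interrupted thread simply rethrows its InterruptedException while the remaining waiters keep sharing the original call.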
Wyverald pushed a commit that referenced this issue Mar 21, 2022
… enabled. (#15091)

Fixes #14433. Closes #15001.