
file handle exhaustion when fetching from remote cache #13435

Closed · jablin opened this issue May 6, 2021 · 9 comments

Labels: P2 We'll consider working on this in future. (Assignee optional) · team-Remote-Exec Issues and PRs for the Execution (Remote) team · type: bug

Comments

jablin (Contributor) commented May 6, 2021

Description of the problem

When making heavy use of a remote cache (i.e. fetching 100% of the build results of a large project), bazel build (4.0.0) crashes with "Too many open files" if an empty --disk_cache is used at the same time.
This is reproducible: on an 8-core CPU (with hyperthreading), it usually happens after ~4k targets.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  • Set up a project with tens of thousands of targets.
  • Set up a remote cache, then run the commands below.
# upload all build artifacts to the remote cache
bazel clean --expunge
bazel build --remote_cache=http://<host> --remote_upload_local_results=true //... 

# trigger as many downloads as possible
bazel clean --expunge
rm -fr /tmp/disk-cache
mkdir /tmp/disk-cache
bazel build --remote_cache=http://<host> --disk_cache=/tmp/disk-cache //...

This eventually leads to:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=ConfiguredTargetKey{[...]
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:563)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel
        at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:46)
        at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:310)
        at io.netty.bootstrap.Bootstrap.doResolveAndConnect(Bootstrap.java:155)
        at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:116)
        at io.netty.channel.pool.SimpleChannelPool.connectChannel(SimpleChannelPool.java:265)
        at io.netty.channel.pool.SimpleChannelPool.acquireHealthyFromPoolOrNew(SimpleChannelPool.java:177)
        at io.netty.channel.pool.SimpleChannelPool.acquire(SimpleChannelPool.java:162)
        at io.netty.channel.pool.FixedChannelPool.runTaskQueue(FixedChannelPool.java:354)
        at io.netty.channel.pool.FixedChannelPool.decrementAndRunTaskQueue(FixedChannelPool.java:335)
        at io.netty.channel.pool.FixedChannelPool.access$500(FixedChannelPool.java:40)
        at io.netty.channel.pool.FixedChannelPool$4.operationComplete(FixedChannelPool.java:311)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)
        at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:501)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
Caused by: java.lang.reflect.InvocationTargetException
        at jdk.internal.reflect.GeneratedConstructorAccessor17.newInstance(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
        at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:44)
        ... 21 more
Caused by: io.netty.channel.ChannelException: Failed to open a socket.
        at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:71)
        at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:88)
        at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:81)
        ... 25 more
Caused by: java.net.SocketException: Too many open files
        at java.base/sun.nio.ch.Net.socket0(Native Method)
        at java.base/sun.nio.ch.Net.socket(Unknown Source)
        at java.base/sun.nio.ch.Net.socket(Unknown Source)
        at java.base/sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)
        at java.base/sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)
        at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:69)
        ... 27 more

What operating system are you running Bazel on?

RHEL 7.4 (kernel 3.10.0)

What's the output of bazel info release?

release 4.0.0

Have you found anything relevant by searching the web?

I have been advised to set DefaultLimitNOFILE=524288 in /etc/systemd/system.conf, run systemctl daemon-reload, and reboot.

Any other information, logs, or outputs that you want to share?

  • Increasing ulimit -Sn from the default of 1024 to 4095 does not help at all.
  • The problem occurs even though the remote cache server (nginx) is configured with worker_connections 512;
coeuvre added the team-Remote-Exec, type: bug, and P2 labels May 7, 2021
coeuvre self-assigned this May 7, 2021
coeuvre (Member) commented May 7, 2021

Can you check whether this is reproducible with 4.1.0rc4, where we made some improvements to the channel pool?

obruns commented May 7, 2021

This happens with 4.1.0rc4 as well. I should add that @jablin and I use remote caching together with --disk_cache /path/to/somewhere. This seems to be an issue with combining remote caching and local caching.

jablin (Contributor, Author) commented May 10, 2021

I confirm @obruns's observations. What's more: in order to reproduce the problem, the --disk_cache must be empty.
I suspect there is a file handle leak when Bazel fetches thousands of entries from the remote cache and copies them to the --disk_cache within a few seconds.

coeuvre (Member) commented May 13, 2021

I can't reproduce the bug.

My setup is 50,000 genrule targets, each with many input files, that just copy their inputs to their outputs.

When ulimit -n is set to 1024, I do get a "Too many open files" error, but with a completely different stack trace:

WARNING: Reading from Remote Cache:
java.io.IOException: [DEDUCTED]/disk_cache/ed0785d0-aebb-4391-9e20-d1e95c93047c (Too many open files)
        at com.google.devtools.build.lib.unix.NativePosixFiles.openWrite(Native Method)
        at com.google.devtools.build.lib.unix.UnixFileSystem.createFileOutputStream(UnixFileSystem.java:482)
        at com.google.devtools.build.lib.vfs.AbstractFileSystem.getOutputStream(AbstractFileSystem.java:128)
        at com.google.devtools.build.lib.vfs.Path.getOutputStream(Path.java:534)
        at com.google.devtools.build.lib.vfs.Path.getOutputStream(Path.java:522)
        at com.google.devtools.build.lib.remote.disk.DiskCacheClient.saveFile(DiskCacheClient.java:172)
        at com.google.devtools.build.lib.remote.disk.DiskCacheClient.uploadActionResult(DiskCacheClient.java:116)
        at com.google.devtools.build.lib.remote.disk.DiskAndRemoteCacheClient.lambda$downloadActionResult$3(DiskAndRemoteCacheClient.java:185)
        at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
        at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
        at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:738)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:110)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:800)
        at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.setResult(AbstractTransformFuture.java:224)
        at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.setResult(AbstractTransformFuture.java:202)
        at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:738)
        at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:47)
        at com.google.devtools.build.lib.remote.http.HttpCacheClient.lambda$get$5(HttpCacheClient.java:503)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
        at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
        at io.netty.util.concurrent.DefaultPromise.setSuccess(DefaultPromise.java:96)
        at io.netty.channel.DefaultChannelPromise.setSuccess(DefaultChannelPromise.java:78)
        at io.netty.channel.DefaultChannelPromise.setSuccess(DefaultChannelPromise.java:73)
        at com.google.devtools.build.lib.remote.http.HttpDownloadHandler.succeedAndReset(HttpDownloadHandler.java:182)
        at com.google.devtools.build.lib.remote.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:112)
        at com.google.devtools.build.lib.remote.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:43)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)

When ulimit -n is increased to 4096, the error goes away and the build succeeds. I assume this is a valid error in my case.

Can you please share a more concrete repro?

jablin (Contributor, Author) commented May 20, 2021

I haven't succeeded in reproducing the problem in a "synthetic" project yet.

I assume this is a valid error in my case.

I don't think so. Why do you think that 1024 file handles are objectively insufficient? I don't believe they are: on many distributions, a limit of 1024 file handles is the default.

I still think that the error you encountered is one more (strong) hint that the disk cache code hogs hundreds of file handles.

coeuvre (Member) commented May 21, 2021

I still think that the error you encountered is one more (strong) hint that the disk cache code hogs hundreds of file handles.

We can't assume the error from my setup is caused by a file handle leak. The setup was intended to open as many files as possible: the 50,000 genrule targets have no deps, and each of them has many input/output files. I build all the targets with --jobs=500, so having more than 1024 open files at one time is intended.

I don't think so. Why do you think that 1024 file handles are objectively insufficient?

Since this setup is not the shape of builds in general, I wouldn't worry too much about file handle exhaustion in this case. However, I agree with you that if the error becomes more frequent across builds, we could probably use a write queue.
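
A minimal sketch of what such a write queue could look like, assuming the number of concurrent disk-cache writes is simply bounded by a semaphore (BoundedDiskWriter and its method names are hypothetical, not Bazel's actual API):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: bounds how many files are open for writing at any one time.
final class BoundedDiskWriter {
  private final Semaphore permits;

  BoundedDiskWriter(int maxConcurrentWrites) {
    this.permits = new Semaphore(maxConcurrentWrites);
  }

  // Blocks while too many writes are in flight, then writes data to target.
  void write(Path target, byte[] data) throws IOException, InterruptedException {
    permits.acquire(); // wait until a write slot (and thus a file handle) is free
    try (OutputStream out = Files.newOutputStream(target)) {
      out.write(data);
    } finally {
      permits.release(); // free the slot even if the write failed
    }
  }
}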

On the other hand, a file handle leak is a serious bug and should be fixed.

jablin (Contributor, Author) commented Jun 7, 2021

I've monitored the file handle usage of the bazel process during my build:

[....]
137
140
137
132
120
121
398
2154
3652
4094
4093
4090
3636
2662
544
59
59
59
59
59
[...]

It's easy to see when the "out of file handles" condition occurs.

jablin (Contributor, Author) commented Jun 7, 2021

I've sampled in more detail:

pid=$(pgrep java) && while true; do find /proc/$pid/fd -ls; echo ============; sleep 0.1; done |& tee /tmp/log

In one of the samples, out of 4095 open files, 3922 were open for writing to the local disk cache.

In the end, the files that Bazel was writing to the local disk cache (according to /proc/<pid>/fd/) were either empty or did not exist.

jablin (Contributor, Author) commented Jun 11, 2021

I've added logging around opening and closing files in DiskAndRemoteCacheClient.java. Et voilà: when the build crashes due to too many open files, the gap between "open" and "close" entries in the log closely matches my ulimit for open file handles.
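
The actual logging patch is not shown in this issue; a minimal sketch of this kind of instrumentation, assuming each disk-cache output stream is wrapped (CountingOutputStream is a hypothetical name), might look like:

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

// Hypothetical sketch: logs opens and closes so a leak shows up as a growing gap.
final class CountingOutputStream extends FilterOutputStream {
  private static final Logger logger = Logger.getLogger("disk-cache-fds");
  private static final AtomicInteger openCount = new AtomicInteger();

  CountingOutputStream(OutputStream delegate) {
    super(delegate);
    logger.info("open, currently open: " + openCount.incrementAndGet());
  }

  @Override
  public void close() throws IOException {
    try {
      super.close(); // closes the wrapped stream
    } finally {
      logger.info("close, currently open: " + openCount.decrementAndGet());
    }
  }
}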

Apparently, either

  • (a) DiskAndRemoteCacheClient opens the temporary files too early, or
  • (b) the JRE does not execute the Futures that close the temporary files in time.

I still don't know how to reproduce the problem without my original project. Sorry. However, here's the crucial part of a (working!) fix: instead of trying to fix (b) by tweaking the execution, I've found a simple solution for (a).

--- a/src/main/java/com/google/devtools/build/lib/remote/disk/DiskAndRemoteCacheClient.java
+++ b/src/main/java/com/google/devtools/build/lib/remote/disk/DiskAndRemoteCacheClient.java
@@ -142,11 +142,7 @@ public final class DiskAndRemoteCacheClient implements RemoteCacheClient {
 
     Path tempPath = newTempPath();
     final OutputStream tempOut;
-    try {
-      tempOut = tempPath.getOutputStream();
-    } catch (IOException e) {
-      return Futures.immediateFailedFuture(e);
-    }
+    tempOut = new LazyFileOutputStream(tempPath);
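
For context: LazyFileOutputStream is an existing helper class in the Bazel codebase that defers opening the underlying file until the first write. A minimal sketch of the idea, using java.nio for self-containedness (Bazel's actual implementation is built on its own vfs Path type):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: an OutputStream that opens the underlying file only on the first
// write, so merely constructing it does not consume a file handle.
final class LazyFileOutputStream extends OutputStream {
  private final Path path;
  private OutputStream out; // null until the first write

  LazyFileOutputStream(Path path) {
    this.path = path;
  }

  private OutputStream ensureOpen() throws IOException {
    if (out == null) {
      out = Files.newOutputStream(path); // file handle acquired here, not in the constructor
    }
    return out;
  }

  @Override
  public void write(int b) throws IOException {
    ensureOpen().write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    ensureOpen().write(b, off, len);
  }

  @Override
  public void flush() throws IOException {
    if (out != null) {
      out.flush();
    }
  }

  @Override
  public void close() throws IOException {
    if (out != null) {
      out.close();
    }
  }
}

The effect is that a file handle is only consumed once data actually starts arriving from the remote cache, instead of being held open for the whole time the download is pending.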

jablin added a commit to jablin/bazel that referenced this issue Jun 14, 2021
Re-use the existing "LazyFileOutputStream" in DiskAndRemoteCacheClient.java
in order to avoid "Too many open files".

Resolves bazelbuild#13435
jablin added a commit to jablin/bazel that referenced this issue Jul 14, 2021
… lazily

Re-use the existing "LazyFileOutputStream" in DiskAndRemoteCacheClient.java
in order to avoid "Too many open files", probably triggered by build steps
with > 1k output files

Resolves bazelbuild#13435
coeuvre pushed a commit to coeuvre/bazel that referenced this issue Jul 15, 2021
…"too many open files"

Re-use the existing "LazyFileOutputStream" in DiskAndRemoteCacheClient.java
in order to avoid "Too many open files".

Resolves bazelbuild#13435

Closes bazelbuild#13574.

PiperOrigin-RevId: 379892227