E0216 from backup_poller.cc or ev_epollex_linux.cc #572

Open
uduse opened this issue Feb 16, 2022 · 13 comments
Labels
Bug Something isn't working

Comments

@uduse
Contributor

uduse commented Feb 16, 2022

🐛 Bug

This happens when I spawn multiple compiler gym sessions in parallel. The error comes in two forms, but I think they are similar:

either:

E0216 05:27:23.400729219  198153 ev_epollex_linux.cc:515]    Error shutting down fd 16. errno: 9

or:

E0216 05:27:43.469307484  293529 backup_poller.cc:134]       Run client channel backup poller: {"created":"@1645007263.469241393","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1645007263.469233691","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}

To Reproduce

Steps to reproduce the behavior:

I managed to create this example that consistently reproduces the behavior:

import compiler_gym
import ray


@ray.remote
def spawn():
    with compiler_gym.make("llvm-ic-v0") as env:
        env.reset()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            raise RuntimeError


ray.get([spawn.remote() for _ in range(5000)])

Example output from the above script:

$ python sandbox.py 
2022-02-16 05:31:32,323 INFO services.py:1374 -- View the Ray dashboard at http://127.0.0.1:8265
(spawn pid=438029) E0216 05:31:43.796126540  482891 backup_poller.cc:134]       Run client channel backup poller: {"created":"@1645007503.796063126","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1645007503.796054426","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
(spawn pid=437973) E0216 05:31:48.767850290  502732 backup_poller.cc:134]       Run client channel backup poller: {"created":"@1645007508.767783963","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1645007508.767776654","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}

Sometimes the script above leaves orphan processes behind (related: #326), even though it seems all the environments created above should be closed properly.

This means that right after running the script above, ps aux | grep compiler_gym-llvm-service | grep -v grep | awk '{print $2}' | wc -l sometimes yields a number that is not 0.

Environment

Please fill in this checklist:

  • CompilerGym: compiler-gym==0.2.2
  • How you installed CompilerGym (conda, pip, source): pip
  • OS: Linux
  • Python version: 3.8.12
  • Build command you used (if compiling from source):
  • GCC/clang version (if compiling from source):
  • Bazel version (if compiling from source):
  • Versions of any other relevant libraries: ray==1.10.0
PyTorch version: 1.10.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:57:06)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-26-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 440.95.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.920
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] torch==1.10.1
[conda] mypy_extensions           0.4.3            py38h578d9bd_4    conda-forge
[conda] numpy                     1.19.5                   pypi_0    pypi
[conda] torch                     1.10.1                   pypi_0    pypi
@uduse uduse added the Bug Something isn't working label Feb 16, 2022
@ChrisCummins
Contributor

Hi Zeyi,

Thanks for the detailed bug report. It sure looks like a race condition in environment startup (or shutdown) that is causing the gRPC services to get confused. Unfortunately I haven't been able to reproduce the problem on a similar 24-core Linux machine. How many CPU cores do you have? ($ nproc)

Does the error reproduce if you only create the environments without interacting with them?

@ray.remote
def spawn():
    with compiler_gym.make("llvm-ic-v0") as env:
        pass

Workaround

This will massively slow down environment startup, but as a quick workaround you could use an inter-process lock to make sure that only a single environment is created at a time. Far from ideal, but a quick "fix":

import compiler_gym
import ray
from fasteners import InterProcessLock

def locked_compiler_gym_make(*args, **kwargs):
    """Wrapper around compiler_gym.make() that guarantees synchronous calls."""
    with InterProcessLock("/dev/shm/test.lock"):
        return compiler_gym.make(*args, **kwargs)


@ray.remote
def spawn():
    with locked_compiler_gym_make("llvm-ic-v0") as env:
        env.reset()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            raise RuntimeError

ray.get([spawn.remote() for _ in range(5000)])

Cheers,
Chris

@uduse
Contributor Author

uduse commented Feb 16, 2022

In my case, nproc==72. The script can still fail without env.step(), but it's much less likely. My "fix" was to limit the number of parallel instances: with fewer environments created in parallel, the problem happens much less often. As a result, I wouldn't be surprised if it's hard to reproduce with nproc==24. I tried your workaround and it seems to work, but yes, it drastically slows down the process 😭 A slightly better band-aid fix would be an inter-process semaphore (with a value of 24, for example), but of course the best case is compiler gym actually fixing this 😉
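For concreteness, a rough sketch of that semaphore idea — assuming the posix_ipc package for the named inter-process semaphore, which isn't something used anywhere else in this thread:

import compiler_gym
import posix_ipc  # assumed dependency providing named POSIX semaphores
import ray


def throttled_make(*args, **kwargs):
    """Hypothetical wrapper: allow at most 24 concurrent make() calls across processes."""
    # O_CREAT creates the semaphore on first use; later calls attach to the existing one,
    # and initial_value only applies at creation time.
    sem = posix_ipc.Semaphore("/compiler_gym_make", posix_ipc.O_CREAT, initial_value=24)
    sem.acquire()
    try:
        return compiler_gym.make(*args, **kwargs)
    finally:
        sem.release()


@ray.remote
def spawn():
    with throttled_make("llvm-ic-v0") as env:
        env.reset()
        env.step(env.action_space.sample())


ray.get([spawn.remote() for _ in range(5000)])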

@uduse
Contributor Author

uduse commented Feb 16, 2022

Moreover, one difficulty of dealing with this problem is that no exception is thrown to my Python process. If I could catch and process that error, I would happily throw away ~3 problematic environments out of 5000 to retain the ability to create environments in parallel. Similarly, the environment could report this in the return value of env.step() for the Python process to handle. Currently it seems there's no way for the Python process to know about the error happening under the hood.
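For example, if the failure surfaced as an exception, the retry/discard logic would be trivial. The sketch below uses a blanket except Exception because there is currently no specific exception type to catch; any narrower type here would be an assumption:

import compiler_gym
import ray


@ray.remote
def spawn():
    try:
        with compiler_gym.make("llvm-ic-v0") as env:
            env.reset()
            _, _, done, _ = env.step(env.action_space.sample())
            return done
    except Exception:  # ideally a specific service error raised by compiler_gym
        return None  # discard this one problematic environment


results = [r for r in ray.get([spawn.remote() for _ in range(5000)]) if r is not None]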

@ChrisCummins
Contributor

Okay, 72 isn't crazy, there's no reason we should expect CompilerGym not to be able to support that. My hunch is that there is a data race in the gRPC service startup that can cause more than one environment service to be bound to the same port when the system is launching a bunch of services in parallel. We let the host operating system pick an unused port to launch a new service on, I would start by looking for race opportunities here:

https://github.com/facebookresearch/CompilerGym/blob/development/compiler_gym/service/runtime/CreateAndRunCompilerGymServiceImpl.h#L94-L98
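To illustrate the general shape of the race (a generic Python sketch, not the actual service startup code): if an unused port is discovered in one step and only bound by the service in a later step, two services launched at the same time can both be handed the same port.

import socket


def find_free_port() -> int:
    """Ask the OS for an unused port, then immediately release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# Racy pattern: between find_free_port() returning and the service actually
# binding that port, another process can be handed (and bind) the same number.
port = find_free_port()
# ... time passes while the service starts up ...
# launch_service(port)  # may now collide with another service using the same port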

As a temporary workaround, something as simple as a random pause may be the right balance between reducing the risk of error and taking too long:

import random
import time

@ray.remote
def spawn():
    # Stagger startup so fewer services are launched at the same instant.
    time.sleep(random.random() * 10)
    ...  # create and use the environment as before

What exactly is the symptom of this error? Do the environments that log this error message stop working? I'm looking at the source that logs the first error you provided, and the severity of the problem isn't clear to me.

Cheers,
Chris

@ChrisCummins
Contributor

Possibly relevant: grpc/grpc#10755

@uduse
Contributor Author

uduse commented Feb 17, 2022

The sleeping slows things down but doesn't solve the problem. My program spawns environments in multiple places repeatedly, so there's still a chance of too many requests clashing at the same time. In terms of symptoms, most of the time the program treats this as a warning and proceeds normally (I'm not sure whether the bug affects the correctness of future environments), occasionally it hangs a process, and occasionally an error is thrown to Python telling me that I'm using an environment that's already closed. The behavior is very inconsistent, and I'm not even sure the error messages I showed above actually point to the place where something went wrong.

Since the current workarounds (I also tried a multi-process semaphore) are too slow, my approach right now is just to try to get lucky and discard the runs that hang or throw an exception.

@ChrisCummins
Contributor

I've started looking into this, I'll let you know how I get on.

In the meantime, here's another possible workaround I was thinking about. If the problem is a race condition in free port assignment, you could allocate a distinct port value to each process so that there shouldn't be any conflicts. This is still a total hack but might help you get unblocked:

from multiprocessing import Process
import compiler_gym
from compiler_gym.service import ConnectionOpts
from tqdm import tqdm

def spawn(port: int):
    opts = ConnectionOpts(script_args=[f"--port={port}"])
    with compiler_gym.make("llvm-v0", connection_settings=opts) as env:
        env.reset()
        env.step(0)

START_PORT = 10000
NUM_PROC = 1000

processes = [Process(target=spawn, args=(START_PORT + i,)) for i in range(NUM_PROC)]
for p in tqdm(processes):
    p.start()
for p in tqdm(processes):
    p.join()

Also, do you need to keep creating new environments? compiler_gym.make() is much more expensive than env.reset(). If the number of simultaneous environments you need is static then you can allocate them all once.
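If that's an option, here is a rough sketch using a Ray actor that calls make() once and then reuses the environment via reset(); the actor and method names are made up for illustration:

import compiler_gym
import ray


@ray.remote
class EnvWorker:
    """Hypothetical actor: one make() per worker, reused across many episodes."""

    def __init__(self):
        self.env = compiler_gym.make("llvm-ic-v0")

    def run_episode(self):
        self.env.reset()
        _, reward, _, _ = self.env.step(self.env.action_space.sample())
        return reward

    def close(self):
        self.env.close()


workers = [EnvWorker.remote() for _ in range(8)]
rewards = ray.get([workers[i % len(workers)].run_episode.remote() for i in range(5000)])
ray.get([w.close.remote() for w in workers])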

Cheers,
Chris

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Feb 17, 2022
This corrects my use of the boost::process API whereby llvm-size child
processes could become detached from the parent environment service
and be left as zombies.

I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if
this bug is relevant.
ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Feb 18, 2022
This corrects my use of the boost::process API whereby llvm-size child
processes could become detached from the parent environment service
and be left as zombies.

I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if
this bug is relevant.
@ChrisCummins
Contributor

Hi @uduse, can you please let me know what version of gRPC you have?

$ python -m pip freeze | grep grpcio

I'm seeing some more logging errors from gRPC 1.44 and think this may be relevant: ray-project/ray#22518

Cheers,
Chris

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Feb 22, 2022
@uduse
Contributor Author

uduse commented Feb 23, 2022

I got

grpcio==1.42.0
grpcio-tools==1.43.0

@uduse
Contributor Author

uduse commented Feb 25, 2022

I couldn't reproduce the bug with multiprocessing. I think it might be a bug in ray or some weird interaction between ray and compiler_gym since both of them use gRPC.

@ChrisCummins
Contributor

ChrisCummins commented Mar 1, 2022

I couldn't reproduce the bug with multiprocessing. I think it might be a bug in ray or some weird interaction between ray and compiler_gym since both of them use gRPC.

Oo interesting, thanks for following up! I have been working on a big update to gym.make() that uses an object pool for caching environments, enabling gRPC services to be re-used across environments. It's not ready for merging just yet but it should help with the issue you're seeing, and also speeds up environment construction >100x...

Cheers,
Chris

@uduse
Contributor Author

uduse commented Mar 1, 2022

I couldn't reproduce the bug with multiprocessing. I think it might be a bug in ray or some weird interaction between ray and compiler_gym since both of them use gRPC.

Oo interesting, thanks for following up! I have been working on a big update to gym.make() that uses an object pool for caching environments, enabling gRPC services to be re-used across environments. It's not ready for merging just yet but it should help with the issue you're seeing, and also speeds up environment construction >100x...

Cheers, Chris

🤣 I just implemented this kind of caching pool in my own project, called EnvSupervisor, and I kept thinking it would be nice to have it handled on compiler_gym's side.

To be more specific, my class:

  1. Puts aside no-longer-needed environments and resets them when a new environment is needed.
  2. Keeps track of all spawned environment instances, so the user can close all of them with one function call.
  3. Automatically puts environments aside when their accessor is dereferenced.
  4. Is slightly faster (4%) in terms of overall system speed (my algorithms no longer create too many environments, and the speedup comes from recycling no-longer-needed ones).

This tidies up my algorithm implementations as it removes environment management.
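A stripped-down sketch of the idea (not the actual EnvSupervisor code; the class and method names are made up): hand out recycled environments through a context manager, so that a finished environment is put aside and only reset(), not make(), is paid on the next checkout.

from contextlib import contextmanager

import compiler_gym


class EnvPool:
    """Hypothetical pool that recycles environments instead of creating new ones."""

    def __init__(self, env_id: str):
        self.env_id = env_id
        self.idle = []
        self.all = []

    @contextmanager
    def checkout(self):
        if self.idle:
            env = self.idle.pop()  # reuse a no-longer-needed environment
        else:
            env = compiler_gym.make(self.env_id)
            self.all.append(env)
        try:
            yield env
        finally:
            self.idle.append(env)  # put aside for the next checkout

    def close_all(self):
        for env in self.all:
            env.close()


pool = EnvPool("llvm-ic-v0")
with pool.checkout() as env:
    env.reset()
    env.step(env.action_space.sample())
pool.close_all()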

@ChrisCummins
Contributor

This sounds great. The change that I am working on caches only the service connections (the env.service object), not the entire environment, but otherwise sounds similar.

Glad it's working out for you :)

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Mar 6, 2022
This corrects my use of the boost::process API whereby llvm-size child
processes could become detached from the parent environment service
and be left as zombies.

I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if
this bug is relevant.
ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Mar 6, 2022