[ErrorOutOfHostMemory] server shut down in environment interaction #209
Comments
What command are you running? How many parallel environments are you using, and are you using the ManiSkill vecenv?
I encountered this error on multiple occasions:
I found that there was no error when I ran the experiment on only 2 GPUs in case 2. However, when I increased the number of experiments, the aforementioned RuntimeError emerged even though my GPUs had ample VRAM.
I got a similar error again in case 2. 😢
7086 │ │
7087 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/order_enforc │
7088 │ ing.py:56 in step │
7089 │ │
7090 │ 53 │ │ """Steps through the environment with `kwargs`.""" │
7091 │ 54 │ │ if not self._has_reset: │
7092 │ 55 │ │ │ raise ResetNeeded("Cannot call env.step() before calling env.reset()") │
7093 │ ❱ 56 │ │ return self.env.step(action) │
7094 │ 57 │ │
7095 │ 58 │ def reset(self, **kwargs): │
7096 │ 59 │ │ """Resets the environment with `kwargs`.""" │
7097 │ │
7098 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py:522 in step │
7099 │ │
7100 │ 519 │ │ self, action: ActType │
7101 │ 520 │ ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]: │
7102 │ 521 │ │ """Modifies the :attr:`env` after calling :meth:`step` using :meth:`self.observa │
7103 │ ❱ 522 │ │ observation, reward, terminated, truncated, info = self.env.step(action) │
7104 │ 523 │ │ return self.observation(observation), reward, terminated, truncated, info │
7105 │ 524 │ │
7106 │ 525 │ def observation(self, observation: ObsType) -> WrapperObsType: │
7107 │ │
7108 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
7109 │ i_skill2/envs/sapien_env.py:557 in step │
7110 │ │
7111 │ 554 │ │ self.step_action(action) │
7112 │ 555 │ │ self._elapsed_steps += 1 │
7113 │ 556 │ │ │
7114 │ ❱ 557 │ │ obs = self.get_obs() │
7115 │ 558 │ │ info = self.get_info(obs=obs) │
7116 │ 559 │ │ reward = self.get_reward(obs=obs, action=action, info=info) │
7117 │ 560 │ │ terminated = self.get_done(obs=obs, info=info) │
7118 │ │
7119 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
7120 │ i_skill2/envs/sapien_env.py:263 in get_obs │
7121 │ │
7122 │ 260 │ │ elif self._obs_mode == "state_dict": │
7123 │ 261 │ │ │ return self._get_obs_state_dict() │
7124 │ 262 │ │ elif self._obs_mode == "image": │
7125 │ ❱ 263 │ │ │ return self._get_obs_images() │
7126 │ 264 │ │ else: │
7127 │ 265 │ │ │ raise NotImplementedError(self._obs_mode) │
7128 │ 266 │
7129 │ │
7130 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
7131 │ i_skill2/envs/sapien_env.py:318 in _get_obs_images │
7132 │ │
7133 │ 315 │ │ │ agent=self._get_obs_agent(), │
7134 │ 316 │ │ │ extra=self._get_obs_extra(), │
7135 │ 317 │ │ │ camera_param=self.get_camera_params(), │
7136 │ ❱ 318 │ │ │ image=self.get_images(), │
7137 │ 319 │ │ ) │
7138 │ 320 │ │
7139 │ 321 │ @property │
7140 │ │
7141 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
7142 │ i_skill2/envs/sapien_env.py:296 in get_images │
7143 │ │
7144 │ 293 │ │ """Get (raw) images from all cameras (blocking).""" │
7145 │ 294 │ │ images = OrderedDict() │
7146 │ 295 │ │ for name, cam in self._cameras.items(): │
7147 │ ❱ 296 │ │ │ images[name] = cam.get_images() │
7148 │ 297 │ │ return images │
7149 │ 298 │ │
7150 │ 299 │ def get_camera_params(self) -> Dict[str, Dict[str, np.ndarray]]: │
7151 │ │
7152 │ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
7153 │ i_skill2/sensors/camera.py:189 in get_images │
7154 │ │
7155 │ 186 │ │ for name in self.texture_names: │
7156 │ 187 │ │ │ dtype = self.TEXTURE_DTYPE[name] │
7157 │ 188 │ │ │ if dtype == "float": │
7158 │ ❱ 189 │ │ │ │ image = self.camera.get_float_texture(name) │
7159 │ 190 │ │ │ elif dtype == "uint32": │
7160 │ 191 │ │ │ │ image = self.camera.get_uint32_texture(name) │
7161 │ 192 │ │ │ else: │
7162 ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
7163 RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory
Can you try running the code while simultaneously checking which GPUs have memory in use by your code? If only one is being used, then I may have some idea.
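One way to do the check suggested above is to poll `nvidia-smi` while training runs. The sketch below is only an illustration: it assumes `nvidia-smi` is on the PATH and that the driver reports a numeric `memory.used` for every GPU. The helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of ManiSkill2 or SAPIEN.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` periodically during a run shows whether memory grows on every GPU or only on one of them.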
In each sub-process, can you try setting CUDA_VISIBLE_DEVICES to a different value?
Let me know if this works.
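A minimal sketch of that suggestion, assuming each experiment is launched as its own OS process: build a separate environment per worker with `CUDA_VISIBLE_DEVICES` set before the process starts, so the variable is in place before any rendering or CUDA library is loaded. The round-robin assignment and the names `worker_env` and `launch_workers` are assumptions for illustration; the command passed in would be your own training script.

```python
import os
import subprocess


def worker_env(worker_rank, num_gpus, base_env=None):
    """Return a copy of the environment with one GPU assigned round-robin."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(worker_rank % num_gpus)
    return env


def launch_workers(command, num_workers, num_gpus):
    """Start one sub-process per worker, each seeing a single GPU.

    `command` is a list such as [sys.executable, "train.py"], where
    "train.py" stands in for your own script (hypothetical name).
    """
    return [
        subprocess.Popen(command, env=worker_env(rank, num_gpus))
        for rank in range(num_workers)
    ]
```

Setting the variable in the parent before spawning, rather than inside the child after imports, avoids the case where a library has already enumerated the devices.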
Unfortunately, I have already used
@fbxiang any idea about this? It seems like SAPIEN is just using one GPU for some reason. Another question, @csmile-1006: what task is this testing on? Hopefully this will not be a huge issue in the near future. We are currently working on MS3, which is even faster on a single GPU than MS2 was on multiple CPUs/GPUs for visual observations, and we hope to release a usable version in a month or two.
I think I know the cause of this issue, but I do not have a workaround for ManiSkill2. I believe it is resolved in our latest GPU parallel env (in development). In ManiSkill2, each camera creates a fence, and the GPU has a global limit on the number of synchronization primitives (fences), so creating a lot of cameras could hit this limit. In the latest SAPIEN, we allow synchronizing all cameras together to avoid this issue.
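To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes exactly one fence per camera, as described above, and ignores any other fences the renderer may create. The function name and the example numbers are hypothetical, and the actual driver limit is not stated in this thread.

```python
def estimated_camera_fences(num_processes, envs_per_process, cameras_per_env):
    """Estimate live fences, assuming one fence per camera (per the comment above)."""
    return num_processes * envs_per_process * cameras_per_env


# Hypothetical setup: 4 experiments, 8 environments each, 2 cameras per environment.
fences = estimated_camera_fences(4, 8, 2)
```

Because the count is a product, doubling the number of concurrent experiments doubles the number of fences even when VRAM usage per GPU remains low, which is consistent with the error appearing only after more experiments were added.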
@StoneT2000 I am now dealing with PickSingleEGAD-v0 and TurnFaucet-v0 |
@csmile-1006 we have just released a beta version of ManiSkill 3, which may resolve your issues. The two tasks you used are currently not ported over to ManiSkill 3 (we do not plan to port over the EGAD task; TurnFaucet will probably be ported over). But the other tasks can be tested.
Closing the issue as it is stale now.
I am conducting reinforcement learning experiments using ManiSkill2 and frequently encounter errors such as
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory.
Following this error, the GPU server shuts down, a problem recurring on multiple servers. Do you happen to be familiar with this issue, or do you have any solutions?
Below is the detailed traceback for your reference: