[ErrorOutOfHostMemory] server shut down in environment interaction #209

Closed
csmile-1006 opened this issue Feb 10, 2024 · 12 comments

Comments

@csmile-1006

I am conducting reinforcement learning experiments using ManiSkill2 and frequently encounter errors such as
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory.

After this error, the GPU server shuts down, and the problem recurs on multiple servers. Are you familiar with this issue, or do you know of any solutions?

Below is the detailed traceback for your reference:

Traceback (most recent call last):
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data2/changyeon/NeurIPS2024/rlpd/maniskill_train_finetuning_pixels.py", line 250, in <module>
    app.run(main)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/data2/changyeon/NeurIPS2024/rlpd/maniskill_train_finetuning_pixels.py", line 197, in main
    next_observation, reward, done, info = env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/wandb_video.py", line 49, in step
    obs, reward, done, info = super().step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gym/core.py", line 280, in step
    return self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gym/wrappers/record_episode_statistics.py", line 28, in step
    observations, rewards, dones, infos = super().step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gym/core.py", line 280, in step
    return self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/frame_stack.py", line 45, in step
    obs, reward, done, info = self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/repeat_action.py", line 16, in step
    obs, reward, done, info = self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/maniskill_wrapper.py", line 704, in step
    ob, rew, terminated, truncated, info = super().step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py", line 461, in step
    return self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/maniskill_wrapper.py", line 219, in step
    next_obs, reward, terminated, truncated, info = super(ManiSkill2_ObsWrapper, self).step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py", line 522, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/time_limit.py", line 57, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/order_enforcing.py", line 56, in step
    return self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py", line 522, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 557, in step
    obs = self.get_obs()
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 263, in get_obs
    return self._get_obs_images()
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 318, in _get_obs_images
    image=self.get_images(),
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 296, in get_images
    images[name] = cam.get_images()
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/sensors/camera.py", line 189, in get_images
    image = self.camera.get_float_texture(name)
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory

@StoneT2000
Member

What command are you running? How many parallel environments are you using, and are you using the ManiSkill vecenv?

@csmile-1006
Author

I encountered this error on multiple occasions:

  1. I ran the script from ManiSkill2-Learn on 2 GPUs, with one experiment per GPU and num_procs = 16 for the parallel environments. (This issue occurred on two different servers.)

  2. I ran my custom rlpd with ManiSkill2 envs. Each experiment uses a single Gym environment, and I ran 4 experiments on 4 GPUs (one experiment per GPU).

In case 2, there was no error when I ran experiments on only 2 GPUs. However, when I increased the number of experiments, the RuntimeError above appeared even though my GPUs had ample VRAM.

@csmile-1006
Author

I got a similar error again in case 2. 😢

│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/order_enforc │
│ ing.py:56 in step                                                                                │
│                                                                                                  │
│   53 │   │   """Steps through the environment with `kwargs`."""                                  │
│   54 │   │   if not self._has_reset:                                                             │
│   55 │   │   │   raise ResetNeeded("Cannot call env.step() before calling env.reset()")          │
│ ❱ 56 │   │   return self.env.step(action)                                                        │
│   57 │                                                                                           │
│   58 │   def reset(self, **kwargs):                                                              │
│   59 │   │   """Resets the environment with `kwargs`."""                                         │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py:522 in step   │
│                                                                                                  │
│   519 │   │   self, action: ActType                                                              │
│   520 │   ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]:                 │
│   521 │   │   """Modifies the :attr:`env` after calling :meth:`step` using :meth:`self.observa   │
│ ❱ 522 │   │   observation, reward, terminated, truncated, info = self.env.step(action)           │
│   523 │   │   return self.observation(observation), reward, terminated, truncated, info          │
│   524 │                                                                                          │
│   525 │   def observation(self, observation: ObsType) -> WrapperObsType:                         │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:557 in step                                                          │
│                                                                                                  │
│   554 │   │   self.step_action(action)                                                           │
│   555 │   │   self._elapsed_steps += 1                                                           │
│   556 │   │                                                                                      │
│ ❱ 557 │   │   obs = self.get_obs()                                                               │
│   558 │   │   info = self.get_info(obs=obs)                                                      │
│   559 │   │   reward = self.get_reward(obs=obs, action=action, info=info)                        │
│   560 │   │   terminated = self.get_done(obs=obs, info=info)                                     │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:263 in get_obs                                                       │
│                                                                                                  │
│   260 │   │   elif self._obs_mode == "state_dict":                                               │
│   261 │   │   │   return self._get_obs_state_dict()                                              │
│   262 │   │   elif self._obs_mode == "image":                                                    │
│ ❱ 263 │   │   │   return self._get_obs_images()                                                  │
│   264 │   │   else:                                                                              │
│   265 │   │   │   raise NotImplementedError(self._obs_mode)                                      │
│   266                                                                                            │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:318 in _get_obs_images                                               │
│                                                                                                  │
│   315 │   │   │   agent=self._get_obs_agent(),                                                   │
│   316 │   │   │   extra=self._get_obs_extra(),                                                   │
│   317 │   │   │   camera_param=self.get_camera_params(),                                         │
│ ❱ 318 │   │   │   image=self.get_images(),                                                       │
│   319 │   │   )                                                                                  │
│   320 │                                                                                          │
│   321 │   @property                                                                              │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:296 in get_images                                                    │
│                                                                                                  │
│   293 │   │   """Get (raw) images from all cameras (blocking)."""                                │
│   294 │   │   images = OrderedDict()                                                             │
│   295 │   │   for name, cam in self._cameras.items():                                            │
│ ❱ 296 │   │   │   images[name] = cam.get_images()                                                │
│   297 │   │   return images                                                                      │
│   298 │                                                                                          │
│   299 │   def get_camera_params(self) -> Dict[str, Dict[str, np.ndarray]]:                       │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/sensors/camera.py:189 in get_images                                                     │
│                                                                                                  │
│   186 │   │   for name in self.texture_names:                                                    │
│   187 │   │   │   dtype = self.TEXTURE_DTYPE[name]                                               │
│   188 │   │   │   if dtype == "float":                                                           │
│ ❱ 189 │   │   │   │   image = self.camera.get_float_texture(name)                                │
│   190 │   │   │   elif dtype == "uint32":                                                        │
│   191 │   │   │   │   image = self.camera.get_uint32_texture(name)                               │
│   192 │   │   │   else:                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory

@StoneT2000
Member

Can you run the code and, at the same time, check which GPUs have memory in use by your code? If only one is being used, then I may have some idea.
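
One way to run that check is to poll `nvidia-smi` while the experiments are running. The sketch below is a minimal example, assuming `nvidia-smi` is on the PATH and reports numeric memory values; the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration and are not part of ManiSkill2 or SAPIEN.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used, memory.total' CSV rows into a dict."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage in MiB (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` in a loop while the training scripts run should show whether memory grows on one GPU or on all of them.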

@csmile-1006
Author

csmile-1006 commented Feb 14, 2024

I have attached a screenshot of GPU utilization while my code is running (on GPU 4).
Only one process is in use.

[Screenshot: GPU utilization]

@StoneT2000
Member

In each sub-process, can you try setting CUDA_VISIBLE_DEVICES to a different value?

CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
etc.

Let me know if this works
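
A minimal sketch of that suggestion, assuming each experiment is started as its own subprocess from a Python launcher. The helper names `env_for_gpu` and `launch_per_gpu` are hypothetical; only the `CUDA_VISIBLE_DEVICES` variable comes from the comment above.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_id, base_env=None):
    """Return a copy of the environment with only one GPU visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def launch_per_gpu(command, gpu_ids, **popen_kwargs):
    """Start one subprocess per GPU id, each with its own CUDA_VISIBLE_DEVICES."""
    return [
        subprocess.Popen(command, env=env_for_gpu(gpu_id), **popen_kwargs)
        for gpu_id in gpu_ids
    ]
```

For example, `launch_per_gpu([sys.executable, "train.py"], [0, 1, 2, 3])` would start four runs, one per GPU, where `train.py` stands in for whatever training script is being used.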

@csmile-1006
Author

Unfortunately, I was already setting CUDA_VISIBLE_DEVICES=X when running my code, and I still hit the same ErrorOutOfHostMemory error. 😢

@StoneT2000
Member

StoneT2000 commented Feb 16, 2024

@fbxiang any idea about this? It seems like SAPIEN is just using one GPU for some reason?

Another question, @csmile-1006: which task are you testing on?

Hopefully this will not be a major issue in the near future. We are currently working on MS3, which is even faster on a single GPU than MS2 was on multiple CPUs/GPUs for visual observations, and we hope to release a usable version in a month or two.

@fbxiang
Contributor

fbxiang commented Feb 16, 2024

I think I know the cause of this issue, but I do not have a workaround for ManiSkill2. I believe it is resolved in our latest GPU parallel env (in development).

In ManiSkill2, each camera creates a fence, and the GPU has a global limit on the number of synchronization primitives (fences), so creating a lot of cameras could hit this limit. In the latest SAPIEN we allow synchronizing all cameras together to avoid this issue.
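
To make the scaling concrete, here is a toy back-of-the-envelope model, not SAPIEN code. It assumes one fence per camera in the per-camera scheme and one fence per environment when all cameras are synchronized together. The function names, the camera count, and those per-scheme fence counts are illustrative assumptions, and the actual driver limit is not stated in this thread.

```python
def fences_per_camera_sync(num_experiments, envs_per_experiment, cameras_per_env):
    """Toy count: every camera in every environment holds its own fence."""
    return num_experiments * envs_per_experiment * cameras_per_env


def fences_joint_sync(num_experiments, envs_per_experiment):
    """Toy count: cameras in one environment share a single fence (assumed)."""
    return num_experiments * envs_per_experiment


# Case 1 from this thread: 2 experiments with num_procs = 16 each.
# cameras_per_env = 2 is an assumed value for illustration only.
per_camera = fences_per_camera_sync(2, 16, cameras_per_env=2)
joint = fences_joint_sync(2, 16)
print(per_camera, joint)  # prints "64 32"
```

Under these assumptions the per-camera count grows with every added experiment, environment, and camera, which is consistent with the error appearing only once more experiments were started on the same server.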

@csmile-1006
Author

@StoneT2000 I am currently working with PickSingleEGAD-v0 and TurnFaucet-v0.

@StoneT2000
Member

@csmile-1006 we have just released a beta version of ManiSkill 3, which may resolve your issues. The two tasks you used have not yet been ported to ManiSkill 3: we do not plan to port the EGAD task, while TurnFaucet will probably be ported. The other tasks can be tested.

@StoneT2000
Member

Closing the issue as it is now stale.
