[ErrorOutOfHostMemory] server shut down in environment interaction #209

csmile-1006 · 2024-02-10T01:30:08Z

I am running reinforcement learning experiments with ManiSkill2 and frequently encounter errors such as
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory.

After this error, the GPU server shuts down, and this has happened on multiple servers. Are you familiar with this issue, or do you have any solutions?

Below is the detailed traceback for your reference:

Traceback (most recent call last):
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data2/changyeon/NeurIPS2024/rlpd/maniskill_train_finetuning_pixels.py", line 250, in <module>
    app.run(main)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/data2/changyeon/NeurIPS2024/rlpd/maniskill_train_finetuning_pixels.py", line 197, in main
    next_observation, reward, done, info = env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/wandb_video.py", line 49, in step
    obs, reward, done, info = super().step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gym/core.py", line 280, in step
    return self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gym/wrappers/record_episode_statistics.py", line 28, in step
    observations, rewards, dones, infos = super().step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gym/core.py", line 280, in step
    return self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/frame_stack.py", line 45, in step
    obs, reward, done, info = self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/repeat_action.py", line 16, in step
    obs, reward, done, info = self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/maniskill_wrapper.py", line 704, in step
    ob, rew, terminated, truncated, info = super().step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py", line 461, in step
    return self.env.step(action)
  File "/data2/changyeon/NeurIPS2024/rlpd/rlpd/wrappers/maniskill_wrapper.py", line 219, in step
    next_obs, reward, terminated, truncated, info = super(ManiSkill2_ObsWrapper, self).step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py", line 522, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/time_limit.py", line 57, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/order_enforcing.py", line 56, in step
    return self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py", line 522, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 557, in step
    obs = self.get_obs()
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 263, in get_obs
    return self._get_obs_images()
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 318, in _get_obs_images
    image=self.get_images(),
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/envs/sapien_env.py", line 296, in get_images
    images[name] = cam.get_images()
  File "/home/changyeon/miniconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/mani_skill2/sensors/camera.py", line 189, in get_images
    image = self.camera.get_float_texture(name)
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory


StoneT2000 · 2024-02-10T21:44:24Z

What command are you running? How many parallel environments are you using and are you using the ManiSkill vecenv?

csmile-1006 · 2024-02-11T11:31:09Z

I encountered this error on multiple occasions:

1. I ran the script from ManiSkill2-Learn on 2 GPUs, with one experiment per GPU and num_procs = 16 for the parallel environments. (This happened on two different servers.)
2. I ran my custom rlpd code with ManiSkill2 envs. Each experiment uses a single Gym environment, and I ran 4 experiments on 4 GPUs (one experiment per GPU).

In case 2, there was no error when I ran experiments on only 2 GPUs. However, when I increased the number of experiments, the RuntimeError above appeared even though the GPUs had plenty of free VRAM.

csmile-1006 · 2024-02-14T00:59:46Z

I got a similar error again in case 2. 😢

│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/wrappers/order_enforc │
│ ing.py:56 in step                                                                                │
│                                                                                                  │
│   53 │   │   """Steps through the environment with `kwargs`."""                                  │
│   54 │   │   if not self._has_reset:                                                             │
│   55 │   │   │   raise ResetNeeded("Cannot call env.step() before calling env.reset()")          │
│ ❱ 56 │   │   return self.env.step(action)                                                        │
│   57 │                                                                                           │
│   58 │   def reset(self, **kwargs):                                                              │
│   59 │   │   """Resets the environment with `kwargs`."""                                         │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/gymnasium/core.py:522 in step   │
│                                                                                                  │
│   519 │   │   self, action: ActType                                                              │
│   520 │   ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]:                 │
│   521 │   │   """Modifies the :attr:`env` after calling :meth:`step` using :meth:`self.observa   │
│ ❱ 522 │   │   observation, reward, terminated, truncated, info = self.env.step(action)           │
│   523 │   │   return self.observation(observation), reward, terminated, truncated, info          │
│   524 │                                                                                          │
│   525 │   def observation(self, observation: ObsType) -> WrapperObsType:                         │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:557 in step                                                          │
│                                                                                                  │
│   554 │   │   self.step_action(action)                                                           │
│   555 │   │   self._elapsed_steps += 1                                                           │
│   556 │   │                                                                                      │
│ ❱ 557 │   │   obs = self.get_obs()                                                               │
│   558 │   │   info = self.get_info(obs=obs)                                                      │
│   559 │   │   reward = self.get_reward(obs=obs, action=action, info=info)                        │
│   560 │   │   terminated = self.get_done(obs=obs, info=info)                                     │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:263 in get_obs                                                       │
│                                                                                                  │
│   260 │   │   elif self._obs_mode == "state_dict":                                               │
│   261 │   │   │   return self._get_obs_state_dict()                                              │
│   262 │   │   elif self._obs_mode == "image":                                                    │
│ ❱ 263 │   │   │   return self._get_obs_images()                                                  │
│   264 │   │   else:                                                                              │
│   265 │   │   │   raise NotImplementedError(self._obs_mode)                                      │
│   266                                                                                            │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:318 in _get_obs_images                                               │
│                                                                                                  │
│   315 │   │   │   agent=self._get_obs_agent(),                                                   │
│   316 │   │   │   extra=self._get_obs_extra(),                                                   │
│   317 │   │   │   camera_param=self.get_camera_params(),                                         │
│ ❱ 318 │   │   │   image=self.get_images(),                                                       │
│   319 │   │   )                                                                                  │
│   320 │                                                                                          │
│   321 │   @property                                                                              │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/envs/sapien_env.py:296 in get_images                                                    │
│                                                                                                  │
│   293 │   │   """Get (raw) images from all cameras (blocking)."""                                │
│   294 │   │   images = OrderedDict()                                                             │
│   295 │   │   for name, cam in self._cameras.items():                                            │
│ ❱ 296 │   │   │   images[name] = cam.get_images()                                                │
│   297 │   │   return images                                                                      │
│   298 │                                                                                          │
│   299 │   def get_camera_params(self) -> Dict[str, Dict[str, np.ndarray]]:                       │
│                                                                                                  │
│ /home/changyeon/anaconda3/envs/arpv2/lib/python3.8/site-packages/mani_skill2-0.5.3-py3.8.egg/man │
│ i_skill2/sensors/camera.py:189 in get_images                                                     │
│                                                                                                  │
│   186 │   │   for name in self.texture_names:                                                    │
│   187 │   │   │   dtype = self.TEXTURE_DTYPE[name]                                               │
│   188 │   │   │   if dtype == "float":                                                           │
│ ❱ 189 │   │   │   │   image = self.camera.get_float_texture(name)                                │
│   190 │   │   │   elif dtype == "uint32":                                                        │
│   191 │   │   │   │   image = self.camera.get_uint32_texture(name)                               │
│   192 │   │   │   else:                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory

StoneT2000 · 2024-02-14T01:32:06Z

Can you run the code and, at the same time, check which GPUs have memory in use by your code? If only one is being used, then I may have some idea.
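
One way to do this check while training is running is to poll nvidia-smi and group the reported processes by GPU. The sketch below is only an illustration: parse_compute_apps and query_compute_apps are hypothetical helper names, and it assumes nvidia-smi is on PATH. Processes that only hold a rendering (graphics) context may not show up in the compute-apps list, so the plain nvidia-smi table is still worth a look.

```python
import subprocess
from collections import defaultdict

# nvidia-smi query for per-process GPU memory, one CSV row per process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,gpu_uuid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Group `pid, gpu_uuid, used_memory` CSV rows by GPU UUID."""
    per_gpu = defaultdict(list)
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected rows
        pid, gpu_uuid, used_mib = fields
        # Memory is kept as a string because nvidia-smi may report "[N/A]".
        per_gpu[gpu_uuid].append((int(pid), used_mib))
    return dict(per_gpu)


def query_compute_apps():
    """Run nvidia-smi once and return {gpu_uuid: [(pid, used_memory), ...]}."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling query_compute_apps() on the training server and comparing the PIDs against the running experiments shows whether each experiment landed on its own GPU or whether several ended up on the same one.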

csmile-1006 · 2024-02-14T10:53:08Z

I have attached a screenshot of GPU utilization while my code is running (GPU 4).
Only one process is in use.

StoneT2000 · 2024-02-15T23:43:28Z

In each sub-process, can you try setting the CUDA_VISIBLE_DEVICES to different values?

CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
etc.

Let me know if this works
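
A minimal sketch of this suggestion, assuming each experiment is started as a separate child process from a launcher script. env_for_gpu, launch_on_gpu and CHILD are hypothetical names, and the child command here only prints the variable it sees; in practice it would be the training command. This only controls which devices the child sees through CUDA, so treat it as an illustration of the suggestion rather than a confirmed fix.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES pinned."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_gpu(gpu_index, command):
    """Start `command` as a child process that only sees one GPU index."""
    return subprocess.Popen(
        command,
        env=env_for_gpu(gpu_index),
        stdout=subprocess.PIPE,
        text=True,
    )


# Placeholder child command (an assumption): replace with the real training command.
CHILD = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]


def run_demo(gpu_indices):
    """Launch one child per GPU index and collect what each child reports."""
    procs = [launch_on_gpu(index, CHILD) for index in gpu_indices]
    return [proc.communicate()[0].strip() for proc in procs]
```

The variable is set in the child's environment before the process starts, so it is already in place when any GPU library initializes.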

csmile-1006 · 2024-02-16T08:55:54Z

Unfortunately, I was already setting CUDA_VISIBLE_DEVICES=X in my code, and I still get the same ErrorOutOfHostMemory error. 😢

StoneT2000 · 2024-02-16T22:17:14Z

@fbxiang any idea about this? It seems like SAPIEN is just using one GPU for some reason?

Another question @csmile-1006 what task is this testing on?

Hopefully not a huge issue in the near future. We are currently working on MS3 which is even faster on a single gpu than MS2 was on multiple CPUs/GPUs for visual observations, hopefully releasing a usable version in a month or two.

fbxiang · 2024-02-16T22:37:26Z

I think I know the cause of this issue but I do not have a workaround for ManiSkill2. I believe this issue is resolved in our latest GPU parallel env (in development).

In ManiSkill2, each camera creates a fence, and the GPU has a global limit on the number of synchronization primitives (fences), so creating a lot of cameras could hit this limit. In the latest SAPIEN we allow synchronizing all cameras together to avoid this issue.
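
To make the difference concrete, here is a toy model in Python. It is not SAPIEN or Vulkan code: FenceBudget, per_camera_fences and shared_fence are hypothetical names, and the limit is an arbitrary number, since the real limit is not stated in this thread.

```python
class FenceBudget:
    """Toy stand-in for a device-wide cap on live fences."""

    def __init__(self, limit):
        self.limit = limit  # hypothetical cap, chosen only for illustration
        self.live = 0

    def create_fence(self):
        """Take one fence from the budget, or fail once the cap is reached."""
        if self.live >= self.limit:
            raise RuntimeError("toy model: fence budget exhausted")
        self.live += 1


def per_camera_fences(budget, num_cameras):
    """Each camera creates its own fence, so usage grows with camera count."""
    for _ in range(num_cameras):
        budget.create_fence()
    return budget.live


def shared_fence(budget, num_cameras):
    """All cameras are synchronized together through a single fence."""
    # num_cameras is deliberately unused: fence usage no longer depends on it.
    budget.create_fence()
    return budget.live
```

With a budget of 4, per_camera_fences fails as soon as a fifth camera is created across all users of the budget, while shared_fence uses one fence regardless of how many cameras there are.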

csmile-1006 · 2024-02-17T03:45:03Z

@StoneT2000 I am currently working with PickSingleEGAD-v0 and TurnFaucet-v0.

StoneT2000 · 2024-05-02T16:44:24Z

@csmile-1006 we have just released a beta version of ManiSkill 3 which may resolve your issues. The two tasks you used are currently not ported over to ManiSkill 3 (we do not plan to port over the EGAD task, TurnFaucet will probably be ported over). But the other tasks can be tested.

StoneT2000 · 2024-06-05T05:00:24Z

Closing the issue as it is stale now

StoneT2000 closed this as completed Jun 5, 2024

MasterXiong mentioned this issue Sep 3, 2024

Error when evaluation in parallel environments simpler-env/SimplerEnv#36

Closed
