lightning #302

Open
Delaunay opened this issue Oct 4, 2024 · 0 comments

Delaunay commented Oct 4, 2024

Why is `initialize_distributed_hpu` always called on import?!

        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 114, in <module>
        |         initialize_distributed_hpu()
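
For context, the chain is: `import torchcompat.core` discovers the gaudi plugin, which imports `habana_frameworks.torch.core`, which imports `habana_frameworks.torch.distributed.hccl`, and that module calls `initialize_distributed_hpu()` at import time. Below is a minimal sketch of the pattern (hypothetical names, not the actual Habana code) next to a lazy alternative that would not fail on a bare import:

```python
import os

_initialized = False

def initialize_distributed_hpu():
    # Stand-in for the real initializer: assumes it derives the local rank
    # from the environment and checks it against HABANA_VISIBLE_MODULES.
    global _initialized
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    modules = [m for m in os.environ.get("HABANA_VISIBLE_MODULES", "").split(",") if m]
    assert local_rank < len(modules), (
        "There is not enough devices available for training. "
        "Please verify if HABANA_VISIBLE_MODULES is set correctly."
    )
    _initialized = True

# What the package effectively does today: the call runs as a side effect of
# `import habana_frameworks.torch.distributed.hccl`, so any bare import of the
# plugin chain hits the assertion before user code runs.
# initialize_distributed_hpu()          # import-time side effect

# A lazy alternative: defer the work until distributed setup is actually needed.
def ensure_distributed_hpu():
    if not _initialized:
        initialize_distributed_hpu()
```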

Full trace

    * 5 x There is not enough devices available for training. Please verify if HABANA_VISIBLE_MODULES is set correctly.
        | Traceback (most recent call last):
        |   File "/homes/delaunap/milabench/benchmarks/lightning/main.py", line 11, in <module>
        |     import torchcompat.core as accelerator
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/__init__.py", line 19, in <module>
        |     device_module = load_available()
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/load.py", line 103, in load_available
        |     devices = load_plugins()
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/load.py", line 64, in load_plugins
        |     devices = discover_plugins(torchcompat.plugins)
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/load.py", line 48, in discover_plugins
        |     backend = importlib.import_module(name)
        |   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
        |     return _bootstrap._gcd_import(name[level:], package, level)
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/plugins/gaudi/__init__.py", line 11, in <module>
        |     import habana_frameworks.torch.core as htcore
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/__init__.py", line 41, in <module>
        |     import habana_frameworks.torch.distributed.hccl
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 114, in <module>
        |     initialize_distributed_hpu()
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 104, in initialize_distributed_hpu
        |     _setup_module_id(local_rank=local_rank, world_size=world_size)
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 42, in _setup_module_id
        |     assert local_rank < len(
        | AssertionError: There is not enough devices
        |         available for training. Please verify if HABANA_VISIBLE_MODULES
        |         is set correctly.
    * 1 x [rank: 1] Child process with PID 1019102 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
        | Traceback (most recent call last):
        |   File "/homes/delaunap/milabench/benchmarks/lightning/main.py", line 11, in <module>
        |     import torchcompat.core as accelerator
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/__init__.py", line 19, in <module>
        |     device_module = load_available()
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/load.py", line 103, in load_available
        |     devices = load_plugins()
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/load.py", line 64, in load_plugins
        |     devices = discover_plugins(torchcompat.plugins)
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/core/load.py", line 48, in discover_plugins
        |     backend = importlib.import_module(name)
        |   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
        |     return _bootstrap._gcd_import(name[level:], package, level)
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/torchcompat/plugins/gaudi/__init__.py", line 11, in <module>
        |     import habana_frameworks.torch.core as htcore
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/__init__.py", line 41, in <module>
        |     import habana_frameworks.torch.distributed.hccl
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 114, in <module>
        |     initialize_distributed_hpu()
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 104, in initialize_distributed_hpu
        |     _setup_module_id(local_rank=local_rank, world_size=world_size)
        |   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/distributed/hccl/__init__.py", line 42, in _setup_module_id
        |     assert local_rank < len(
        | AssertionError: There is not enough devices
        |         available for training. Please verify if HABANA_VISIBLE_MODULES
        |         is set correctly.
        | [rank: 1] Child process with PID 1019102 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
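
Since the assertion compares the local rank against the modules enumerated in HABANA_VISIBLE_MODULES, one possible workaround on the benchmark side, until the import-time call is removed, is to make sure that variable is exported before anything imports torchcompat.core. A sketch, assuming an 8-module Gaudi node (the module list is an assumption, not taken from the trace):

```python
import os

# Assumption: 8 visible Gaudi modules; adjust to the actual node layout.
os.environ.setdefault("HABANA_VISIBLE_MODULES", "0,1,2,3,4,5,6,7")

# Import only after the environment is set, since the import itself
# triggers initialize_distributed_hpu().
import torchcompat.core as accelerator
```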

Delaunay added the HPU label Oct 4, 2024