terminate called after throwing an instance of 'c10::Error' #597
Please run again with |
@TimDettmers Error an illegal memory access was encountered at line 117 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
When I use QLoRA, a c10::Error is thrown. The quantization settings are:
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
),
Versions used: bitsandbytes==0.39.1, transformers==4.30.2
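For readers trying to reproduce this: the keyword arguments above look like the body of a `transformers.BitsAndBytesConfig(...)` call. That reading is an assumption, since the opening of the call is cut off in the report. A minimal sketch under that assumption (the helper name `build_quant_config` is hypothetical and not from the original script):

```python
# Quantization settings copied from the report above.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "load_in_8bit": False,
    "llm_int8_threshold": 6.0,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
}


def build_quant_config():
    """Hypothetical helper: build the config the fragment appears to describe."""
    # Imported lazily so the settings can be inspected without the GPU stack.
    import torch
    from transformers import BitsAndBytesConfig

    # Assumption: the fragment belongs to a BitsAndBytesConfig(...) call.
    return BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS)
```

The returned object would typically be passed as `quantization_config=` when loading the model with `from_pretrained`; whether the original script does exactly that is not shown in the report.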
Traceback (most recent call last):
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 586, in <module>
main()
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 558, in main
train_result = trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/root/.local/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
self.optimizer.step(closure)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/cuda/__init__.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8958528457 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f89584f23ec in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f8983bd0c64 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e0dc (0x7f8983ba80dc in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f8983bab054 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d6e23 (0x7f89ae2a8e23 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f89585089e0 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8958508af9 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x734c68 (0x7f89ae506c68 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f89ae506f85 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x110632 (0x5617c44ba632 in ./python_bin)
frame #11: <unknown function> + 0x110059 (0x5617c44ba059 in ./python_bin)
frame #12: <unknown function> + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #13: <unknown function> + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #14: <unknown function> + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #15: <unknown function> + 0x177ce7 (0x5617c4521ce7 in ./python_bin)
frame #16: PyDict_SetItemString + 0x4c (0x5617c4524d8c in ./python_bin)
frame #17: PyImport_Cleanup + 0xaa (0x5617c4597a2a in ./python_bin)
frame #18: Py_FinalizeEx + 0x79 (0x5617c45fd4c9 in ./python_bin)
frame #19: Py_RunMain + 0x1bc (0x5617c460083c in ./python_bin)
frame #20: Py_BytesMain + 0x39 (0x5617c4600c29 in ./python_bin)
frame #21: __libc_start_main + 0xf2 (0x7f89d0c5d192 in /lib64/libc.so.6)
frame #22: <unknown function> + 0x1f9ad7 (0x5617c45a3ad7 in ./python_bin)
Fatal Python error: Aborted
Thread 0x00007f86b1640640 (most recent call first):
Current thread 0x00007f89d02c8cc0 (most recent call first):
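The error text itself suggests passing `CUDA_LAUNCH_BLOCKING=1`, which should make the failing kernel surface at its own call site instead of at a later `torch.cuda.synchronize()`. A minimal sketch of setting it from Python, assuming it runs at the very top of the training script before CUDA is initialized (the helper name `enable_blocking_launches` is hypothetical):

```python
import os
import sys


def enable_blocking_launches():
    """Hypothetical helper: request synchronous CUDA kernel launches for debugging."""
    # The variable is read when CUDA initializes, so it should be set early,
    # ideally before torch is imported at all.
    already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    # True means the variable was set before torch was imported.
    return not already_imported


set_early = enable_blocking_launches()
if not set_early:
    print("torch was already imported; prefer setting the variable in the shell")
```

Setting the variable in the shell when launching the script (`CUDA_LAUNCH_BLOCKING=1 python train_qlora.py`) avoids the ordering question entirely. Expect training to run noticeably slower while it is enabled.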