
terminate called after throwing an instance of 'c10::Error' #597

Closed
CRyan2016 opened this issue Jul 16, 2023 · 3 comments
Labels: bug (Something isn't working), high priority (first issues that will be worked on)

Comments

CRyan2016 commented Jul 16, 2023

When I use QLoRA, a c10::Error is thrown.

  • 4-bit quantization is used
  • quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
  • optim is paged_adamw_32bit

Using bitsandbytes==0.39.1 and transformers==4.30.2. Roughly how this is wired up is sketched below.
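(The model id and output directory in the sketch are placeholders, not the exact values from my script; the PEFT/LoRA adapter setup is omitted. It only shows how the config above feeds into model loading and the optimizer choice.)

```python
# Minimal sketch of the QLoRA setup described above.
# Model id and output_dir are placeholders, not the real values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Base model is loaded in 4-bit; bitsandbytes provides the quantized linear layers.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",          # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# The paged 32-bit AdamW optimizer from bitsandbytes is selected via TrainingArguments.
training_args = TrainingArguments(
    output_dir="out",               # placeholder
    optim="paged_adamw_32bit",
    bf16=True,
)
```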

Traceback (most recent call last):
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 586, in
main()
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 558, in main
train_result = trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/root/.local/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
self.optimizer.step(closure)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/cuda/init.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8958528457 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f89584f23ec in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f8983bd0c64 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7f8983ba80dc in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f8983bab054 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f89ae2a8e23 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f89585089e0 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8958508af9 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f89ae506c68 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f89ae506f85 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x110632 (0x5617c44ba632 in ./python_bin)
frame #11: + 0x110059 (0x5617c44ba059 in ./python_bin)
frame #12: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #13: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #14: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #15: + 0x177ce7 (0x5617c4521ce7 in ./python_bin)
frame #16: PyDict_SetItemString + 0x4c (0x5617c4524d8c in ./python_bin)
frame #17: PyImport_Cleanup + 0xaa (0x5617c4597a2a in ./python_bin)
frame #18: Py_FinalizeEx + 0x79 (0x5617c45fd4c9 in ./python_bin)
frame #19: Py_RunMain + 0x1bc (0x5617c460083c in ./python_bin)
frame #20: Py_BytesMain + 0x39 (0x5617c4600c29 in ./python_bin)
frame #21: __libc_start_main + 0xf2 (0x7f89d0c5d192 in /lib64/libc.so.6)
frame #22: + 0x1f9ad7 (0x5617c45a3ad7 in ./python_bin)

Fatal Python error: Aborted

Thread 0x00007f86b1640640 (most recent call first):

Current thread 0x00007f89d02c8cc0 (most recent call first):

@TimDettmers (Collaborator) commented:
Please run again with CUDA_LAUNCH_BLOCKING=1 python ... to get the underlying error message. The trace above is a later, asynchronously reported error, not related to the actual failure.
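If it is easier, the variable can also be set from inside the training script, as long as it happens before anything initializes CUDA. A minimal example (the placement before the torch import is the important part):

```python
# Set CUDA_LAUNCH_BLOCKING before any CUDA work so kernel launches are
# synchronous and the failing kernel is reported at the real call site.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and anything that touches the GPU) only after setting it
```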

@TimDettmers added the "bug" (Something isn't working) and "high priority" (first issues that will be worked on) labels on Jul 16, 2023
@CRyan2016 (Author) commented:
@TimDettmers
When I run it again with CUDA_LAUNCH_BLOCKING=1 python ..., I get this error message:

Error an illegal memory access was encountered at line 117 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

  • torch version 1.13.1+cu116
  • bitsandbytes==0.39.1 transformers==4.30.2

@CRyan2016 closed this as not planned on Jul 17, 2023
@CRyan2016 reopened this on Jul 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
