
terminate called after throwing an instance of 'c10::Error' #597

Closed
CRyan2016 opened this issue Jul 16, 2023 · 3 comments
Labels: bug (Something isn't working), high priority (first issues that will be worked on)

Comments

CRyan2016 commented Jul 16, 2023

When I use QLoRA, a c10::Error is thrown.

  • 4-bit quantization is used
  • quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
  • optim is paged_adamw_32bit

Using bitsandbytes==0.39.1 and transformers==4.30.2. Roughly how this is wired up is sketched below.
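(The model id and output directory in the sketch are placeholders, not the exact values from my script; the PEFT/LoRA adapter setup is omitted. It only shows how the config above feeds into model loading and the optimizer choice.)

```python
# Minimal sketch of the QLoRA setup described above.
# Model id and output_dir are placeholders, not the real values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Base model is loaded in 4-bit; bitsandbytes provides the quantized linear layers.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",          # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# The paged 32-bit AdamW optimizer from bitsandbytes is selected via TrainingArguments.
training_args = TrainingArguments(
    output_dir="out",               # placeholder
    optim="paged_adamw_32bit",
    bf16=True,
)
```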

Traceback (most recent call last):
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 586, in
main()
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 558, in main
train_result = trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/root/.local/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
self.optimizer.step(closure)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/cuda/init.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8958528457 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f89584f23ec in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f8983bd0c64 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7f8983ba80dc in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f8983bab054 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f89ae2a8e23 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f89585089e0 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8958508af9 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f89ae506c68 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f89ae506f85 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x110632 (0x5617c44ba632 in ./python_bin)
frame #11: + 0x110059 (0x5617c44ba059 in ./python_bin)
frame #12: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #13: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #14: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #15: + 0x177ce7 (0x5617c4521ce7 in ./python_bin)
frame #16: PyDict_SetItemString + 0x4c (0x5617c4524d8c in ./python_bin)
frame #17: PyImport_Cleanup + 0xaa (0x5617c4597a2a in ./python_bin)
frame #18: Py_FinalizeEx + 0x79 (0x5617c45fd4c9 in ./python_bin)
frame #19: Py_RunMain + 0x1bc (0x5617c460083c in ./python_bin)
frame #20: Py_BytesMain + 0x39 (0x5617c4600c29 in ./python_bin)
frame #21: __libc_start_main + 0xf2 (0x7f89d0c5d192 in /lib64/libc.so.6)
frame #22: + 0x1f9ad7 (0x5617c45a3ad7 in ./python_bin)

Fatal Python error: Aborted

Thread 0x00007f86b1640640 (most recent call first):

Current thread 0x00007f89d02c8cc0 (most recent call first):

@TimDettmers (Collaborator) commented:
Please run again with CUDA_LAUNCH_BLOCKING=1 python ... to get the underlying error message. The trace above is a later, asynchronously reported error, not related to the actual failure.
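If it is easier, the variable can also be set from inside the training script, as long as it happens before anything initializes CUDA. A minimal example (the placement before the torch import is the important part):

```python
# Set CUDA_LAUNCH_BLOCKING before any CUDA work so kernel launches are
# synchronous and the failing kernel is reported at the real call site.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and anything that touches the GPU) only after setting it
```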

@TimDettmers added the "bug" (Something isn't working) and "high priority" (first issues that will be worked on) labels on Jul 16, 2023
@CRyan2016 (Author) commented:
@TimDettmers
When I run it again with CUDA_LAUNCH_BLOCKING=1 python ..., I get this error message:

Error an illegal memory access was encountered at line 117 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

  • torch version 1.13.1+cu116
  • bitsandbytes==0.39.1 transformers==4.30.2

@CRyan2016 closed this as not planned on Jul 17, 2023
@CRyan2016 reopened this on Jul 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
