
deep_speed initialization for models in the transformers library #85

Open
DesperateExplorer opened this issue Jul 19, 2023 · 6 comments
Labels: help wanted (Extra attention is needed)

Comments

@DesperateExplorer

Dear authors,

I found that collie can not initialize DeepSpeed when using models in the transformers library. For example, when replace this line of script with the from_pretrained interface of the transformers library, to which any config of the type CollieConfig can not be passed, even the monitors can not be registered correctly since ds is not initialized (DeepSpeed backend not set, please initialize it using init_process_group()). Is there any workaround of this issue or Collie can only support training the internally reimplemented models?

@00INDEX
Collaborator

00INDEX commented Jul 19, 2023

Hi @DesperateExplorer, Collie can use models from transformers in the case of ZeRO parallelism, but you need to call setup_distribution manually:

from collie import setup_distribution, CollieConfig
from transformers import AutoModelForCausalLM
model_name = "openlm-research/open_llama_7b_v2"
config = CollieConfig.from_pretrained(model_name)
setup_distribution(config)
model = AutoModelForCausalLM.from_pretrained(model_name)

@DesperateExplorer
Author

Why is the memory consumption of the LLaMA-7B from transformers much larger than that of Collie's internal implementation? Taking LLaMA-7B with AdamW as an example: with the internal implementation, train_micro_batch_size_per_gpu can be 2 without causing OOM on a V100 with the ShareGPT dataset (max context = 2048), whereas with the transformers implementation, even train_micro_batch_size_per_gpu = 1 causes OOM. Even after switching to LOMO, I cannot fit a single sample (train_micro_batch_size_per_gpu = 1) into the 32 GB of memory without OOM.
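
For reference, the DeepSpeed settings involved look roughly like this (the exact CollieConfig fields and ZeRO stage are assumed, not taken from this thread):

from collie import CollieConfig

config = CollieConfig.from_pretrained("openlm-research/open_llama_7b_v2")  # illustrative
# train_micro_batch_size_per_gpu is the knob compared above:
# 2 fits with Collie's internal LLaMA on a 32 GB V100, while 1 already OOMs with transformers.
config.ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "zero_optimization": {"stage": 3},   # assumed ZeRO stage, not stated above
    "fp16": {"enabled": True},
}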

@x54-729
Contributor

x54-729 commented Jul 20, 2023

Collie's LLaMA uses flash attention for MHA, which can reduce memory usage. If use_flash is True, memory usage is lower than with the transformers implementation.
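
For example (assuming use_flash is set directly on the CollieConfig, as in the repo's examples):

from collie import CollieConfig

config = CollieConfig.from_pretrained("openlm-research/open_llama_7b_v2")  # illustrative
config.use_flash = True  # enable flash attention in Collie's internal LLaMA implementation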

@DesperateExplorer
Author

Collie's LLaMA uses flash attention for MHA, which can reduce memory usage. If use_flash is True, memory usage is lower than with the transformers implementation.

Actually, no. On V100 (Volta architecture), flash attention is not supported at all.

@Carol-gutianle
Collaborator

Carol-gutianle commented Jul 24, 2023

You can try setting pretrained_config.gradient_checkpointing to True, like this: [screenshot omitted]
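
Since the screenshot is not preserved, here is a sketch of what it presumably showed, using the standard transformers API (the exact attribute path in a Collie setup is an assumption):

from transformers import AutoConfig, AutoModelForCausalLM

model_name = "openlm-research/open_llama_7b_v2"  # illustrative
pretrained_config = AutoConfig.from_pretrained(model_name)
pretrained_config.gradient_checkpointing = True  # trade extra compute for lower activation memory
model = AutoModelForCausalLM.from_pretrained(model_name, config=pretrained_config)

# Equivalent, newer-style API on the model itself:
model.gradient_checkpointing_enable()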

@x54-729
Contributor

x54-729 commented Jul 25, 2023

You can try setting pretrained_config.gradient_checkpointing to True, like this: [screenshot omitted]

config.checkpointing=True also works now.

00INDEX added the help wanted label on Aug 1, 2023