Not an issue but a question for going forwards #227
Here is a similar issue: #12

Thank you for your interest in our project. LLaMA is a multilingual model and does have some proficiency in Chinese. Given the lack of a strong Chinese base model, we chose LLaMA as the foundation. With sufficient hardware resources, full fine-tuning would certainly yield better results than LoRA, as with FastChat's Vicuna. The vocabulary-expansion approach of Chinese-LLaMA-Alpaca also requires extensive pretraining, which is feasible if the hardware conditions allow. LLaMA's tokenizer can encode many Chinese characters through its fallback encoding mechanism, but only a limited number map one-to-one to a single token, hence the need for vocabulary expansion.
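To make the tokenization point concrete, here is a minimal sketch of why vocabulary expansion matters. LLaMA's SentencePiece tokenizer falls back to raw UTF-8 bytes for characters missing from its vocabulary, so one CJK character can cost several tokens. The toy tokenizer and tiny vocabulary below are hypothetical, not the real LLaMA tokenizer, but they mimic that byte-fallback behavior:

```python
def toy_tokenize(text, vocab):
    """Emit one token per in-vocab character; otherwise fall back
    to one token per UTF-8 byte, mimicking SentencePiece byte fallback."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # byte fallback: each UTF-8 byte becomes its own token
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# Pretend "的" was added to the vocab by expansion, while "好" was not.
vocab = {"h", "i", " ", "的"}
print(toy_tokenize("hi 的好", vocab))
# → ['h', 'i', ' ', '的', '<0xE5>', '<0xA5>', '<0xBD>']
```

An unexpanded vocabulary thus inflates sequence length (three tokens for one character here), which is exactly what expanding the vocabulary with frequent Chinese (or Vietnamese) tokens avoids.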
Hi,
I noticed that this repo focuses only on fine-tuning (with LoRA) for Chinese. However, LLaMA was trained mostly on an English corpus, with a vocabulary of only about 30,000 tokens, which is very small for anything beyond an English-focused LLM.
How would you describe the quality / perplexity of the results (7B or 13B) with LoRA alone, without expanding the Chinese vocabulary before fine-tuning? Would you suggest that full fine-tuning, or LoRA fine-tuning on a large (non-instruct) corpus, is a better way to go?
I am about to train LLaMA for Vietnamese, so I would like to learn from your experience. I am also referring to https://github.com/ymcui/Chinese-LLaMA-Alpaca, which says that LoRA pre-training on a large corpus plus vocabulary expansion should be done first, so I am a bit confused.
Thanks for any input.
Steve
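For the LoRA-versus-full question above, a rough parameter count shows why LoRA is attractive when hardware is limited. The sketch below compares trainable parameters for a single weight matrix; the 4096 hidden size is an assumption (typical of 7B-class models) and rank r=8 is a common but arbitrary choice:

```python
def lora_params(d_in, d_out, r):
    """LoRA replaces the full update dW (d_out x d_in) with the low-rank
    product B @ A, where B is (d_out x r) and A is (r x d_in)."""
    return d_out * r + r * d_in

d = 4096                      # assumed hidden size of a 7B-class model
full = d * d                  # trainable params when fine-tuning W directly
lora = lora_params(d, d, r=8) # trainable params with rank-8 LoRA
print(full, lora, full // lora)
# → 16777216 65536 256
```

A ~256x reduction per matrix in trainable parameters (and optimizer state) is what makes LoRA feasible on modest GPUs, at the cost of a lower-capacity update than full fine-tuning.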