
Using an A100 (40G) ×8 GPU server to train T5-3B reports an OOM "resources exhausted" error #1043

Open · flyingwaters opened this issue Aug 10, 2022 · 2 comments

Comments

@flyingwaters

Why? I used a mesh shape of model:8,batch:1 and a batch size of 8 with sequence lengths {inputs: 1024, targets: 512}, and I still get an OOM.
But I have seen that PyTorch can train T5-3B on 8× V100 32G. Is this because Mesh-TensorFlow is less efficient than DeepSpeed?
I would like to understand the problem. I actually want to use T5 because TensorFlow is easy to deploy. Why can DeepSpeed train larger models than Mesh-TensorFlow?
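
For context, the settings described above would roughly correspond to an invocation like the one below, following the GPU-usage pattern in this repository's README. This is only a sketch: the model/data directories and mixture name are placeholders, and the exact gin bindings (mesh_shape, mesh_devices, batch_size, sequence_length) should be checked against the operative config for T5-3B rather than taken as the poster's actual command.

```sh
# Hypothetical 8-GPU run of T5-3B with mesh shape model:8,batch:1,
# a batch of 8 sequences, and sequence lengths inputs=1024 / targets=512.
# MODEL_DIR, DATA_DIR, and MIXTURE_NAME are placeholders, not values from this issue.
t5_mesh_transformer \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="${MODEL_DIR}/operative_config.gin" \
  --gin_param="MIXTURE_NAME = '${MIXTURE_NAME}'" \
  --gin_param="utils.run.mesh_shape = 'model:8,batch:1'" \
  --gin_param="utils.run.mesh_devices = ['gpu:0','gpu:1','gpu:2','gpu:3','gpu:4','gpu:5','gpu:6','gpu:7']" \
  --gin_param="utils.run.batch_size = ('sequences_per_batch', 8)" \
  --gin_param="utils.run.sequence_length = {'inputs': 1024, 'targets': 512}"
```

With mesh_shape = 'model:8,batch:1', the parameters are split across the 8 GPUs (model parallelism) rather than replicated, which is the usual way to try to fit T5-3B on 40GB devices.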

@Tian14267

@flyingwaters Hello! Could you share how you managed to train this T5 model? I have been wanting to train a pretrained model myself, even a small one with few parameters and little data, but I found that this project seems to rely on Cloud TPUs everywhere. Could I ask you for some pointers?

@YuxiangLee1224

Hello, sorry to bother you. Could you explain how this is used? Why is it completely different from the version on Hugging Face?
