
Using an A100 (40G) ×8 GPU server to train T5-3B reports an OOM "resources exhausted" error #1043

Open · flyingwaters opened this issue Aug 10, 2022 · 2 comments

Comments

@flyingwaters

Why? I used a mesh shape of model:8,batch:1 and a batch size of 8 with sequence lengths {inputs: 1024, targets: 512}, and I still get an OOM.
But I have seen that PyTorch can train T5-3B on 8× V100 32G. Is this because Mesh-TensorFlow is less efficient than DeepSpeed?
I would like to understand the problem. I actually want to use T5 because TensorFlow is easy to deploy. Why can DeepSpeed train larger models than Mesh-TensorFlow?
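
For context, the settings described above would roughly correspond to an invocation like the one below, following the GPU-usage pattern in this repository's README. This is only a sketch: the model/data directories and mixture name are placeholders, and the exact gin bindings (mesh_shape, mesh_devices, batch_size, sequence_length) should be checked against the operative config for T5-3B rather than taken as the poster's actual command.

```sh
# Hypothetical 8-GPU run of T5-3B with mesh shape model:8,batch:1,
# a batch of 8 sequences, and sequence lengths inputs=1024 / targets=512.
# MODEL_DIR, DATA_DIR, and MIXTURE_NAME are placeholders, not values from this issue.
t5_mesh_transformer \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="${MODEL_DIR}/operative_config.gin" \
  --gin_param="MIXTURE_NAME = '${MIXTURE_NAME}'" \
  --gin_param="utils.run.mesh_shape = 'model:8,batch:1'" \
  --gin_param="utils.run.mesh_devices = ['gpu:0','gpu:1','gpu:2','gpu:3','gpu:4','gpu:5','gpu:6','gpu:7']" \
  --gin_param="utils.run.batch_size = ('sequences_per_batch', 8)" \
  --gin_param="utils.run.sequence_length = {'inputs': 1024, 'targets': 512}"
```

With mesh_shape = 'model:8,batch:1', the parameters are split across the 8 GPUs (model parallelism) rather than replicated, which is the usual way to try to fit T5-3B on 40GB devices.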

@Tian14267

@flyingwaters Hello! Could you share how you managed to train this T5 model? I have been wanting to train a pretrained model myself, even a small one with few parameters and little data, but I found that this project seems to rely on Cloud TPUs everywhere. Could I ask you for some pointers?

@YuxiangLee1224

Hello, sorry to bother you. Could you explain how this is used? Why is it completely different from the version on Hugging Face?
