
❓ [Question] Can't reproduce imagenet results of RN50 model trained on pixparse/cc3m-wds #930

clownrat6 opened this issue Aug 24, 2024 · 1 comment


clownrat6 commented Aug 24, 2024

This is my training script:

torchrun --nnodes 1 \
    --nproc_per_node 8 \
    -m open_clip_train.main \
    --model RN50 \
    --train-data 'datasets/cc3m/cc3m-train-{0000..0575}.tar' \
    --train-num-samples 2905954 \
    --dataset-type webdataset \
    --imagenet-val datasets/imagenet-1k/val \
    --batch-size 128 \
    --accum-freq 1 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --warmup 10000 \
    --grad-checkpointing \
    --precision amp_bf16 \
    --workers 32 \
    --log-every-n-steps 5 \
    --logs ./work_dirs/ \
    --name sample_bs128x8 \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard

According to the description in the README, the results should be:
[screenshot: expected results from the README]
But these are the results I got:
[screenshot: my results]

Is there a problem with my script or dataset?

rwightman (Collaborator) commented

@clownrat6 not sure what's going on there, it should be closer to 20. I uploaded that cc3m instance and I've trained to near 20 with it, so if it downloaded without corruption it should be fine.
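
One quick way to rule out download corruption (a minimal sketch, assuming the shard paths from the command above) is to list every tar shard and flag any that error out:

for f in datasets/cc3m/cc3m-train-{0000..0575}.tar; do
    # `tar -tf` lists archive contents and exits non-zero on truncated/corrupt shards
    tar -tf "$f" > /dev/null || echo "bad shard: $f"
done

Any shard that gets flagged is worth re-downloading before blaming the hyperparameters.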

A few thoughts:

  • that's a lot of warmup steps; I'd go with 1000 to 2000 instead of 10000.
  • you're using AMP + bfloat16. I don't think I've done a bf16 RN50 run, and there could be precision issues related to the batch norm default eps; try plain AMP + float16.
  • the majority of training runs I've done used these args; it should be fine without them for this case, but why not:
    --local-loss
    --gather-with-grad
  • there are signs of instability with that big drop at epoch 15. That's related to my point re bfloat16, but dialing back beta2 slightly could also help; for non-ViT models it defaults to 0.999, so you could try --beta2 0.99.
  • you could also try a different --seed if there's sensitivity to a particular sequence of samples.
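
Putting those suggestions together, a sketch of an adjusted launch for comparison (same data paths and hardware as the run above, logging flags omitted; the exact warmup and beta2 values are illustrative picks from the ranges mentioned here, not tuned):

# Same run as above, but with: shorter warmup, float16 AMP instead of bf16,
# local loss + gather-with-grad, and beta2 dialed back to 0.99
torchrun --nnodes 1 \
    --nproc_per_node 8 \
    -m open_clip_train.main \
    --model RN50 \
    --train-data 'datasets/cc3m/cc3m-train-{0000..0575}.tar' \
    --train-num-samples 2905954 \
    --dataset-type webdataset \
    --imagenet-val datasets/imagenet-1k/val \
    --batch-size 128 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --warmup 2000 \
    --precision amp \
    --local-loss \
    --gather-with-grad \
    --beta2 0.99 \
    --grad-checkpointing \
    --workers 32 \
    --zeroshot-frequency 1 \
    --save-frequency 1 \
    --report-to tensorboard

If the fp16 AMP run is stable where the bf16 one wasn't, that points at the precision/batch-norm interaction rather than at the data.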

Unrelated to the core issue, but for training with these smaller datasets I found extra augmentations to be helpful, e.g.:
--aug-cfg scale='(0.4, 1.0)' 're_prob=0.3'
