
❓ [Question] Can't reproduce imagenet results of RN50 model trained on pixparse/cc3m-wds #930

clownrat6 opened this issue Aug 24, 2024 · 1 comment


clownrat6 commented Aug 24, 2024

This is my training script:

torchrun --nnodes 1 \
    --nproc_per_node 8 \
    -m open_clip_train.main \
    --model RN50 \
    --train-data 'datasets/cc3m/cc3m-train-{0000..0575}.tar' \
    --train-num-samples 2905954 \
    --dataset-type webdataset \
    --imagenet-val datasets/imagenet-1k/val \
    --batch-size 128 \
    --accum-freq 1 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --warmup 10000 \
    --grad-checkpointing \
    --precision amp_bf16 \
    --workers 32 \
    --log-every-n-steps 5 \
    --logs ./work_dirs/ \
    --name sample_bs128x8 \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard

According to the description in the README, the results should be:
[screenshot: expected results from the README]
But these are the results I got:
[screenshot: my results]

Is there a problem with my script or dataset?

rwightman (Collaborator) commented

@clownrat6 not sure what's going on there, it should be closer to 20. I uploaded that cc3m instance and I've trained to near 20 with it, so if it downloaded without corruption it should be fine.
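
One quick way to rule out download corruption (a minimal sketch, assuming the shard paths from the command above) is to list every tar shard and flag any that error out:

for f in datasets/cc3m/cc3m-train-{0000..0575}.tar; do
    # `tar -tf` lists archive contents and exits non-zero on truncated/corrupt shards
    tar -tf "$f" > /dev/null || echo "bad shard: $f"
done

Any shard that gets flagged is worth re-downloading before blaming the hyperparameters.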

A few thoughts:

  • that's a lot of warmup steps; I'd go with 1000 to 2000 instead of 10000.
  • you're using AMP + bfloat16. I don't think I've done a bf16 RN50 run, and there could be precision issues related to the batch norm default eps; try plain AMP + float16.
  • the majority of training runs I've done used these args; it should be fine without them for this case, but why not:
    --local-loss
    --gather-with-grad
  • there are signs of instability with that big drop at epoch 15. That's related to my point re bfloat16, but dialing back beta2 slightly could also help; for non-ViT models it defaults to 0.999, so you could try --beta2 0.99.
  • you could also try a different --seed if there's sensitivity to a particular sequence of samples.
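
Putting those suggestions together, a sketch of an adjusted launch for comparison (same data paths and hardware as the run above, logging flags omitted; the exact warmup and beta2 values are illustrative picks from the ranges mentioned here, not tuned):

# Same run as above, but with: shorter warmup, float16 AMP instead of bf16,
# local loss + gather-with-grad, and beta2 dialed back to 0.99
torchrun --nnodes 1 \
    --nproc_per_node 8 \
    -m open_clip_train.main \
    --model RN50 \
    --train-data 'datasets/cc3m/cc3m-train-{0000..0575}.tar' \
    --train-num-samples 2905954 \
    --dataset-type webdataset \
    --imagenet-val datasets/imagenet-1k/val \
    --batch-size 128 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --warmup 2000 \
    --precision amp \
    --local-loss \
    --gather-with-grad \
    --beta2 0.99 \
    --grad-checkpointing \
    --workers 32 \
    --zeroshot-frequency 1 \
    --save-frequency 1 \
    --report-to tensorboard

If the fp16 AMP run is stable where the bf16 one wasn't, that points at the precision/batch-norm interaction rather than at the data.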

Unrelated to the core issue, but for training with these smaller datasets I found extra augmentations to be helpful, e.g.:
--aug-cfg scale='(0.4, 1.0)' 're_prob=0.3'
