The inference results of the fine-tuned CoCa model are not as expected #751
I am also having problems with CoCa training: given enough time and further training, the model produces empty output for the caption predictions.
@lilisandy, hello. I pulled the open_clip repository, edited the lines in src/open_clip/coca_model.py exactly as changed in Pull Request #710 by gpucce, and then ran
I used the example in the documentation to fine-tune CoCa. The parameters are the same as in the example, except that I also added the pretrained parameters:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 -m training.main --dataset-type "csv" --train-data "path/to/data/dir/train2014.csv" --csv-img-key "filepath" --csv-caption-key "title" --csv-separator "\t" --warmup 1000 --batch-size 128 --lr 1e-5 --wd 0.1 --epochs 2 --workers 4 --model "coca_ViT-L-14" --pretrained "mscoco_finetuned_laion2B-s13B-b90k" --report-to "wandb" --coca-contrastive-loss-weight 0 --coca-caption-loss-weight 1 --log-every-n-steps 100
A sample of the CSV dataset:
filepath title
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A restaurant has modern wooden tables and chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long restaurant table with rattan rounded back chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg a long table with a plant on top of it surrounded with wooden chairs
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long table with a flower arrangement in the middle for meetings
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A table is adorned with wooden chairs with blue accents.
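Since the command passes --csv-separator "\t", the training script splits each row on tabs. A quick sanity check (a minimal sketch, not part of the original report; the sample string below is hypothetical) can confirm every row yields exactly the two expected columns:

```python
import csv
import io

# Hypothetical in-memory sample mirroring the tab-separated layout above.
sample = (
    "filepath\ttitle\n"
    "/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg\t"
    "A restaurant has modern wooden tables and chairs.\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
for row in rows:
    # Every row should expose exactly the "filepath" and "title" columns;
    # a stray tab inside a caption would break this invariant.
    assert set(row) == {"filepath", "title"}, row
print(len(rows))  # number of data rows parsed
```

Running the same check over the real train2014.csv would rule out a malformed separator as the cause of bad captions.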
However, the inference results of the trained model are not as expected. Below is my inference code:
import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="path/to/model/epoch_1.pt",
    precision="amp",
)

im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))
but the inference result is:
"turnpike turnpike turnpike turnpike parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway"
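A caption collapsing into a single repeated token like this is a common symptom of a degenerate caption head rather than a decoding glitch. When scanning many predictions, a simple heuristic (a sketch of my own, not from open_clip) can flag such outputs automatically:

```python
def is_degenerate(caption: str, max_ratio: float = 0.5) -> bool:
    """Flag captions dominated by a single repeated token (or empty)."""
    tokens = caption.split()
    if not tokens:
        return True  # empty captions are also degenerate
    most_common = max(tokens.count(t) for t in set(tokens))
    return most_common / len(tokens) > max_ratio

bad = "parkway parkway parkway parkway parkway parkway"
good = "A long restaurant table with rattan rounded back chairs."
print(is_degenerate(bad), is_degenerate(good))  # True False
```

If most validation captions trip this check after an epoch, the problem is in training, not in the generate call.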
The loss curve over the two epochs (image attached) shows very fast convergence, with the loss quickly dropping to 0. It seems something went wrong:
How should I modify my setup to get correct results?
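One thing worth checking (an assumption on my part, not confirmed in the issue) is the layout of epoch_1.pt: checkpoints written during training are typically wrapped in a dict with a "state_dict" entry, and keys may carry a DDP "module." prefix, which a plain load into a non-distributed model will not match. A sketch of unwrapping such a checkpoint, using a hypothetical key layout:

```python
# Hypothetical checkpoint layout, assumed from typical PyTorch training
# scripts: {"epoch": ..., "state_dict": {...}} with DDP "module." prefixes.
ckpt = {
    "epoch": 1,
    "state_dict": {
        "module.visual.proj": [0.1, 0.2],
        "module.text.token_embedding.weight": [0.3],
    },
}

# Unwrap the "state_dict" entry and strip the "module." prefix so the
# weights can be loaded into a non-DDP model via model.load_state_dict.
state = ckpt.get("state_dict", ckpt)
state = {k.removeprefix("module."): v for k, v in state.items()}
print(sorted(state))  # ['text.token_embedding.weight', 'visual.proj']
```

open_clip's checkpoint loading may already handle this unwrapping; inspecting the keys of your saved epoch_1.pt would confirm whether a mismatch here explains the degenerate captions.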