The inference results of the fine-tuned CoCa model are not as expected #751
I am also having problems with CoCa training: given enough time and further training, the model produces empty output for the caption predictions.
@lilisandy, hello. I pulled the open_clip repository, edited the lines in src/open_clip/coca_model.py exactly as changed in Pull Request #710 by gpucce, and then ran
I used the example in the documentation to fine-tune CoCa. The parameters are the same as in the example, except that I also added the pretrained parameters:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 -m training.main --dataset-type "csv" --train-data "path/to/data/dir/train2014.csv" --csv-img-key "filepath" --csv-caption-key "title" --csv-separator "\t" --warmup 1000 --batch-size 128 --lr 1e-5 --wd 0.1 --epochs 2 --workers 4 --model "coca_ViT-L-14" --pretrained "mscoco_finetuned_laion2B-s13B-b90k" --report-to "wandb" --coca-contrastive-loss-weight 0 --coca-caption-loss-weight 1 --log-every-n-steps 100
A sample of the CSV dataset:
filepath title
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A restaurant has modern wooden tables and chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long restaurant table with rattan rounded back chairs.
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg a long table with a plant on top of it surrounded with wooden chairs
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A long table with a flower arrangement in the middle for meetings
/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg A table is adorned with wooden chairs with blue accents.
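Since the command passes --csv-separator "\t", the training script splits each row on tabs. A quick sanity check (a minimal sketch, not part of the original report; the sample string below is hypothetical) can confirm every row yields exactly the two expected columns:

```python
import csv
import io

# Hypothetical in-memory sample mirroring the tab-separated layout above.
sample = (
    "filepath\ttitle\n"
    "/path/train_data/coca_train/train2014/COCO_train2014_000000057870.jpg\t"
    "A restaurant has modern wooden tables and chairs.\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
for row in rows:
    # Every row should expose exactly the "filepath" and "title" columns;
    # a stray tab inside a caption would break this invariant.
    assert set(row) == {"filepath", "title"}, row
print(len(rows))  # number of data rows parsed
```

Running the same check over the real train2014.csv would rule out a malformed separator as the cause of bad captions.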
However, the inference results of the trained model are not as expected. Below is my inference code:
import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="path/to/model/epoch_1.pt",
    precision="amp",
)

im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))
but the inference result is:
"turnpike turnpike turnpike turnpike parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway parkway"
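A caption collapsing into a single repeated token like this is a common symptom of a degenerate caption head rather than a decoding glitch. When scanning many predictions, a simple heuristic (a sketch of my own, not from open_clip) can flag such outputs automatically:

```python
def is_degenerate(caption: str, max_ratio: float = 0.5) -> bool:
    """Flag captions dominated by a single repeated token (or empty)."""
    tokens = caption.split()
    if not tokens:
        return True  # empty captions are also degenerate
    most_common = max(tokens.count(t) for t in set(tokens))
    return most_common / len(tokens) > max_ratio

bad = "parkway parkway parkway parkway parkway parkway"
good = "A long restaurant table with rattan rounded back chairs."
print(is_degenerate(bad), is_degenerate(good))  # True False
```

If most validation captions trip this check after an epoch, the problem is in training, not in the generate call.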
The loss curve over the two epochs (image attached) shows very fast convergence, with the loss quickly dropping to 0. It seems something went wrong:
How should I modify my setup to get correct results?
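One thing worth checking (an assumption on my part, not confirmed in the issue) is the layout of epoch_1.pt: checkpoints written during training are typically wrapped in a dict with a "state_dict" entry, and keys may carry a DDP "module." prefix, which a plain load into a non-distributed model will not match. A sketch of unwrapping such a checkpoint, using a hypothetical key layout:

```python
# Hypothetical checkpoint layout, assumed from typical PyTorch training
# scripts: {"epoch": ..., "state_dict": {...}} with DDP "module." prefixes.
ckpt = {
    "epoch": 1,
    "state_dict": {
        "module.visual.proj": [0.1, 0.2],
        "module.text.token_embedding.weight": [0.3],
    },
}

# Unwrap the "state_dict" entry and strip the "module." prefix so the
# weights can be loaded into a non-DDP model via model.load_state_dict.
state = ckpt.get("state_dict", ckpt)
state = {k.removeprefix("module."): v for k, v in state.items()}
print(sorted(state))  # ['text.token_embedding.weight', 'visual.proj']
```

open_clip's checkpoint loading may already handle this unwrapping; inspecting the keys of your saved epoch_1.pt would confirm whether a mismatch here explains the degenerate captions.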