Flickr30k (#285)
* max instances for debugging

* b

* printing devices

* moving tensors?

* self

* p

* p

* l

* l

* fixing heap?

* stop logging and printing

* less prints

* printing devices

* p

* .

* devices

* device

* test

* testing not sampling

* testing not using model again

* test not moving tensors

* not printing

* trying image subset

* debugging model

* going back to full (model is slow?)

* right number of instances

* distributed

* more potential hard negatives

* non-distributed

* distributed + adding another seen set

* fixing evaluation method

* format

* testing fixed eval

* fixing variable

* fixing training var

* testing new eval again

* fix

* fix

* fix?

* changing k to 5

* float

* moving labels to gpu

* long

* trying hopefully fixed loss function

* fix

* testing out the whole thing

* setting max instances to debug in distributed

* debug stuff

* fixing num images

* hopefully fixing the dataset reader in dist

* full data

* testing out brand new changes

* deleting some old comments

* fixing validation bug

* testing on 1 gpu for now

* feature cache broken?

* switching to tensor fields and stuff

* fix

* print device

* trying to not move the batch?

* moving small batches to cpu

* not printing device

* deleting old tensor?

* debug

* printing memory allocation

* moving tensor to cpu immediately?

* deleting batch?

* debug

* debug

* debug

* does this work?

* switching to eval and no grad

* fix

* mask list

* backbone roll

* typo

* log

* debug

* notes

* testing no grad

* testing validation batch size of 1

* bug

* didn't have the right variable?

* don't need to softmax?

* trying flickr30k with 8 batch and dummy captions

* full flickr?

* batch size 1

* testing training batches for validation

* Testing out val stuff

* updating reader (test will fail for now)

* debug statements to figure out why val isn't working

* testing if top images always have same scores

* getting rid of caption debugging step

* using the right caption var

* updating reader to mirror vilbert training setup

* full dataset (dummy caption embeddings)

* switching to real caption embeddings

* testing caching hard negatives

* log

* limit instances to test caching

* delete faiss

* more cache tests

* one more log statement

* single epoch to calculate hard negatives

* need to import logging

* don't log misses anymore (too slow)

* using consistent hash function (test # instances)

* Flickr30k batching (#277)

merge main + caching captions

* test caching captions and hard negatives on full

* don't log cache hits

* logging training labels to debug

* switching val to 4 way mc

* can we overfit

* not 1k instances

* not logging + overfit

* not overfit

* even fewer instances

* all instances

* even more overfitting

* back to normal

* b

* back to normal

* log loss and stuff again

* reset

* don't include hard negatives in case there's a bug

* batch size of 1

* more epochs

* only correct answer and hard negatives

* Cleanup

* Fix error in caption caching

* Find hard negatives even when we don't have enough instances

* O(1) algorithm for finding a random number with one exception (see the sketch after this list)

* Make sure the wrong caption comes from a different image

* Cross entropy loss

* trying overfitting with full instances

* use full dataset without learning rate scheduler

* don't limit instances and don't log

* batch size, scheduler, wandb

* comment out wandb

* full dataset no hard negatives

* don't log loss

* giving the correct answer a cheat word

* use local feature cache

* logging cache stuff

* different local feature cache dir

* switching to cheat box

* bug

* something up with some boxes

* no cheating and no hard negatives

* seeing if a really big batch size works

* bug

* testing 64 bs

* batch size 32

* batch size 48

* full training with 32 batch size no hard negatives

* more gradient accumulation steps

* trying to train with 10% of the data

* fix

* bumping up the learning rate, don't correct bias

* gradient accumulation + hard negatives

* use local feature cache

* changing params back

* trying real validation

* no hard negatives

* hard negatives and not real validation

* no hard negatives + real validation

* calc hn

* fixing predictors

* fix

* fix

* fix

* fix

* cleaning up PR (in progress)

* cleaning things up

* more cleanup

* change warmup steps

* only validate every ~5 epochs

* printing shapes

* more logging

* fix log

* try cat instead of stack

* different logging

* test

* fix

* try batches per epoch

* bug

* get rid of log statement

* use local feature cache

* log

* logging cache miss

* switching back to old captions to use cache

* switching back to preprocessing captions

* using nfs

* Disabling hard negatives to test epoch strat

* not logging cache misses

* write to local cache (faster)

* epoch multiplier

* no hard negatives

* hard negatives

* lowering number of warmup steps

* no hard negatives

* hard negatives

* no hard negatives

* hard negatives

* Trying Jiasen's featurizer (1x epoch mult)

* null image stuff

* null image

* don't featurize captions (no hn)

* adding vilbert ir model tests

* cleanup + test distributed

* cleanup + dist

* test distributed

* don't use shard_iterable

* fix feature dir

* changelog

* reformat

* log shapes

* removing unused vars

* using old features

* style

* lint

* lint

* don't log shapes

* lint

* fixing type

* debug

* changing test files to hopefully fix test

* using cloud link for data dir

* cleanup

* delete print

* comment

* cleanup

* fixing test assert

* committing a bunch of fixes

* not distributed

* fixing metrics

* Adding test files + upping max instances

* fixes

* Switching back to nfs cache

* renaming n

* update comment

* fix

* making test deterministic?

* sorting files to hopefully achieve consistency
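
The "O(1) algorithm" bullet above deserves a concrete form. A minimal sketch, assuming the goal is a uniform draw over [0, n) that skips one excluded index, with no retry loop (the helper name is illustrative, not the commit's):

import random

def random_index_except(n: int, exception: int) -> int:
    """Uniformly sample from range(n) excluding `exception`, without rejection sampling."""
    # Draw one of the n - 1 allowed values, then shift past the excluded index.
    i = random.randrange(n - 1)
    return i + 1 if i >= exception else i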

Co-authored-by: Dirk Groeneveld <[email protected]>
jacob-morrison and dirkgr authored Jun 25, 2021
1 parent fb35b2d commit e47da99
Showing 32 changed files with 1,178 additions and 2 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added `StanfordSentimentTreeBankDatasetReader.apply_token_indexers()` to add token_indexers rather than in `text_to_instance`
 - Added `AdversarialBiasMitigator` tests.
 - Added `adversarial-binary-gender-bias-mitigated-roberta-snli` model.
+- Added support for Flickr30k image retrieval, including a dataset reader, a model, and a training config.

 ### Fixed
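
For orientation, a minimal sketch of exercising the new feature end to end, assuming allennlp and allennlp-models are installed and using the fixture config added below (the serialization directory is a hypothetical path):

import allennlp_models.vision  # noqa: F401  (registers "flickr30k", "vilbert_ir", etc.)
from allennlp.commands.train import train_model_from_file

train_model_from_file(
    "test_fixtures/vision/flickr30k/experiment.jsonnet",
    "/tmp/flickr30k_ir",  # hypothetical serialization directory
)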
1 change: 1 addition & 0 deletions allennlp_models/vision/dataset_readers/__init__.py
@@ -4,3 +4,4 @@
 from allennlp_models.vision.dataset_readers.vgqa import VGQAReader
 from allennlp_models.vision.dataset_readers.vqav2 import VQAv2Reader
 from allennlp_models.vision.dataset_readers.visual_entailment import VisualEntailmentReader
+from allennlp_models.vision.dataset_readers.flickr30k import Flickr30kReader
480 changes: 480 additions & 0 deletions allennlp_models/vision/dataset_readers/flickr30k.py

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions allennlp_models/vision/dataset_readers/vision_reader.py
@@ -96,11 +96,13 @@ def __init__(
         max_instances: Optional[int] = None,
         image_processing_batch_size: int = 8,
         write_to_cache: bool = True,
+        manual_distributed_sharding: bool = True,
+        manual_multiprocess_sharding: bool = True,
     ) -> None:
         super().__init__(
             max_instances=max_instances,
-            manual_distributed_sharding=True,
-            manual_multiprocess_sharding=True,
+            manual_distributed_sharding=manual_distributed_sharding,
+            manual_multiprocess_sharding=manual_multiprocess_sharding,
         )

         # tokenizers and indexers
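
These new flags let a subclass opt out of the base reader's manual sharding. A hedged sketch of how a reader might use them (the class is illustrative, not the new Flickr30k reader itself):

from allennlp_models.vision.dataset_readers.vision_reader import VisionReader

class MyVisionReader(VisionReader):
    def __init__(self, **kwargs) -> None:
        # Let the data loader handle distributed sharding instead of the reader.
        super().__init__(
            manual_distributed_sharding=False,
            manual_multiprocess_sharding=True,
            **kwargs,
        )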
1 change: 1 addition & 0 deletions allennlp_models/vision/models/__init__.py
@@ -1,6 +1,7 @@
 from allennlp_models.vision.models.nlvr2 import Nlvr2Model
 from allennlp_models.vision.models.vision_text_model import VisionTextModel
 from allennlp_models.vision.models.visual_entailment import VisualEntailmentModel
+from allennlp_models.vision.models.vilbert_image_retrieval import ImageRetrievalVilbert
 from allennlp_models.vision.models.vilbert_vqa import VqaVilbert
 from allennlp_models.vision.models.heads.vqa_head import VqaHead
 from allennlp_models.vision.models.heads.visual_entailment_head import VisualEntailmentHead
138 changes: 138 additions & 0 deletions allennlp_models/vision/models/vilbert_image_retrieval.py
@@ -0,0 +1,138 @@
import logging
from typing import Dict

from overrides import overrides
import torch

from allennlp.data import TextFieldTensors, Vocabulary
from allennlp.models.model import Model
from allennlp.modules.transformer import (
    TransformerEmbeddings,
    ImageFeatureEmbeddings,
    BiModalEncoder,
)
from allennlp.training.metrics import CategoricalAccuracy
from torch.nn import CrossEntropyLoss

from allennlp_models.vision.models.vision_text_model import VisionTextModel

logger = logging.getLogger(__name__)


@Model.register("vilbert_ir")
@Model.register("vilbert_ir_from_huggingface", constructor="from_huggingface_model_name")
class ImageRetrievalVilbert(VisionTextModel):
    """
    Model for the image retrieval task, based on the ViLBERT paper.

    # Parameters

    vocab : `Vocabulary`
    text_embeddings : `TransformerEmbeddings`
    image_embeddings : `ImageFeatureEmbeddings`
    encoder : `BiModalEncoder`
    pooled_output_dim : `int`
    fusion_method : `str`, optional (default = `"mul"`)
    dropout : `float`, optional (default = `0.1`)
    k : `int`, optional (default = `1`)
    """

    def __init__(
        self,
        vocab: Vocabulary,
        text_embeddings: TransformerEmbeddings,
        image_embeddings: ImageFeatureEmbeddings,
        encoder: BiModalEncoder,
        pooled_output_dim: int,
        fusion_method: str = "mul",
        dropout: float = 0.1,
        k: int = 1,
        *,
        ignore_text: bool = False,
        ignore_image: bool = False,
    ) -> None:
        super().__init__(
            vocab,
            text_embeddings,
            image_embeddings,
            encoder,
            pooled_output_dim,
            fusion_method,
            dropout,
            is_multilabel=False,
            ignore_text=ignore_text,
            ignore_image=ignore_image,
        )
        self.classifier = torch.nn.Linear(pooled_output_dim, 1)

        self.top_1_acc = CategoricalAccuracy()
        self.top_5_acc = CategoricalAccuracy(top_k=5)
        self.top_10_acc = CategoricalAccuracy(top_k=10)
        self.loss = CrossEntropyLoss()

        self.k = k

    @overrides
    def forward(
        self,  # type: ignore
        box_features: torch.Tensor,
        box_coordinates: torch.Tensor,
        box_mask: torch.Tensor,
        caption: TextFieldTensors,
        label: torch.Tensor,
    ) -> Dict[str, torch.Tensor]:
        batch_size = box_features.shape[0]

        if self.training:
            # Shape: (batch_size, num_images, pooled_output_dim)
            pooled_output = self.backbone(box_features, box_coordinates, box_mask, caption)[
                "pooled_boxes_and_text"
            ]

            # Shape: (batch_size, num_images)
            logits = self.classifier(pooled_output).squeeze(-1)
            probs = torch.softmax(logits, dim=-1)
        else:
            with torch.no_grad():
                # Shape: (batch_size, num_images, pooled_output_dim)
                pooled_output = self.backbone(box_features, box_coordinates, box_mask, caption)[
                    "pooled_boxes_and_text"
                ]

                # Shape: (batch_size, num_images)
                logits = self.classifier(pooled_output).squeeze(-1)
                probs = torch.softmax(logits, dim=-1)

        outputs = {"logits": logits, "probs": probs}
        outputs = self._compute_loss_and_metrics(batch_size, outputs, label)
        return outputs

    @overrides
    def _compute_loss_and_metrics(
        self,
        batch_size: int,
        outputs: Dict[str, torch.Tensor],
        labels: torch.Tensor,
    ):
        outputs["loss"] = self.loss(outputs["logits"], labels) / batch_size
        self.top_1_acc(outputs["logits"], labels)
        self.top_5_acc(outputs["logits"], labels)
        self.top_10_acc(outputs["logits"], labels)
        return outputs

    @overrides
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {
            "top_1_acc": self.top_1_acc.get_metric(reset),
            "top_5_acc": self.top_5_acc.get_metric(reset),
            "top_10_acc": self.top_10_acc.get_metric(reset),
        }

    @overrides
    def make_output_human_readable(
        self, output_dict: Dict[str, torch.Tensor]
    ) -> Dict[str, torch.Tensor]:
        return output_dict

    default_predictor = "vilbert_ir"
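
As a side note on the metrics above, a small self-contained illustration (not part of the diff) of how `CategoricalAccuracy(top_k=5)` scores a batch of retrieval logits:

import torch
from allennlp.training.metrics import CategoricalAccuracy

top_5_acc = CategoricalAccuracy(top_k=5)
logits = torch.randn(4, 100)           # (batch_size, num_images) retrieval scores
labels = torch.randint(0, 100, (4,))   # index of the correct image per caption
top_5_acc(logits, labels)
print(top_5_acc.get_metric(reset=True))  # fraction of labels within the top 5 scores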
1 change: 1 addition & 0 deletions allennlp_models/vision/predictors/__init__.py
@@ -1,2 +1,3 @@
+from allennlp_models.vision.predictors.vilbert_ir import VilbertImageRetrievalPredictor
 from allennlp_models.vision.predictors.vilbert_vqa import VilbertVqaPredictor
 from allennlp_models.vision.predictors.visual_entailment import VisualEntailmentPredictor
40 changes: 40 additions & 0 deletions allennlp_models/vision/predictors/vilbert_ir.py
@@ -0,0 +1,40 @@
from typing import List, Dict

from overrides import overrides
import numpy

from allennlp.common.file_utils import cached_path
from allennlp.common.util import JsonDict
from allennlp.data import Instance
from allennlp.data.fields import LabelField
from allennlp.predictors.predictor import Predictor


@Predictor.register("vilbert_ir")
class VilbertImageRetrievalPredictor(Predictor):
    def predict(self, image: str, caption: str) -> JsonDict:
        image = cached_path(image)
        return self.predict_json({"caption": caption, "image": image})

    @overrides
    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        from allennlp_models.vision.dataset_readers.flickr30k import Flickr30kReader

        caption = json_dict["caption"]
        image = cached_path(json_dict["image"])
        if isinstance(self._dataset_reader, Flickr30kReader):
            return self._dataset_reader.text_to_instance(caption, image, use_cache=False)
        else:
            raise ValueError(
                f"Dataset reader is of type {self._dataset_reader.__class__.__name__}. "
                f"Expected {Flickr30kReader.__name__}."
            )

    @overrides
    def predictions_to_labeled_instances(
        self, instance: Instance, outputs: Dict[str, numpy.ndarray]
    ) -> List[Instance]:
        new_instance = instance.duplicate()
        label = numpy.argmax(outputs["probs"])
        new_instance.add_field("label", LabelField(int(label), skip_indexing=True))
        return [new_instance]
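
A hedged usage sketch for this predictor, assuming a trained archive exists (both paths below are hypothetical):

from allennlp.predictors import Predictor
import allennlp_models.vision  # noqa: F401  (registers "vilbert_ir")

predictor = Predictor.from_path("/tmp/flickr30k_ir/model.tar.gz", predictor_name="vilbert_ir")
result = predictor.predict(image="path/to/image.jpg", caption="A girl sits by the water.")
print(result["probs"])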
80 changes: 80 additions & 0 deletions test_fixtures/vision/flickr30k/experiment.jsonnet
@@ -0,0 +1,80 @@
local model_name = "epwalsh/bert-xsmall-dummy";

{
  "dataset_reader": {
    "type": "flickr30k",
    "image_dir": "test_fixtures/vision/images/flickr30k",
    "data_dir": "test_fixtures/vision/flickr30k/sentences",
    "image_loader": "torch",
    "image_featurizer": "null",
    "featurize_captions": false,
    "region_detector": {
      "type": "random",
      "seed": 322
    },
    "tokenizer": {
      "type": "pretrained_transformer",
      "model_name": model_name
    },
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": model_name
      }
    }
  },
  "train_data_path": "test_fixtures/vision/flickr30k/tiny-dev.txt",
  "validation_data_path": "test_fixtures/vision/flickr30k/tiny-dev.txt",
  "model": {
    "type": "vilbert_ir",
    "text_embeddings": {
      "vocab_size": 250,
      "embedding_size": 20,
      "pad_token_id": 0,
      "max_position_embeddings": 512,
      "type_vocab_size": 2,
      "dropout": 0.0
    },
    "image_embeddings": {
      "feature_size": 10,
      "embedding_size": 200
    },
    "encoder": {
      # text
      "hidden_size1": 20,
      "num_hidden_layers1": 1,
      "intermediate_size1": 40,
      "num_attention_heads1": 1,
      "attention_dropout1": 0.1,
      "hidden_dropout1": 0.1,
      "biattention_id1": [0, 1],
      "fixed_layer1": 0,

      # vision
      "hidden_size2": 200,
      "num_hidden_layers2": 1,
      "intermediate_size2": 50,
      "num_attention_heads2": 1,
      "attention_dropout2": 0.0,
      "hidden_dropout2": 0.0,
      "biattention_id2": [0, 1],
      "fixed_layer2": 0,

      "combined_num_attention_heads": 2,
      "combined_hidden_size": 200,
      "activation": "gelu",
    },
    "pooled_output_dim": 100,
    "fusion_method": "sum",
  },
  "data_loader": {
    "batch_size": 4
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 0.00005
    },
    "num_epochs": 1,
  }
}
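
A minimal sketch, assuming the repo's fixtures are on disk, of inspecting this config the way a test might (Params handles the Jsonnet evaluation):

from allennlp.common.params import Params

params = Params.from_file("test_fixtures/vision/flickr30k/experiment.jsonnet")
print(params["model"]["type"])          # "vilbert_ir"
print(params["dataset_reader"]["type"])  # "flickr30k"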
60 changes: 60 additions & 0 deletions test_fixtures/vision/flickr30k/experiment_from_huggingface.jsonnet
@@ -0,0 +1,60 @@
local model_name = "epwalsh/bert-xsmall-dummy";
{
  "dataset_reader": {
    "type": "flickr30k",
    "image_dir": "test_fixtures/vision/images/flickr30k",
    "data_dir": "test_fixtures/vision/flickr30k/sentences",
    "image_loader": "torch",
    "image_featurizer": "null",
    "featurize_captions": false,
    "region_detector": {
      "type": "random",
      "seed": 322
    },
    "tokenizer": {
      "type": "pretrained_transformer",
      "model_name": model_name
    },
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": model_name
      }
    }
  },
  "train_data_path": "test_fixtures/vision/flickr30k/tiny-dev.txt",
  "validation_data_path": "test_fixtures/vision/flickr30k/tiny-dev.txt",
  "model": {
    "type": "vilbert_ir_from_huggingface",
    "model_name": model_name,
    "image_feature_dim": 10,
    "image_num_hidden_layers": 1,
    "image_hidden_size": 200,
    "image_num_attention_heads": 1,
    "image_intermediate_size": 50,
    "image_attention_dropout": 0.0,
    "image_hidden_dropout": 0.0,
    "image_biattention_id": [0, 1],
    "image_fixed_layer": 0,

    "text_biattention_id": [0, 1],
    "text_fixed_layer": 0,

    "combined_hidden_size": 200,
    "combined_num_attention_heads": 4,

    "pooled_output_dim": 100,
    "fusion_method": "sum",
    "pooled_dropout": 0.0,
  },
  "data_loader": {
    "batch_size": 32
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 0.00005
    },
    "num_epochs": 1,
  }
}
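
The "vilbert_ir_from_huggingface" type in this config resolves through the second registration on the model above (constructor="from_huggingface_model_name"). A hedged sketch of building the model from these params via the registry (this fetches the tiny dummy checkpoint from HuggingFace):

from allennlp.common.params import Params
from allennlp.data import Vocabulary
from allennlp.models import Model
import allennlp_models.vision  # noqa: F401  (registers the model type)

params = Params.from_file("test_fixtures/vision/flickr30k/experiment_from_huggingface.jsonnet")
model = Model.from_params(params=params["model"], vocab=Vocabulary.empty())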
5 changes: 5 additions & 0 deletions test_fixtures/vision/flickr30k/sentences/1.txt
@@ -0,0 +1,5 @@
[/EN#221796/people A girl] with [/EN#221804/bodyparts brown hair] sits on [/EN#221799/scene the edge of a cement area] [/EN#221798/scene overlooking water] .
[/EN#221796/people A woman] in [/EN#221797/clothing black] , seen from [/EN#221800/other behind] , sits next to [/EN#221798/scene a body of water] .
[/EN#221796/people A girl] sitting outside on [/EN#221799/other concrete] near [/EN#221798/scene water] in [/EN#221797/clothing a black dress] .
[/EN#221796/people A small girl] sits on [/EN#221799/other a ledge] by [/EN#221798/scene the water] contemplating [/EN#221802/other life] .
[/EN#221796/people A dark-haired girl] is sitting on [/EN#221798/scene the waters edge] .
5 changes: 5 additions & 0 deletions test_fixtures/vision/flickr30k/sentences/2.txt
@@ -0,0 +1,5 @@
[/EN#221796/people A girl] with [/EN#221804/bodyparts brown hair] sits on [/EN#221799/scene the edge of a concrete area] [/EN#221798/scene overlooking water] .
[/EN#221796/people A woman] in [/EN#221797/clothing black] , seen from [/EN#221800/other behind] , sits by [/EN#221798/scene a body of water] .
[/EN#221796/people A girl] sitting outside on [/EN#221799/other cement] near [/EN#221798/scene water] in [/EN#221797/clothing a black dress] .
[/EN#221796/people A small girl] sits on [/EN#221799/other an edge] by [/EN#221798/scene the water] contemplating [/EN#221802/other life] .
[/EN#221796/people A dark-haired girl] is sitting next to [/EN#221798/scene the waters edge] .
5 changes: 5 additions & 0 deletions test_fixtures/vision/flickr30k/sentences/3.txt
@@ -0,0 +1,5 @@
[/EN#221796/people A girl] without [/EN#221804/bodyparts brown hair] sits on [/EN#221799/scene the edge of a cement area] [/EN#221798/scene overlooking water] .
[/EN#221796/people A woman] wearing [/EN#221797/clothing black] , seen from [/EN#221800/other behind] , sits next to [/EN#221798/scene a body of water] .
[/EN#221796/people A girl] sitting inside on [/EN#221799/other concrete] near [/EN#221798/scene water] in [/EN#221797/clothing a black dress] .
[/EN#221796/people A small girl] sits on top of [/EN#221799/other a ledge] by [/EN#221798/scene the water] contemplating [/EN#221802/other life] .
[/EN#221796/people A dark-haired girl] is sitting by [/EN#221798/scene the waters edge] .
5 changes: 5 additions & 0 deletions test_fixtures/vision/flickr30k/sentences/4945942737.txt
@@ -0,0 +1,5 @@
[/EN#221796/people A girl] with [/EN#221804/bodyparts brown hair] sits on [/EN#221799/scene the edge of a cement area] [/EN#221798/scene overlooking water] .
[/EN#221796/people A woman] in [/EN#221797/clothing black] , seen from [/EN#221800/other behind] , sits next to [/EN#221798/scene a body of water] .
[/EN#221796/people A girl] sitting outside on [/EN#221799/other concrete] near [/EN#221798/scene water] in [/EN#221797/clothing a black dress] .
[/EN#221796/people A small girl] sits on [/EN#221799/other a ledge] by [/EN#221798/scene the water] contemplating [/EN#221802/other life] .
[/EN#221796/people A dark-haired girl] is sitting on [/EN#221798/scene the waters edge] .
5 changes: 5 additions & 0 deletions test_fixtures/vision/flickr30k/sentences/6338542128.txt
@@ -0,0 +1,5 @@
On [/EN#253080/scene a sunny , dry day] , wearing [/EN#253081/other full football gear] , [/EN#253069/people a Texas A&M football player] tries to reach [/EN#253070/people an Iowa State football player] , for [/EN#253072/other the football] during [/EN#253078/other the game] .
[/EN#253070/people An offensive player] running with [/EN#253077/other a football] while [/EN#253069/people a football player] tries to stop [/EN#0/notvisual him] during [/EN#253071/other a football game] .
[/EN#253069/people A football player] from [/EN#253074/scene Iowa State blocks] [/EN#253069/people a player] from [/EN#253075/other Texas A&M] from taking [/EN#253072/other the football] from [/EN#0/notvisual him] .
[/EN#253070/scene The Iowa State football player blocks] [/EN#253068/people a Texas A&M defenseman] while running with [/EN#253072/other the ball] .
[/EN#253073/other # 8] for [/EN#253083/bodyparts Iowa State stiff arms] [/EN#253069/people a Texas AM player] attempting to tackle [/EN#0/notvisual him] .
5 changes: 5 additions & 0 deletions test_fixtures/vision/flickr30k/test.txt
@@ -0,0 +1,5 @@
6338542128
4945942737
1
2
3