
Researchers from Adobe and the University of North Carolina (UNC) have open-sourced CLIP-S, an image-captioning AI model that produces fine-grained descriptions of images. In evaluations comparing CLIP-S captions with those generated by other models, human judges preferred CLIP-S's output a majority of the time.

The model and experiments were described in a paper submitted to the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). CLIP-S uses a Transformer model to generate captions given an input image. During training, the model uses CLIP to determine how well the generated caption describes the image; this score is used as a reward signal for reinforcement learning (RL). To improve the grammar of the generated captions, the team fine-tuned CLIP with negative caption examples, which were generated by randomly modifying reference captions. To address the shortcomings of existing image-captioning evaluation methods, the team also developed a new benchmark dataset, FineCapEval, which includes more fine-grained image captions describing image backgrounds and relations between objects.
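The CLIP-based reward signal described above can be illustrated with a minimal sketch. The snippet below uses the Hugging Face transformers implementation of CLIP to score how well a candidate caption matches an image; the checkpoint name, the floor at zero, and the scaling factor are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: scoring a generated caption against its image with CLIP
# so the similarity can serve as a reinforcement-learning reward.
# The checkpoint, the floor at zero, and the 2.5 scale are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, caption: str, scale: float = 2.5) -> float:
    """Return a scaled cosine similarity between the image and the caption."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    cosine = (image_emb * text_emb).sum(dim=-1).item()
    return scale * max(cosine, 0.0)  # higher reward when the caption matches the image

# Example: reward = clip_reward(Image.open("photo.jpg"), "a red bicycle leaning against a fence")
```

In an actual RL training loop, a scalar like this would stand in for a reference-based metric such as BLEU when computing the policy-gradient update for the caption generator.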

Many image-captioning models are trained on datasets consisting of input images and reference captions; the training objective measures the similarity of the generated caption to the reference caption, using metrics such as BLEU. However, this often results in models that generate generic captions that describe only the prominent objects in the image, ignoring fine details that make the image distinctive. Because the reference captions in public datasets typically describe only the most prominent objects, models trained to maximize textual similarity with those references tend to generate less distinctive captions that miss the fine-grained aspects of an image that distinguish it from others.

To address this problem, the Adobe team chose to use OpenAI's CLIP model to measure the accuracy of the generated captions. CLIP measures the similarity between an image and a text string; the more closely the text describes the image, the higher the similarity. The researchers used this CLIP score to create a reward function, CLIP-S, for RL training to produce their captioning model. However, the team found that this model often generated grammatically incorrect captions, for example by repeating words: "several rows of planes parked outside a terminal window area with fog outside a terminal window motion position area motion." Their solution was to fine-tune the text-encoder portion of CLIP by providing negative examples with randomly repeated, inserted, or shuffled tokens. They also introduced a two-layer perceptron classifier head that detects whether a sentence is grammatically correct, training it jointly with the text-encoder fine-tuning.
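The negative-example construction and the grammar classifier can also be sketched briefly. In the snippet below, the perturbations mirror the repeated, inserted, and shuffled tokens described above, while the perturbation logic, the vocabulary source, and the layer sizes of the perceptron head are illustrative assumptions rather than the authors' exact settings.

```python
# Sketch: creating ungrammatical negative captions by randomly repeating,
# inserting, or shuffling tokens, plus a two-layer perceptron head that
# predicts whether a caption is grammatical. Sizes and choices are assumed.
import random
import torch.nn as nn

def make_negative_caption(caption: str, vocab: list) -> str:
    tokens = caption.split()
    op = random.choice(["repeat", "insert", "shuffle"])
    if op == "repeat":                 # duplicate a randomly chosen token in place
        i = random.randrange(len(tokens))
        tokens.insert(i, tokens[i])
    elif op == "insert":               # splice in an unrelated vocabulary word
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(vocab))
    else:                              # scramble the word order
        random.shuffle(tokens)
    return " ".join(tokens)

class GrammarHead(nn.Module):
    """Two-layer perceptron over a CLIP text embedding -> grammaticality logit."""
    def __init__(self, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_embedding):
        return self.mlp(text_embedding).squeeze(-1)  # apply sigmoid for a probability
```

During the joint fine-tuning, the reference captions would serve as positive examples and their perturbed counterparts as negatives, so the classifier learns to penalize degenerate outputs like the repeated-word caption quoted above.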

The team also created FineCapEval, a new benchmark dataset for evaluating fine-grained image captioning models. This dataset contains 500 images from the MS COCO test split and the Conceptual Captions validation split.
