Abstract
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated, and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation – a PaLI 5B model finetuned on DOCCI shows results equal or superior to highly performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
Y. Onoe and S. Rane—Equal contribution.
S. Rane and J. Cho—Work done as a Student Researcher at Google.
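For readers who want to inspect the data directly, the following minimal Python sketch loads DOCCI and checks the description-length statistic quoted in the abstract. It assumes the public release is mirrored on the Hugging Face Hub under the id "google/docci" with "image" and "description" fields; the dataset id and field names are assumptions and may differ from the actual release schema.

from datasets import load_dataset  # pip install datasets

# Assumed dataset id and field names; adjust to match the actual DOCCI release.
docci = load_dataset("google/docci", split="train")

# Average description length in words (the paper reports ~136 words overall).
lengths = [len(ex["description"].split()) for ex in docci]
print(f"{len(lengths)} examples, mean {sum(lengths) / len(lengths):.1f} words per description")

# Inspect a single image-description pair.
example = docci[0]
print(example["description"])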
Notes
1. This number includes both primary and secondary objects. DSG often detects nested objects (e.g., the tires of a car), which leads to a higher object count.
2. To ensure that DOCCI remains purely human-annotated, we do not alter descriptions based on suggested errors.
3. We used 4,966 of the 5,000 examples, as DALL-E 3's content filter rejected 34 rewritten prompts (our descriptions do not contain any sensitive content).
References
Adobe: Adobe Firefly (2023)
Agrawal, H., et al.: nocaps: novel object captioning at scale. In: ICCV (2019)
Anil, R., et al.: PaLM 2 technical report. arXiv (2023)
Bakr, E.M., Sun, P., Shen, X., Khan, F.F., Li, L.E., Elhoseiny, M.: HRS-Bench: holistic, reliable and scalable benchmark for text-to-image models. In: ICCV (2023)
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005)
Chang, H., Zhang, H., et al.: Muse: text-to-image generation via masked generative transformers. In: ICML (2023)
Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)
Chen, X., et al.: PaLI-3 vision language models: smaller, faster, stronger (2023)
Chen, X., et al.: PaLI: a jointly-scaled multilingual language-image model. In: ICLR (2023)
Chen, X., et al.: Microsoft COCO Captions: data collection and evaluation server. arXiv (2015)
Cho, J., et al.: Davidsonian scene graph: improving reliability in fine-grained evaluation for text-image generation. In: ICLR (2024)
Cho, J., Zala, A., Bansal, M.: DALL-Eval: probing the reasoning skills and social biases of text-to-image generation models. In: ICCV (2023)
Conwell, C., Ullman, T.D.: Testing relational understanding in text-guided image generation. arXiv (2022)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
Desai, K., Kaul, G., Aysola, Z.T., Johnson, J.: RedCaps: web-curated image-text data created by the people, for the people. In: NeurIPS: Datasets and Benchmarks Track (2021)
Doveh, S., et al.: Dense and aligned captions (DAC) promote compositional reasoning in VL models. In: NeurIPS (2023)
Freitag, M., Grangier, D., Caswell, I.: BLEU might be guilty but references are not innocent. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) EMNLP (2020)
Fu, S., et al.: DreamSim: learning new dimensions of human visual similarity using synthetic data. arXiv (2023)
Gardner, M., et al.: Evaluating models’ local decision boundaries via contrast sets. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of EMNLP (2020)
Gebru, T., et al.: Datasheets for datasets (2021)
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning (2021)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium (2018)
Hu, Y., et al.: TIFA: accurate and interpretable text-to-image faithfulness evaluation with question answering. In: CVPR (2023)
Hutchinson, B., Baldridge, J., Prabhakaran, V.: Underspecification in scene description-to-depiction tasks (2022)
Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: towards a better evaluation metric for image generation (2024)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)
Kasai, J., et al.: Transparent human evaluation for image captioning. In: NAACL (2022)
Kincaid, P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel (1975)
Kirillov, A., et al.: Segment anything. arXiv (2023)
Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available at https://github.com/openimages (2016)
Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: CVPR (2017)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
McLaughlin, G.H.: SMOG grading - a new readability formula. J. Read. 12(8), 639–646 (1969)
Lee, T., et al.: Holistic evaluation of text-to-image models (2023)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
Midjourney: Midjourney (2022)
Ohta, S., Fukui, N., Sakai, K.L.: Computational principles of syntax in the regions specialized for language: integrating theoretical linguistics and functional neuroimaging (2013)
OpenAI, et al.: GPT-4 technical report (2024)
OpenAI: GPT-4V(ision) system card (2023)
OpenAI: DALL·E 3 system card (2023)
Otani, M., et al.: Toward verifiable and reproducible human evaluation for text-to-image generation. In: CVPR (2023)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: ICLR (2024)
Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting vision and language with localized narratives. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 647–664. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_38
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS: Datasets and Benchmarks Track (2022)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
Srinivasan, K., Raman, K., Chen, J., Bendersky, M., Najork, M.: WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR (2021)
Stein, G., et al.: Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In: NeurIPS (2023)
Thapliyal, A., Pont-Tuset, J., Chen, X., Soricut, R.: CrossModal-3600: a massively multilingual multimodal evaluation dataset. In: EMNLP (2022)
Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM 59, 64–73 (2016)
Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A picture is worth more than 77 text tokens: evaluating CLIP-style models on dense captions (2023)
Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR (2015)
Wang, S., et al.: Imagen Editor and EditBench: advancing and evaluating text-guided image inpainting. In: CVPR (2023)
Yarom, M., et al.: What you see is what you read? Improving text-image alignment evaluation. In: NeurIPS (2023)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. In: TACL (2014)
Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv (2022)
Acknowledgement
First of all, we would like to express our gratitude to all members of the annotator team for their diligent and hard work on a very challenging and long-running task. We also give a huge thanks to Soravit Changpinyo and Radu Soricut for their thorough review and the constructive feedback provided on our paper. Many thanks also to Cristina Vasconcelos and Brian Gordon for their support with our experiments, and to Andrea Burns for insightful suggestions for the paper. Finally, Jason Baldridge is incredibly grateful to his family members who contributed by helping arrange scenes, taking pictures, and being patient while he took so many pictures – Cheryl, Olivia, Nash, Gray, and Esme Baldridge and Mary and Justin Reusch – and to pets Ivy, Tiger, DD and Yoshi for their roles as frequent subjects.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Onoe, Y. et al. (2025). DOCCI: Descriptions of Connected and Contrasting Images. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15118. Springer, Cham. https://doi.org/10.1007/978-3-031-73027-6_17
DOI: https://doi.org/10.1007/978-3-031-73027-6_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73026-9
Online ISBN: 978-3-031-73027-6
eBook Packages: Computer Science, Computer Science (R0)