Abstract
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated, and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation – a PaLI 5B model finetuned on DOCCI shows results equal or superior to highly performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
Y. Onoe and S. Rane—Equal contribution.
S. Rane and J. Cho—Work done as a Student Researcher at Google.
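For readers who want to inspect the data directly, the following minimal Python sketch loads DOCCI and checks the description-length statistic quoted in the abstract. It assumes the public release is mirrored on the Hugging Face Hub under the id "google/docci" with "image" and "description" fields; the dataset id and field names are assumptions and may differ from the actual release schema.

from datasets import load_dataset  # pip install datasets

# Assumed dataset id and field names; adjust to match the actual DOCCI release.
docci = load_dataset("google/docci", split="train")

# Average description length in words (the paper reports ~136 words overall).
lengths = [len(ex["description"].split()) for ex in docci]
print(f"{len(lengths)} examples, mean {sum(lengths) / len(lengths):.1f} words per description")

# Inspect a single image-description pair.
example = docci[0]
print(example["description"])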
Notes
1. This number includes both primary and secondary objects. DSG often detects nested objects (e.g., the tires of a car), which leads to a higher object count.
2. To ensure that DOCCI remains purely human-annotated, we do not alter descriptions based on suggested errors.
3. We used 4,966 of the 5,000 examples, as DALL-E 3's content filter rejected 34 rewritten prompts (our descriptions do not contain any sensitive content).
References
Adobe: Adobe Firefly (2023)
Agrawal, H., et al.: nocaps: novel object captioning at scale. In: ICCV (2019)
Anil, R., et al.: PaLM 2 technical report. arXiv (2023)
Bakr, E.M., Sun, P., Shen, X., Khan, F.F., Li, L.E., Elhoseiny, M.: HRS-Bench: holistic, reliable and scalable benchmark for text-to-image models. In: ICCV (2023)
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005)
Chang, H., Zhang, H., et al.: Muse: text-to-image generation via masked generative transformers. In: ICML (2023)
Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)
Chen, X., et al.: PaLI-3 vision language models: smaller, faster, stronger (2023)
Chen, X., et al.: PaLI: a jointly-scaled multilingual language-image model. In: ICLR (2023)
Chen, X., et al.: Microsoft COCO Captions: data collection and evaluation server. arXiv (2015)
Cho, J., et al.: Davidsonian scene graph: improving reliability in fine-grained evaluation for text-image generation. In: ICLR (2024)
Cho, J., Zala, A., Bansal, M.: DALL-Eval: probing the reasoning skills and social biases of text-to-image generation models. In: ICCV (2023)
Conwell, C., Ullman, T.D.: Testing relational understanding in text-guided image generation. arXiv (2022)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
Desai, K., Kaul, G., Aysola, Z.T., Johnson, J.: RedCaps: web-curated image-text data created by the people, for the people. In: NeurIPS: Datasets and Benchmarks Track (2021)
Doveh, S., et al.: Dense and aligned captions (DAC) promote compositional reasoning in VL models. In: NeurIPS (2023)
Freitag, M., Grangier, D., Caswell, I.: BLEU might be guilty but references are not innocent. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) EMNLP (2020)
Fu, S., et al.: DreamSim: learning new dimensions of human visual similarity using synthetic data. arXiv (2023)
Gardner, M., et al.: Evaluating models’ local decision boundaries via contrast sets. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of EMNLP (2020)
Gebru, T., et al.: Datasheets for datasets (2021)
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning (2021)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium (2018)
Hu, Y., et al.: TIFA: accurate and interpretable text-to-image faithfulness evaluation with question answering. In: CVPR (2023)
Hutchinson, B., Baldridge, J., Prabhakaran, V.: Underspecification in scene description-to-depiction tasks (2022)
Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: towards a better evaluation metric for image generation (2024)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)
Kasai, J., et al.: Transparent human evaluation for image captioning. In: NAACL (2022)
Kincaid, P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel (1975)
Kirillov, A., et al.: Segment anything. arXiv (2023)
Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available at https://github.com/openimages (2016)
Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: CVPR (2017)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
McLaughlin, G.H.: SMOG grading - a new readability formula. J. Read. 12(8), 639–646 (1969)
Lee, T., et al.: Holistic evaluation of text-to-image models (2023)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
Midjourney: Midjourney (2022)
Ohta, S., Fukui, N., Sakai, K.L.: Computational principles of syntax in the regions specialized for language: integrating theoretical linguistics and functional neuroimaging (2013)
OpenAI, et al.: GPT-4 technical report (2024)
OpenAI: GPT-4V(ision) system card (2023)
OpenAI: DALL·E 3 system card (2023)
Otani, M., et al.: Toward verifiable and reproducible human evaluation for text-to-image generation. In: CVPR (2023)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: ICLR (2024)
Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting vision and language with localized narratives. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 647–664. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_38
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS: Datasets and Benchmarks Track (2022)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
Srinivasan, K., Raman, K., Chen, J., Bendersky, M., Najork, M.: WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR (2021)
Stein, G., et al.: Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In: NeurIPS (2023)
Thapliyal, A., Pont-Tuset, J., Chen, X., Soricut, R.: CrossModal-3600: a massively multilingual multimodal evaluation dataset. In: EMNLP (2022)
Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM 59, 64–73 (2016)
Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A picture is worth more than 77 text tokens: evaluating CLIP-style models on dense captions (2023)
Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR (2015)
Wang, S., et al.: Imagen Editor and EditBench: advancing and evaluating text-guided image inpainting. In: CVPR (2023)
Yarom, M., et al.: What you see is what you read? Improving text-image alignment evaluation. In: NeurIPS (2023)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. In: TACL (2014)
Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv (2022)
Acknowledgement
First of all, we would like to express our gratitude to all members of the annotator team for their diligent and hard work on a very challenging and long-running task. We also give a huge thanks to Soravit Changpinyo and Radu Soricut for their thorough review and the constructive feedback provided on our paper. Many thanks also to Cristina Vasconcelos and Brian Gordon for their support with our experiments, and to Andrea Burns for insightful suggestions for the paper. Finally, Jason Baldridge is incredibly grateful to his family members who contributed by helping arrange scenes, taking pictures, and being patient while he took so many pictures – Cheryl, Olivia, Nash, Gray, and Esme Baldridge and Mary and Justin Reusch – and to pets Ivy, Tiger, DD and Yoshi for their roles as frequent subjects.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Onoe, Y. et al. (2025). DOCCI: Descriptions of Connected and Contrasting Images. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15118. Springer, Cham. https://doi.org/10.1007/978-3-031-73027-6_17
DOI: https://doi.org/10.1007/978-3-031-73027-6_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73026-9
Online ISBN: 978-3-031-73027-6
eBook Packages: Computer Science, Computer Science (R0)