
DOCCI: Descriptions of Connected and Contrasting Images

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation – a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.

Y. Onoe and S. Rane—Equal contribution.

S. Rane and J. Cho—Work done as a Student Researcher at Google.
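For readers who want a quick feel for the data described in the abstract, below is a minimal sketch (not the authors' code) of loading and inspecting a DOCCI-style split with the Hugging Face datasets library. The repository id "google/docci", the split name, and the "description" field are assumptions and may differ from the official release.

    # Minimal sketch (not the authors' code) of inspecting a DOCCI-style dataset of
    # long image descriptions. The repository id "google/docci", the split name, and
    # the "description" field are assumptions and may differ from the official release.
    from datasets import load_dataset

    ds = load_dataset("google/docci", split="train")  # assumed repo id and split name

    # Average description length in words (the paper reports ~136 words on average).
    lengths = [len(example["description"].split()) for example in ds]
    print(f"{len(lengths)} descriptions, mean length {sum(lengths) / len(lengths):.1f} words")

A single pass like this is enough to sanity-check the reported average description length before using the data for fine-tuning or evaluation.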



Notes

  1. This number includes both primary and secondary objects. DSG often detects nested objects (e.g., the tires of a car), leading to a higher object count.

  2. To ensure that DOCCI remains purely annotated by humans, we do not alter or modify descriptions based on suggested errors.

  3. We used 4,966 examples, as DALL-E 3’s content filter rejected the remaining 34 rewritten prompts (our descriptions do not contain any sensitive content).

References

  1. Adobe: Adobe Firefly (2023)
  2. Agrawal, H., et al.: nocaps: novel object captioning at scale. In: ICCV (2019)
  3. Anil, R., et al.: PaLM 2 technical report. arXiv (2023)
  4. Bakr, E.M., Sun, P., Shen, X., Khan, F.F., Li, L.E., Elhoseiny, M.: HRS-Bench: holistic, reliable and scalable benchmark for text-to-image models. In: ICCV (2023)
  5. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005)
  6. Chang, H., Zhang, H., et al.: Muse: text-to-image generation via masked generative transformers. In: ICML (2023)
  7. Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)
  8. Chen, X., et al.: PaLI-3 vision language models: smaller, faster, stronger (2023)
  9. Chen, X., et al.: PaLI: a jointly-scaled multilingual language-image model. In: ICLR (2023)
  10. Chen, X., et al.: Microsoft COCO Captions: data collection and evaluation server. arXiv (2015)
  11. Cho, J., et al.: Davidsonian scene graph: improving reliability in fine-grained evaluation for text-image generation. In: ICLR (2024)
  12. Cho, J., Zala, A., Bansal, M.: DALL-Eval: probing the reasoning skills and social biases of text-to-image generation models. In: ICCV (2023)
  13. Conwell, C., Ullman, T.D.: Testing relational understanding in text-guided image generation. arXiv (2022)
  14. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
  15. Desai, K., Kaul, G., Aysola, Z.T., Johnson, J.: RedCaps: web-curated image-text data created by the people, for the people. In: NeurIPS Datasets and Benchmarks Track (2021)
  16. Doveh, S., et al.: Dense and aligned captions (DAC) promote compositional reasoning in VL models. In: NeurIPS (2023)
  17. Freitag, M., Grangier, D., Caswell, I.: BLEU might be guilty but references are not innocent. In: EMNLP (2020)
  18. Fu, S., et al.: DreamSim: learning new dimensions of human visual similarity using synthetic data. arXiv (2023)
  19. Gardner, M., et al.: Evaluating models’ local decision boundaries via contrast sets. In: Findings of EMNLP (2020)
  20. Gebru, T., et al.: Datasheets for datasets (2021)
  21. Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning (2021)
  22. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium (2018)
  23. Hu, Y., et al.: TIFA: accurate and interpretable text-to-image faithfulness evaluation with question answering. In: CVPR (2023)
  24. Hutchinson, B., Baldridge, J., Prabhakaran, V.: Underspecification in scene description-to-depiction tasks (2022)
  25. Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: towards a better evaluation metric for image generation (2024)
  26. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)
  27. Kasai, J., et al.: Transparent human evaluation for image captioning. In: NAACL (2022)
  28. Kincaid, P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel (1975)
  29. Kirillov, A., et al.: Segment anything. arXiv (2023)
  30. Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset https://github.com/openimages (2016)
  31. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: CVPR (2017)
  32. Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
  33. Laughlin, G.H.M.: SMOG grading – a new readability formula. J. Read. 12(8), 639–646 (1969)
  34. Lee, T., et al.: Holistic evaluation of text-to-image models (2023)
  35. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
  36. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  37. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
  38. Midjourney: Midjourney (2022)
  39. Ohta, S., Fukui, N., Sakai, K.L.: Computational principles of syntax in the regions specialized for language: integrating theoretical linguistics and functional neuroimaging (2013)
  40. OpenAI, et al.: GPT-4 technical report (2024)
  41. OpenAI: GPT-4V(ision) system card (2022)
  42. OpenAI: DALL·E 3 system card (2023)
  43. Otani, M., et al.: Toward verifiable and reproducible human evaluation for text-to-image generation. In: CVPR (2023)
  44. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
  45. Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: ICLR (2024)
  46. Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting vision and language with localized narratives. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 647–664. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_38
  47. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763. PMLR (2021)
  48. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents (2022)
  49. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  50. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
  51. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
  52. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022)
  53. Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS Datasets and Benchmarks Track (2022)
  54. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
  55. Srinivasan, K., Raman, K., Chen, J., Bendersky, M., Najork, M.: WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR (2021)
  56. Stein, G., et al.: Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In: NeurIPS (2023)
  57. Thapliyal, A., Pont-Tuset, J., Chen, X., Soricut, R.: CrossModal-3600: a massively multilingual multimodal evaluation dataset. In: EMNLP (2022)
  58. Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM 59, 64–73 (2016)
  59. Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A picture is worth more than 77 text tokens: evaluating CLIP-style models on dense captions (2023)
  60. Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR (2015)
  61. Wang, S., et al.: Imagen Editor and EditBench: advancing and evaluating text-guided image inpainting. In: CVPR (2023)
  62. Yarom, M.: What you see is what you read? Improving text-image alignment evaluation. In: NeurIPS (2023)
  63. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. In: TACL (2014)
  64. Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation (2022)
  65. Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)


Acknowledgement

First of all, we would like to express our gratitude to all members of the annotator team for their diligent and hard work on a very challenging and long-running task. We also give a huge thanks to Soravit Changpinyo and Radu Soricut for their thorough review and the constructive feedback provided on our paper. Many thanks also to Cristina Vasconcelos and Brian Gordon for their support with our experiments, and to Andrea Burns for insightful suggestions for the paper. Finally, Jason Baldridge is incredibly grateful to his family members who contributed by helping arrange scenes, taking pictures, and being patient while he took so many pictures – Cheryl, Olivia, Nash, Gray, and Esme Baldridge and Mary and Justin Reusch – and to pets Ivy, Tiger, DD and Yoshi for their roles as frequent subjects.

Author information


Corresponding author

Correspondence to Yasumasa Onoe.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3587 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Onoe, Y. et al. (2025). DOCCI: Descriptions of Connected and Contrasting Images. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15118. Springer, Cham. https://doi.org/10.1007/978-3-031-73027-6_17


  • DOI: https://doi.org/10.1007/978-3-031-73027-6_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73026-9

  • Online ISBN: 978-3-031-73027-6

  • eBook Packages: Computer Science, Computer Science (R0)
