Do Large Language Models Understand Conversational Implicature – A Case Study with a Chinese Sitcom

  • Conference paper
  • First Online:
Chinese Computational Linguistics (CCL 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14761)

Abstract

Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset targeting conversational implicature, sourced from dialogues in the Chinese sitcom My Own Swordsman. It includes 200 carefully handcrafted questions, each annotated with the Gricean maxims that have been violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on the multiple-choice questions. CausalLM follows with 78.5% accuracy. Other models, including GPT-3.5 and several open-source models, achieve lower accuracies ranging from 20% to 60% on the multiple-choice questions. Human raters were asked to rate the implicature explanations generated by the LLMs on reasonability, logic, and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4's, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find that LLMs' performance does not vary significantly by Gricean maxim, suggesting that LLMs do not seem to process implicatures derived from different maxims differently. Our data and code are available at https://github.com/sjtu-compling/llm-pragmatics.


Notes

  1. A test that diagnoses conversational implicature by semantically encoding the negation of the target meaning; if the result is consistent, the target meaning is likely an implicature.

  2. The distractors can be understood as “neutral” statements in the Natural Language Inference task [2].

  3. The four OpenAI models were evaluated on November 15, 2023.

  4. https://huggingface.co.

  5. https://huggingface.co/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16.

  6. We also evaluated Baichuan2-13B-Chat and InternLM-Chat-20B (in half precision) with the evaluation paradigm used for closed-source models; their accuracies are 43% and 62%, respectively.

References

  1. Bai, J., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  2. Bowman, S., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642 (2015)

  3. Brown, P., Levinson, S.C.: Politeness: Some Universals in Language Usage, vol. 4. Cambridge University Press (1987)

  4. Chen, Y., Li, Z., Liang, J., Xiao, Y., Liu, B., Chen, Y.: Can pre-trained language models understand Chinese humor? In: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. pp. 465–480 (2023)

  5. Cui, L., Wu, Y., Liu, S., Zhang, Y., Zhou, M.: MuTual: a dataset for multi-turn dialogue reasoning. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 1406–1416. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main.130, https://aclanthology.org/2020.acl-main.130

  6. Cui, Y., Yang, Z., Yao, X.: Efficient and effective text encoding for Chinese LLaMA and alpaca (2023)

  7. Engelhardt, P.E., Bailey, K.G., Ferreira, F.: Do speakers and listeners observe the Gricean maxim of quantity? J. Mem. Lang. 54(4), 554–573 (2006)

  8. Floyd, S., Gibson, E., Fedorenko, E.: Pragmega (2023). https://osf.io/dpge6

  9. Grice, H.P.: Logic and conversation. In: Speech Acts, pp. 41–58. Brill (1975)

  10. Grice, H.P.: Retrospective epilogue. In: Studies in the Way of Words, pp. 339–386 (1989)

  11. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: International Conference on Learning Representations (2020)

  12. Hessel, J., et al.: Do androids laugh at electric sheep? Humor “understanding” benchmarks from the New Yorker caption contest. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 688–714. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.41, https://aclanthology.org/2023.acl-long.41

  13. Hirschberg, J.: A Theory of Scalar Implicature. University of Pennsylvania (1985). https://books.google.com/books?id=bvEYAQAAIAAJ

  14. Hu, J., Floyd, S., Jouravlev, O., Fedorenko, E., Gibson, E.: A fine-grained comparison of pragmatic language understanding in humans and language models. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4194–4213. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.230, https://aclanthology.org/2023.acl-long.230

  15. Hu, J., Levy, R., Degen, J., Schuster, S.: Expectations over unspoken alternatives predict pragmatic inferences. arXiv preprint arXiv:2304.04758 (2023)

  16. Jentzsch, S., Kersting, K.: ChatGPT is fun, but it is not funny! humor is still challenging large language models. arXiv preprint arXiv:2306.04563 (2023)

  17. Kim, Z.M., Taylor, D.E., Kang, D.: Is the pope catholic? Applying chain-of-thought reasoning to understanding conversational implicatures. arXiv preprint arXiv:2305.13826 (2023)

  18. Kočiský, T., et al.: The NarrativeQA reading comprehension challenge. Trans. Assoc. Comput. Linguist. 6, 317–328 (2018). https://doi.org/10.1162/tacl_a_00023, https://aclanthology.org/Q18-1023

  19. Li, H., et al.: CMMLU: measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212 (2023)

  20. Li, H., Zhu, S.C., Zheng, Z.: DiPlomat: a dialogue dataset for situated pragmatic reasoning (2023)

  21. Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners. arXiv preprint arXiv:2305.01020 (2023)

  22. Muennighoff, N., et al.: Crosslingual generalization through multitask finetuning. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15991–16111. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.891, https://aclanthology.org/2023.acl-long.891

  23. Neidlein, A., Wiesenbach, P., Markert, K.: An analysis of language models for metaphor recognition. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 3722–3736 (2020)

  24. Okanda, M., Asada, K., Moriguchi, Y., Itakura, S.: Understanding violations of Gricean maxims in preschoolers and adults. Front. Psychol. 6, 901 (2015)

  25. Pandia, L., Cong, Y., Ettinger, A.: Pragmatic competence of pre-trained language models through the lens of discourse connectives. arXiv preprint arXiv:2109.12951 (2021)

  26. Panzeri, F., Foppolo, F.: Children’s and adults’ sensitivity to Gricean maxims and to the maximize presupposition principle. Front. Psychol. 12, 624628 (2021)

  27. Patro, B.N., Lunayach, M., Srivastava, D., Singh, H., Namboodiri, V.P., et al.: Multimodal humor dataset: predicting laughter tracks for sitcoms. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 576–585 (2021)

  28. Qiu, Z., Duan, X., Cai, Z.: Does ChatGPT resemble humans in processing implicatures? In: Proceedings of the 4th Natural Logic Meets Machine Learning Workshop, pp. 25–34 (2023)

  29. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  30. Reddy, S., Chen, D., Manning, C.D.: CoQA: a conversational question answering challenge. Trans. Assoc. Comput. Linguist. 7, 249–266 (2019)

  31. Rubio-Fernandez, P.: Overinformative speakers are cooperative: revisiting the Gricean maxim of quantity. Cogn. Sci. 43(11), e12797 (2019)

  32. Ruis, L., Khan, A., Biderman, S., Hooker, S., Rocktäschel, T., Grefenstette, E.: Large language models are not zero-shot communicators. arXiv preprint arXiv:2210.14986 (2022)

  33. Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: Social IQa: commonsense reasoning about social interactions. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463–4473. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1454, https://aclanthology.org/D19-1454

  34. Searle, J.R., Kiefer, F., Bierwisch, M. (eds.): Speech Act Theory and Pragmatics. Springer Netherlands, Dordrecht (1980). https://doi.org/10.1007/978-94-009-8964-1

  35. Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. In: Zong, C., Strube, M. (eds.) Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1577–1586. Association for Computational Linguistics, Beijing, China (2015). https://doi.org/10.3115/v1/P15-1152, https://aclanthology.org/P15-1152

  36. Sun, K., Yu, D., Chen, J., Yu, D., Choi, Y., Cardie, C.: DREAM: a challenge data set and models for dialogue-based reading comprehension. Trans. Assoc. Computat. Linguist. 7, 217–231 (2019). https://doi.org/10.1162/tacl_a_00264, https://aclanthology.org/Q19-1014

  37. Touvron, H., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  38. Wachowiak, L., Gromann, D.: Does GPT-3 grasp metaphors? Identifying metaphor mappings with generative language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1018–1032 (2023)

  39. Wang, X., Girdhar, R., Gupta, A.: Binge watching: scaling affordance learning from sitcoms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2596–2605 (2017)

  40. Wilson, D., Sperber, D.: Relevance theory. The Handbook of Pragmatics, pp. 606–632 (2006)

  41. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online (2020). https://www.aclweb.org/anthology/2020.emnlp-demos.6

  42. Wu, J., Lin, H., Yang, L., Xu, B.: MUMOR: a multimodal dataset for humor detection in conversations. In: Wang, L., Feng, Y., Hong, Yu., He, R. (eds.) NLPCC 2021. LNCS (LNAI), vol. 13028, pp. 619–627. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88480-2_49

  43. Wu, Y., Wu, W., Xing, C., Zhou, M., Li, Z.: Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In: Barzilay, R., Kan, M.Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 496–505. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/P17-1046, https://aclanthology.org/P17-1046

  44. Zhang, Z., Li, J., Zhu, P., Zhao, H., Liu, G.: Modeling multi-turn conversation with deep utterance aggregation. In: Bender, E.M., Derczynski, L., Isabelle, P. (eds.) Proceedings of the 27th International Conference on Computational Linguistics, pp. 3740–3752. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018). https://aclanthology.org/C18-1317

  45. Zheng, Z., Qiu, S., Fan, L., Zhu, Y., Zhu, S.C.: GRICE: a grammar-based dataset for recovering implicature and conversational reasoning. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2074–2085 (2021)


Acknowledgements

We thank Xinjia Qi, Qiyu Sun and Yaqian Zhang for verifying the implicatures and improving the dataset. We thank all participants for their support in this study. We also thank the anonymous reviewers for their valuable comments. This project is funded by the Shanghai Pujiang Program (22PJC063) awarded to Hai Hu.

Author information

Corresponding author

Correspondence to Hai Hu.

Ethics declarations

Limitations

Our dataset is sourced exclusively from the Chinese sitcom My Own Swordsman. Although we rigorously proofread our data, selecting dialogues whose interpretation depends only on local information and providing background knowledge when necessary, there may still be features specific to this sitcom, such as the characters' personalities, that play a role in determining the implicatures, which may limit the generalizability of our conclusions.

Appendices

A Evaluation Paradigms and Differences

Models’ performance may vary depending on how their answers are estimated and collected. Next-token prediction and free generation are the two paradigms used in Experiment 1 to evaluate open-source and closed-source models, respectively. Table 5 compares the open-source models’ performance on the multiple-choice questions when their answers are estimated through the two paradigms. The results show a decrease in accuracy for BLOOMZ (7.1B), CausalLM (13B), and OpenBuddy-Llama2 (13B), and a slight increase for Chinese-Alpaca-2 (13B), when the paradigm switches from next-token prediction to free generation. This aligns with the findings of [19]. Among the four models, CausalLM (13B) shows a dramatic drop in accuracy, from 78.5% to 31.5%, which corresponds to its poor performance in Experiment 2. We find that it fails to give a definite answer for over half of the questions, as shown in Fig. 5.

Table 5. Comparison between the accuracy of open-source models on multiple-choice questions when evaluated with next token prediction and free generation paradigms.
Fig. 5. Answer distribution of models when answers are collected with the free generation paradigm.
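The contrast between the two paradigms can be illustrated with a minimal sketch. This is not the authors' code; the function names and the label-scanning heuristic are assumptions for illustration only.

```python
# Minimal sketch (hypothetical, not the authors' implementation) of the
# two answer-collection paradigms compared in Table 5.

def pick_by_next_token(option_logprobs):
    """Next-token prediction: compare the model's log-probabilities for
    the option labels at the answer position and take the argmax, so a
    definite choice is always forced."""
    return max(option_logprobs, key=option_logprobs.get)

def pick_by_free_generation(generated_text):
    """Free generation: scan the model's free-form response for an
    option label. Returns None when no definite answer appears -- the
    failure mode observed for CausalLM (13B) in Fig. 5."""
    for ch in generated_text:
        if ch in "ABCD":
            return ch
    return None

# Toy illustration with made-up log-probabilities:
print(pick_by_next_token({"A": -2.1, "B": -0.3, "C": -4.0, "D": -3.2}))  # B
print(pick_by_free_generation("两种解读似乎都有道理。"))  # None
```

The key design difference is that next-token prediction always yields an extractable answer, while free generation can yield none, which helps explain why the same model's accuracy can differ sharply between the two paradigms in Table 5.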

B Hyperparameter Setting

The hyperparameters used for gathering responses from open-source models in Experiment 1 and Experiment 2 are shown in Tables 6 and 7.

Table 6. Parameter setting for open-source models in Experiment 1
Table 7. Parameter setting for open-source models in Experiment 2

C Average Answer Length for Different Question Types

We present the average answer length for the four maxims in Table 8.

Table 8. Average number of Chinese characters per answer, for all questions and for each question type.

D Prompt for Experiment 1

你现在是一个中文母语者。对于以下对话,请识别特定人物的话语中的言外之意,在给出的四个选项中选择一个你认为的正确答案。[En: You are now a native Chinese speaker. For the following dialogue, please identify the implied meaning in the specific character’s speech, and choose the one of the four given options that you think is the correct answer.]

< Dialogue >

< Four interpretations as choices >

(Close-source models:) 请在‘Response:’后写出你选择的答案。 [En: Please write your answer after ‘Response:’]

(Open-source models:) 答案: [En: Answer:]
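The prompt template above could be assembled as in the following sketch. This is a hypothetical reconstruction: the helper name `build_prompt` and the A–D option labels are assumptions, not taken from the paper.

```python
# Hypothetical sketch of assembling the Experiment 1 prompt.
# build_prompt and the A-D labels are illustrative assumptions.

INSTRUCTION = (
    "你现在是一个中文母语者。对于以下对话,请识别特定人物的话语中的言外之意,"
    "在给出的四个选项中选择一个你认为的正确答案。"
)

def build_prompt(dialogue, choices, closed_source):
    """Join the instruction, the dialogue, the four candidate
    interpretations, and the paradigm-specific answer cue."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    # Closed-source models are asked to write after 'Response:';
    # open-source models are cued with '答案:' for next-token scoring.
    cue = "请在‘Response:’后写出你选择的答案。" if closed_source else "答案:"
    return f"{INSTRUCTION}\n\n{dialogue}\n\n{options}\n\n{cue}"
```

Ending the open-source prompt with the bare cue 答案: ("Answer:") is what makes the next-token-prediction paradigm of Appendix A possible, since the very next token can be scored against the option labels.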


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yue, S., Song, S., Cheng, X., Hu, H. (2025). Do Large Language Models Understand Conversational Implicature – A Case Study with a Chinese Sitcom. In: Sun, M., et al. (eds.) Chinese Computational Linguistics. CCL 2024. Lecture Notes in Computer Science, vol. 14761. Springer, Singapore. https://doi.org/10.1007/978-981-97-8367-0_24

  • DOI: https://doi.org/10.1007/978-981-97-8367-0_24

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8366-3

  • Online ISBN: 978-981-97-8367-0
