Do Large Language Models Understand Conversational Implicature – A Case Study with a Chinese Sitcom

  • Conference paper
  • First Online:
Chinese Computational Linguistics (CCL 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14761)

Abstract

Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset targeting conversational implicature, sourced from dialogues in the Chinese sitcom My Own Swordsman. It includes 200 carefully handcrafted questions, each annotated with the Gricean maxims that have been violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on the multiple-choice questions. CausalLM follows with 78.5% accuracy. Other models, including GPT-3.5 and several open-source models, achieve lower accuracies ranging from 20% to 60% on the multiple-choice questions. Human raters were asked to rate the implicature explanations generated by the LLMs on reasonability, logic, and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4's, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find that LLMs' performance does not vary significantly by Gricean maxim, suggesting that LLMs do not seem to process implicatures derived from different maxims differently. Our data and code are available at https://github.com/sjtu-compling/llm-pragmatics.


Notes

  1. A test that diagnoses conversational implicature by semantically encoding the negation of the target meaning; if the result is consistent, the target meaning is likely an implicature.

  2. The distractors can be understood as “neutral” statements in the Natural Language Inference task [2].

  3. The four OpenAI models were evaluated on November 15, 2023.

  4. https://huggingface.co.

  5. https://huggingface.co/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16.

  6. We also evaluated Baichuan2-13B-Chat and InternLM-Chat-20B (in half precision) with the evaluation paradigm used for closed-source models; their accuracies are 43% and 62%, respectively.

References

  1. Bai, J., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  2. Bowman, S., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642 (2015)

  3. Brown, P., Levinson, S.C.: Politeness: Some Universals in Language Usage, vol. 4. Cambridge University Press (1987)

  4. Chen, Y., Li, Z., Liang, J., Xiao, Y., Liu, B., Chen, Y.: Can pre-trained language models understand Chinese humor? In: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. pp. 465–480 (2023)

  5. Cui, L., Wu, Y., Liu, S., Zhang, Y., Zhou, M.: MuTual: a dataset for multi-turn dialogue reasoning. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 1406–1416. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main.130, https://aclanthology.org/2020.acl-main.130

  6. Cui, Y., Yang, Z., Yao, X.: Efficient and effective text encoding for Chinese LLaMA and alpaca (2023)

  7. Engelhardt, P.E., Bailey, K.G., Ferreira, F.: Do speakers and listeners observe the Gricean maxim of quantity? J. Mem. Lang. 54(4), 554–573 (2006)

  8. Floyd, S., Gibson, E., Fedorenko, E.: Pragmega (2023). https://osf.io/dpge6

  9. Grice, H.P.: Logic and conversation. In: Speech Acts, pp. 41–58. Brill (1975)

  10. Grice, H.P.: Retrospective epilogue. In: Studies in the Way of Words, pp. 339–386 (1989)

  11. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: International Conference on Learning Representations (2020)

  12. Hessel, J., et al.: Do androids laugh at electric sheep? Humor “understanding” benchmarks from the New Yorker caption contest. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 688–714. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.41, https://aclanthology.org/2023.acl-long.41

  13. Hirschberg, J.: A Theory of Scalar Implicature. University of Pennsylvania (1985). https://books.google.com/books?id=bvEYAQAAIAAJ

  14. Hu, J., Floyd, S., Jouravlev, O., Fedorenko, E., Gibson, E.: A fine-grained comparison of pragmatic language understanding in humans and language models. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4194–4213. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.230, https://aclanthology.org/2023.acl-long.230

  15. Hu, J., Levy, R., Degen, J., Schuster, S.: Expectations over unspoken alternatives predict pragmatic inferences. arXiv preprint arXiv:2304.04758 (2023)

  16. Jentzsch, S., Kersting, K.: ChatGPT is fun, but it is not funny! humor is still challenging large language models. arXiv preprint arXiv:2306.04563 (2023)

  17. Kim, Z.M., Taylor, D.E., Kang, D.: Is the pope catholic? Applying chain-of-thought reasoning to understanding conversational implicatures. arXiv preprint arXiv:2305.13826 (2023)

  18. Kočiský, T., et al.: The NarrativeQA reading comprehension challenge. Trans. Assoc. Comput. Linguist. 6, 317–328 (2018). https://doi.org/10.1162/tacl_a_00023, https://aclanthology.org/Q18-1023

  19. Li, H., et al.: CMMLU: measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212 (2023)

  20. Li, H., Zhu, S.C., Zheng, Z.: DiPlomat: a dialogue dataset for situated pragmatic reasoning (2023)

  21. Lipkin, B., Wong, L., Grand, G., Tenenbaum, J.B.: Evaluating statistical language models as pragmatic reasoners. arXiv preprint arXiv:2305.01020 (2023)

  22. Muennighoff, N., et al.: Crosslingual generalization through multitask finetuning. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15991–16111. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.891, https://aclanthology.org/2023.acl-long.891

  23. Neidlein, A., Wiesenbach, P., Markert, K.: An analysis of language models for metaphor recognition. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 3722–3736 (2020)

  24. Okanda, M., Asada, K., Moriguchi, Y., Itakura, S.: Understanding violations of Gricean maxims in preschoolers and adults. Front. Psychol. 6, 901 (2015)

  25. Pandia, L., Cong, Y., Ettinger, A.: Pragmatic competence of pre-trained language models through the lens of discourse connectives. arXiv preprint arXiv:2109.12951 (2021)

  26. Panzeri, F., Foppolo, F.: Children’s and adults’ sensitivity to Gricean maxims and to the maximize presupposition principle. Front. Psychol. 12, 624628 (2021)

  27. Patro, B.N., Lunayach, M., Srivastava, D., Singh, H., Namboodiri, V.P., et al.: Multimodal humor dataset: predicting laughter tracks for sitcoms. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 576–585 (2021)

  28. Qiu, Z., Duan, X., Cai, Z.: Does ChatGPT resemble humans in processing implicatures? In: Proceedings of the 4th Natural Logic Meets Machine Learning Workshop, pp. 25–34 (2023)

  29. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  30. Reddy, S., Chen, D., Manning, C.D.: CoQA: a conversational question answering challenge. Trans. Assoc. Comput. Linguist. 7, 249–266 (2019)

  31. Rubio-Fernandez, P.: Overinformative speakers are cooperative: revisiting the Gricean maxim of quantity. Cogn. Sci. 43(11), e12797 (2019)

  32. Ruis, L., Khan, A., Biderman, S., Hooker, S., Rocktäschel, T., Grefenstette, E.: Large language models are not zero-shot communicators. arXiv preprint arXiv:2210.14986 (2022)

  33. Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: Social IQa: commonsense reasoning about social interactions. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463–4473. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1454, https://aclanthology.org/D19-1454

  34. Searle, J.R., Kiefer, F., Bierwisch, M. (eds.): Speech Act Theory and Pragmatics. Springer Netherlands, Dordrecht (1980). https://doi.org/10.1007/978-94-009-8964-1

  35. Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. In: Zong, C., Strube, M. (eds.) Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1577–1586. Association for Computational Linguistics, Beijing, China (2015). https://doi.org/10.3115/v1/P15-1152, https://aclanthology.org/P15-1152

  36. Sun, K., Yu, D., Chen, J., Yu, D., Choi, Y., Cardie, C.: DREAM: a challenge data set and models for dialogue-based reading comprehension. Trans. Assoc. Computat. Linguist. 7, 217–231 (2019). https://doi.org/10.1162/tacl_a_00264, https://aclanthology.org/Q19-1014

  37. Touvron, H., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  38. Wachowiak, L., Gromann, D.: Does GPT-3 grasp metaphors? Identifying metaphor mappings with generative language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1018–1032 (2023)

  39. Wang, X., Girdhar, R., Gupta, A.: Binge watching: scaling affordance learning from sitcoms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2596–2605 (2017)

  40. Wilson, D., Sperber, D.: Relevance theory. The Handbook of Pragmatics, pp. 606–632 (2006)

  41. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online (2020). https://www.aclweb.org/anthology/2020.emnlp-demos.6

  42. Wu, J., Lin, H., Yang, L., Xu, B.: MUMOR: a multimodal dataset for humor detection in conversations. In: Wang, L., Feng, Y., Hong, Yu., He, R. (eds.) NLPCC 2021. LNCS (LNAI), vol. 13028, pp. 619–627. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88480-2_49

  43. Wu, Y., Wu, W., Xing, C., Zhou, M., Li, Z.: Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In: Barzilay, R., Kan, M.Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 496–505. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/P17-1046, https://aclanthology.org/P17-1046

  44. Zhang, Z., Li, J., Zhu, P., Zhao, H., Liu, G.: Modeling multi-turn conversation with deep utterance aggregation. In: Bender, E.M., Derczynski, L., Isabelle, P. (eds.) Proceedings of the 27th International Conference on Computational Linguistics, pp. 3740–3752. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018). https://aclanthology.org/C18-1317

  45. Zheng, Z., Qiu, S., Fan, L., Zhu, Y., Zhu, S.C.: GRICE: a grammar-based dataset for recovering implicature and conversational reasoning. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2074–2085 (2021)


Acknowledgements

We thank Xinjia Qi, Qiyu Sun and Yaqian Zhang for verifying the implicatures and improving the dataset. We thank all participants for their support in this study. We also thank the anonymous reviewers for their valuable comments. This project is funded by the Shanghai Pujiang Program (22PJC063) awarded to Hai Hu.

Author information

Corresponding author

Correspondence to Hai Hu.

Ethics declarations

Limitations

Our dataset is sourced exclusively from the Chinese sitcom My Own Swordsman. Although we rigorously proofread our data, selecting dialogues whose interpretation depends only on local information and providing background knowledge when necessary, there may still be features specific to this sitcom, such as the characters' personalities, that play a role in determining the implicatures, which may limit the generalizability of our conclusions.

Appendices

A Evaluation Paradigms and Differences

Models’ performance may vary depending on how their answers are estimated and collected. Next-token prediction and free generation are the two paradigms used in Experiment 1 to evaluate open-source and closed-source models, respectively. Table 5 compares the open-source models’ performance on the multiple-choice questions when their answers are estimated through the two paradigms. The results show a decrease in accuracy for BLOOMZ (7.1B), CausalLM (13B), and OpenBuddy-Llama2 (13B), and a slight increase for Chinese-Alpaca-2 (13B), when the paradigm switches from next-token prediction to free generation. This aligns with the findings of [19]. Among the four models, CausalLM (13B) shows a dramatic drop in accuracy, from 78.5% to 31.5%, which corresponds to its poor performance in Experiment 2. We find that it fails to give a definite answer for over half of the questions, as shown in Fig. 5.

Table 5. Comparison between the accuracy of open-source models on multiple-choice questions when evaluated with next token prediction and free generation paradigms.
Fig. 5. Answer distribution of models when answers are collected with the free generation paradigm.
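The contrast between the two paradigms can be illustrated with a minimal sketch. This is not the authors' code; the function names and the label-scanning heuristic are assumptions for illustration only.

```python
# Minimal sketch (hypothetical, not the authors' implementation) of the
# two answer-collection paradigms compared in Table 5.

def pick_by_next_token(option_logprobs):
    """Next-token prediction: compare the model's log-probabilities for
    the option labels at the answer position and take the argmax, so a
    definite choice is always forced."""
    return max(option_logprobs, key=option_logprobs.get)

def pick_by_free_generation(generated_text):
    """Free generation: scan the model's free-form response for an
    option label. Returns None when no definite answer appears -- the
    failure mode observed for CausalLM (13B) in Fig. 5."""
    for ch in generated_text:
        if ch in "ABCD":
            return ch
    return None

# Toy illustration with made-up log-probabilities:
print(pick_by_next_token({"A": -2.1, "B": -0.3, "C": -4.0, "D": -3.2}))  # B
print(pick_by_free_generation("两种解读似乎都有道理。"))  # None
```

The key design difference is that next-token prediction always yields an extractable answer, while free generation can yield none, which helps explain why the same model's accuracy can differ sharply between the two paradigms in Table 5.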

B Hyperparameter Setting

The hyperparameters used for gathering responses from open-source models in Experiment 1 and Experiment 2 are shown in Tables 6 and 7.

Table 6. Parameter setting for open-source models in Experiment 1
Table 7. Parameter setting for open-source models in Experiment 2

C Average Answer Length for Different Question Types

We present the average answer length for the four maxims in Table 8.

Table 8. Average number of Chinese characters per answer, for all questions and for each question type.

D Prompt for Experiment 1

你现在是一个中文母语者。对于以下对话,请识别特定人物的话语中的言外之意,在给出的四个选项中选择一个你认为的正确答案。[En: You are now a native Chinese speaker. For the following dialogue, please identify the implied meaning in the specific character’s speech, and choose the one of the four given options that you think is the correct answer.]

< Dialogue >

< Four interpretations as choices >

(Close-source models:) 请在‘Response:’后写出你选择的答案。 [En: Please write your answer after ‘Response:’]

(Open-source models:) 答案: [En: Answer:]
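The prompt template above could be assembled as in the following sketch. This is a hypothetical reconstruction: the helper name `build_prompt` and the A–D option labels are assumptions, not taken from the paper.

```python
# Hypothetical sketch of assembling the Experiment 1 prompt.
# build_prompt and the A-D labels are illustrative assumptions.

INSTRUCTION = (
    "你现在是一个中文母语者。对于以下对话,请识别特定人物的话语中的言外之意,"
    "在给出的四个选项中选择一个你认为的正确答案。"
)

def build_prompt(dialogue, choices, closed_source):
    """Join the instruction, the dialogue, the four candidate
    interpretations, and the paradigm-specific answer cue."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    # Closed-source models are asked to write after 'Response:';
    # open-source models are cued with '答案:' for next-token scoring.
    cue = "请在‘Response:’后写出你选择的答案。" if closed_source else "答案:"
    return f"{INSTRUCTION}\n\n{dialogue}\n\n{options}\n\n{cue}"
```

Ending the open-source prompt with the bare cue 答案: ("Answer:") is what makes the next-token-prediction paradigm of Appendix A possible, since the very next token can be scored against the option labels.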


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yue, S., Song, S., Cheng, X., Hu, H. (2025). Do Large Language Models Understand Conversational Implicature – A Case Study with a Chinese Sitcom. In: Sun, M., et al. (eds.) Chinese Computational Linguistics. CCL 2024. Lecture Notes in Computer Science, vol. 14761. Springer, Singapore. https://doi.org/10.1007/978-981-97-8367-0_24

  • DOI: https://doi.org/10.1007/978-981-97-8367-0_24

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8366-3

  • Online ISBN: 978-981-97-8367-0
