Comprehensive Punctuation Restoration for English and Polish

Michał Pogoda; Tomasz Walkowiak

doi:10.18653/v1/2021.findings-emnlp.393

Abstract

Punctuation restoration is a fundamental requirement for the readability of text derived from Automatic Speech Recognition (ASR) systems. Most contemporary solutions are limited to predicting only a few of the most frequently occurring marks, such as periods, commas, and question marks, and to only one mark per word. However, written language uses a much larger set of punctuation characters (such as parentheses and hyphens) as well as their combinations (such as a parenthesis followed by a period). Such comprehensive punctuation cannot always be unambiguously reduced to a basic set of the most frequently occurring marks. In this work, we evaluate several methods on the comprehensive punctuation reconstruction task. We conduct experiments on parallel corpora in two languages, English and Polish, which have relatively simple and relatively complex morphology, respectively. We also investigate how training a model on comprehensive punctuation influences quality on the basic punctuation restoration task.
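To make the difference between basic and comprehensive labels concrete, the following is a minimal sketch and not the authors' method. It assumes a labelling scheme in which each word carries the full string of punctuation attached before and after it. The function names `to_labels` and `reduce_basic`, and the choice of `string.punctuation` as the mark inventory, are illustrative assumptions.

```python
import string

# Assumption: ASCII punctuation is the mark inventory for this illustration.
PUNCT = set(string.punctuation)
BASIC = {".", ",", "?"}  # the "basic" set named in the abstract


def to_labels(text):
    """Return (word, leading_marks, trailing_marks) for each word in text."""
    labels = []
    for token in text.split():
        start = 0
        while start < len(token) and token[start] in PUNCT:
            start += 1
        end = len(token)
        while end > start and token[end - 1] in PUNCT:
            end -= 1
        word = token[start:end].lower()
        if not word:
            # Token is punctuation only (e.g. a free-standing hyphen):
            # attach it to the previous word's trailing marks, if any.
            if labels:
                prev_word, lead, trail = labels[-1]
                labels[-1] = (prev_word, lead, trail + token)
            continue
        labels.append((word, token[:start], token[end:]))
    return labels


def reduce_basic(trailing):
    """Collapse a trailing-mark string to at most one basic mark."""
    for mark in trailing:
        if mark in BASIC:
            return mark
    return ""


if __name__ == "__main__":
    for word, lead, trail in to_labels("He left (quietly)."):
        print(word, repr(lead), repr(trail), "->", repr(reduce_basic(trail)))
```

In this sketch the word "quietly" carries the comprehensive label `"("` before and `")."` after, while the basic reduction keeps only `"."`, so the parentheses cannot be recovered from the basic label alone.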

Anthology ID: 2021.findings-emnlp.393
Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: Findings
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 4610–4619
URL: https://aclanthology.org/2021.findings-emnlp.393
DOI: 10.18653/v1/2021.findings-emnlp.393
Cite (ACL): Michał Pogoda and Tomasz Walkowiak. 2021. Comprehensive Punctuation Restoration for English and Polish. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4610–4619, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Comprehensive Punctuation Restoration for English and Polish (Pogoda & Walkowiak, Findings 2021)
PDF: https://aclanthology.org/2021.findings-emnlp.393.pdf
Software: 2021.findings-emnlp.393.Software.zip
Video: https://aclanthology.org/2021.findings-emnlp.393.mp4
