ISCA Archive Interspeech 2022

Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish

Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide

Lexicon optimisation strategies that address the problem of dialect divergence are tested in an ASR system for Irish. Like many endangered languages, Irish has no spoken standard but rather three very different dialects: Ulster (Ul), Connaught (Co) and Munster (Mu). Furthermore, the complex sound system and the ancient, opaque writing system result in sound-to-grapheme mappings that differ considerably across dialects. A hybrid ASR system was trained on (predominantly) native-speaker speech data, balanced across the dialects. Experiment 1 tested whether a Global lexicon, which captures dialect variant forms with relatively abstract representations, can perform as well as a Multi-dialect lexicon containing all dialect variants. Three dialect-specific lexicons were also included in the tests. The Global lexicon yielded the best performance, and Experiment 2 tested whether further reductions to its phoneset might enhance its performance further. These reductions were (i) merging a Tense-Lax contrast among coronal sonorants, which is not common to all dialects, and (ii) merging the voiceless-voiced contrast among sonorants, as the voiceless member is relatively infrequent. Results showed only a slight improvement, and only for the Mu dialect, which is the one most closely aligned with the phoneset reduction.
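
The kind of phoneset reduction tested in Experiment 2 can be pictured as a symbol-merging pass over a pronunciation lexicon. The following is a minimal sketch only, not the authors' implementation: the phone symbols (`L_tense`, `n_vl`, and so on), the placeholder words and the `reduce_phoneset` helper are all hypothetical, chosen purely to illustrate merging a Tense-Lax contrast and a voiceless-voiced contrast among sonorants.

```python
from typing import Dict, List, Set, Tuple

Pron = Tuple[str, ...]            # one pronunciation = a sequence of phone symbols
Lexicon = Dict[str, List[Pron]]   # word -> list of pronunciation variants

# Hypothetical symbols: these are NOT the phoneset used in the paper.
MERGE_MAP: Dict[str, str] = {
    "L_tense": "l",   # (i) tense coronal sonorants merged with their lax counterparts
    "N_tense": "n",
    "l_vl": "l",      # (ii) voiceless sonorants merged with their voiced counterparts
    "n_vl": "n",
    "r_vl": "r",
}


def reduce_phoneset(lexicon: Lexicon, merge_map: Dict[str, str]) -> Lexicon:
    """Rewrite every pronunciation with merged symbols and drop duplicates."""
    reduced: Lexicon = {}
    for word, prons in lexicon.items():
        unique: List[Pron] = []
        for pron in prons:
            mapped = tuple(merge_map.get(phone, phone) for phone in pron)
            # Two variants that differed only in a merged contrast now coincide.
            if mapped not in unique:
                unique.append(mapped)
        reduced[word] = unique
    return reduced


def phoneset(lexicon: Lexicon) -> Set[str]:
    """Collect the set of phone symbols actually used in a lexicon."""
    return {phone for prons in lexicon.values() for pron in prons for phone in pron}


if __name__ == "__main__":
    # Placeholder entries, not real Irish words or transcriptions.
    toy: Lexicon = {
        "word_a": [("L_tense", "a"), ("l", "a")],
        "word_b": [("n_vl", "i")],
    }
    reduced = reduce_phoneset(toy, MERGE_MAP)
    print(reduced)
    print(sorted(phoneset(reduced)))
```

In this sketch, merging shrinks the phoneset and can also collapse pronunciation variants that differed only in a merged contrast, which is one reason such a reduction would suit some dialects better than others.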