Abstract
This paper describes a text normalization program for the Polish language. The program combines rule-based and statistical approaches to text normalization. The words modelled by the solution were represented in three domains: grammar features, word lemmas, and the words themselves. Each word in the lexicon was assigned a suitable element from each of these domains. Three n-gram models, operating on grammar classes, word lemmas, and individual words respectively, were then combined using weights adjusted by an evolution strategy to obtain the final solution. The tool can also produce grammar tags for words to aid in further language model creation.
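As a rough illustration of the weighted combination described above, the sketch below interpolates the probabilities of three models and tunes the weights with a simple (1+1) evolution strategy. The abstract does not specify the combination formula or the strategy variant, so the linear interpolation, the held-out log-likelihood objective, and all names (`interpolate`, `tune_weights`, `samples`) are assumptions made for illustration only, not the authors' implementation.

```python
import math
import random


def interpolate(p_word, p_lemma, p_tag, weights):
    """Linearly interpolate three model probabilities (assumed combination rule)."""
    w_word, w_lemma, w_tag = weights
    return w_word * p_word + w_lemma * p_lemma + w_tag * p_tag


def normalize(weights):
    """Clip weights to be positive and rescale them so they sum to 1."""
    clipped = [max(w, 1e-9) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def log_likelihood(samples, weights):
    """Score weights on held-out tokens.

    samples: list of (p_word, p_lemma, p_tag) tuples, each holding the
    probability that the three hypothetical n-gram models assign to the
    correct token. All probabilities are assumed to be greater than zero.
    """
    return sum(math.log(interpolate(pw, pl, pt, weights)) for pw, pl, pt in samples)


def tune_weights(samples, steps=500, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the weights, keep the child if not worse."""
    rng = random.Random(seed)
    best = [1 / 3, 1 / 3, 1 / 3]
    best_score = log_likelihood(samples, best)
    for _ in range(steps):
        candidate = normalize([w + rng.gauss(0, sigma) for w in best])
        score = log_likelihood(samples, candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    # Toy held-out data in which the word-level model is the most reliable.
    toy_samples = [(0.5, 0.1, 0.01)] * 10
    weights, score = tune_weights(toy_samples)
    print("weights:", weights)
    print("log-likelihood:", score)
```

Because the strategy only accepts candidates that score at least as well as the current best, the tuned weights never score worse than the uniform starting point on the data used for tuning.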
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brocki, Ł., Marasek, K., Koržinek, D. (2012). Multiple Model Text Normalization for the Polish Language. In: Chen, L., Felfernig, A., Liu, J., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2012. Lecture Notes in Computer Science, vol 7661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34624-8_17
DOI: https://doi.org/10.1007/978-3-642-34624-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34623-1
Online ISBN: 978-3-642-34624-8
eBook Packages: Computer Science, Computer Science (R0)