Abstract
Automatic recognition of offline Arabic text remains challenging because of the nature of Arabic script. Research interest in this area has grown recently, and a variety of methods have been applied. This paper presents a comparative study of four OCR (Optical Character Recognition) post-processing error correction techniques. We evaluate their impact using two recognition approaches: a lexicon-driven approach, with and without OOV (Out Of Vocabulary) words, and a lexicon-free approach. An AOCR (Arabic Optical Character Recognition) system was developed for this purpose, based on a segmentation-free HMM (Hidden Markov Model) approach. A sliding window moves across the line image from right to left to extract histogram of oriented gradients (HOG) features. Experiments carried out on the KAFD database under different scenarios show a significant improvement in the OCR error correction rate.
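The right-to-left sliding-window feature extraction mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window width, the step, and the number of orientation bins are all hypothetical, and the descriptor is a simplified orientation histogram without the cell and block normalization of full HOG.

```python
import math

def hog_window(window, n_bins=8):
    """Simplified orientation histogram for one window (a list of pixel rows).

    Assumption: central-difference gradients, unsigned orientations in [0, pi),
    magnitude-weighted votes, L1 normalization. Full HOG adds cells and blocks.
    """
    hist = [0.0] * n_bins
    height, width = len(window), len(window[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = window[y][x + 1] - window[y][x - 1]
            gy = window[y + 1][x] - window[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi
            # Clamp guards against rounding that lands exactly on pi.
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            hist[b] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist

def sliding_windows_rtl(image, width, step):
    """Yield (start_column, window) pairs, moving from the right edge leftwards."""
    n_cols = len(image[0])
    start = n_cols - width
    while start >= 0:
        yield start, [row[start:start + width] for row in image]
        start -= step

def extract_features(image, width=4, step=2, n_bins=8):
    """Return one feature vector per window, ordered right to left."""
    return [hog_window(win, n_bins)
            for _, win in sliding_windows_rtl(image, width, step)]
```

The resulting sequence of vectors, ordered in the Arabic reading direction, is the kind of observation sequence an HMM recognizer would consume.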
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Jemni, S.K., Kesentini, Y., Kanoun, S. (2017). Benchmarking Post-processing Techniques for Offline Arabic Text Recognition System. In: Abraham, A., Haqiq, A., Alimi, A., Mezzour, G., Rokbani, N., Muda, A. (eds) Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing, vol 552. Springer, Cham. https://doi.org/10.1007/978-3-319-52941-7_27
DOI: https://doi.org/10.1007/978-3-319-52941-7_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52940-0
Online ISBN: 978-3-319-52941-7
eBook Packages: Engineering, Engineering (R0)