Abstract
Arabic datasets often contain two or more records for the same real-world entity (e.g., a person or an object). Institutions that hold such datasets suffer from low data quality and degraded performance when they have no mechanism for detecting these duplicates. The operation that identifies records referring to the same real-world entity is called Entity Resolution (ER). It is used to link records across databases and to match query records against existing databases in real time. Indexing is a major step in the ER process that aims to reduce the search space. Several indexing techniques are available for ER on English databases. However, these techniques have not been validated for other languages, such as Arabic. The Dynamic Similarity Aware Inverted Index (DySimII) is an indexing technique used with dynamic databases to match query records in real time, and it has been shown to work well with English. In this paper, we propose a framework, Arabic Real Time Entity Resolution (ARTER), that uses DySimII with Arabic databases to perform real-time ER. We also examine different string similarity functions for comparing records in the matching process, to evaluate which function is most suitable for comparing Arabic strings. A real-world Arabic database is used in our experimental evaluation, in which two stemmers and three similarity functions are applied to assess their effect on DySimII with an Arabic dataset. The results show that matching accuracy improves with the Asem stemmer as the number of corrupted attributes increases. Testing the three similarity functions shows that the Winkler similarity function provides better matching accuracy, while N-gram provides better results when used with the Asem stemmer.
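To make the idea of indexing plus string similarity concrete, the following is a minimal sketch and not the DySimII implementation evaluated in the paper. It assumes a bigram (N-gram with n = 2) Dice similarity and a hypothetical blocking key (the first character of an attribute value); the names `bigram_similarity` and `TinyInvertedIndex` are illustrative only.

```python
from collections import defaultdict


def bigrams(s):
    """Return the list of character bigrams of s (works on any Unicode string)."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_similarity(a, b):
    """Dice coefficient over bigram multisets, in the range [0, 1]."""
    if a == b:
        return 1.0
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    remaining = list(gb)
    common = 0
    for g in ga:
        if g in remaining:
            remaining.remove(g)  # count each bigram of b at most once
            common += 1
    return 2.0 * common / (len(ga) + len(gb))


class TinyInvertedIndex:
    """Maps each attribute value to the ids of the records that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # value -> record ids
        self.blocks = defaultdict(set)    # blocking key -> values

    @staticmethod
    def blocking_key(value):
        # Hypothetical blocking key (an assumption): the first character.
        return value[:1]

    def insert(self, record_id, value):
        self.postings[value].add(record_id)
        self.blocks[self.blocking_key(value)].add(value)

    def query(self, value, threshold=0.5):
        """Return (record_id, similarity) pairs from the query's block, best first."""
        results = []
        for candidate in self.blocks.get(self.blocking_key(value), ()):
            sim = bigram_similarity(value, candidate)
            if sim >= threshold:
                results.extend((rid, sim) for rid in self.postings[candidate])
        return sorted(results, key=lambda pair: (-pair[1], pair[0]))


index = TinyInvertedIndex()
index.insert(1, "محمد")
index.insert(2, "محمود")
index.insert(3, "احمد")
matches = index.query("محمد")  # only values sharing the first character are compared
```

Only values in the same block as the query are compared, which is how indexing reduces the search space. The sketch recomputes similarities at query time; a similarity-aware index would instead store precomputed similarities between values in the same block.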
Cite this article
Alian, M., Al-Naymat, G. & Ramadan, B. Arabic real time entity resolution using inverted indexing. Lang Resources & Evaluation 54, 921–941 (2020). https://doi.org/10.1007/s10579-020-09504-6
DOI: https://doi.org/10.1007/s10579-020-09504-6