Abstract
An email signature is considered essential for effective business email communication. Despite the growth of social media, it remains a powerful tool that serves as a business card in the online world, presenting business information, including name, contact number and address, to recipients. Signatures vary widely in structure and content, which makes extracting them automatically a considerable challenge. In this paper we present a hybrid approach to automatic signature extraction. The first step is to obtain the original, most recently sent message from the entire email thread, cleaned of all disclaimers and superfluous lines, so that the signature is at the bottom of the email. We then apply a Support Vector Machine (SVM), a Machine Learning (ML) technique, to classify emails according to whether they contain a signature. To improve the obtained results, we apply a set of Information Extraction (IE) rules. We trained and tested our technique on a wide range of data: the Forge dataset, Enron together with our own collection of emails, and a large set of emails provided by our native English-speaking friends. We extracted signatures with a precision of 99.62% and a recall of 93.20%.
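As an illustration only, the three stages named in the abstract (isolating the latest message, classifying it, then applying rules) might be sketched in Python as follows. This is not the authors' implementation: the thread markers, contact cues and sign-off words are hypothetical, and `naive_has_signature` is a simple stand-in for the trained SVM classifier, which the abstract does not specify in detail.

```python
import re
from typing import Callable, List, Optional

# Hypothetical markers that start quoted or forwarded material in a thread.
THREAD_MARKERS = re.compile(r"^(-----Original Message-----|On .+ wrote:|From: .+)$")

# Hypothetical cues that a line carries contact details.
CONTACT_CUES = re.compile(
    r"(\+?\d[\d\s().-]{6,}\d"       # phone-like run of digits
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"    # email address
    r"|https?://\S+|www\.\S+)",     # web address
    re.IGNORECASE,
)

# Hypothetical first words of a closing line.
SIGN_OFFS = ("regards", "best", "kind", "thanks", "sincerely", "cheers")


def latest_message(body: str) -> List[str]:
    """Keep the most recent message: stop at the first thread marker
    and drop lines quoted with '>'."""
    kept: List[str] = []
    for line in body.splitlines():
        stripped = line.strip()
        if THREAD_MARKERS.match(stripped):
            break
        if stripped.startswith(">"):
            continue
        kept.append(stripped)
    # Trim trailing blank lines so a signature, if any, ends the message.
    while kept and not kept[-1]:
        kept.pop()
    return kept


def naive_has_signature(lines: List[str]) -> bool:
    """Stand-in for the SVM: true if any of the last six lines has a contact cue."""
    return any(CONTACT_CUES.search(line) for line in lines[-6:])


def is_sign_off(line: str) -> bool:
    """A short line whose first word is a known closing word."""
    words = line.lower().rstrip(",.!").split()
    return 0 < len(words) <= 3 and words[0] in SIGN_OFFS


def extract_signature(
    body: str,
    has_signature: Callable[[List[str]], bool] = naive_has_signature,
) -> Optional[str]:
    """Return the signature block of the latest message, or None."""
    lines = latest_message(body)
    if not lines or not has_signature(lines):
        return None
    # Rule step: look for a sign-off within the last ten lines, bottom up.
    start = None
    for i in range(len(lines) - 1, max(len(lines) - 11, -1), -1):
        if is_sign_off(lines[i]):
            start = i
            break
    if start is None:
        # Fall back to the block after the last blank line.
        blanks = [i for i, line in enumerate(lines) if not line]
        start = blanks[-1] + 1 if blanks else 0
    return "\n".join(lines[start:])
```

Passing the classifier in as a callable keeps the rule step independent of the model, so a trained classifier could replace the stand-in without changing the rest of the sketch.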
Acknowledgements
The work presented has been supported by the Ministry of Science and Technological Development, Republic of Serbia, through Projects No. 174021 and No. III47003.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Graovac, J., Tomašević, I., Pavlović-Lažetić, G. (2022). Meet Your Email Sender - Hybrid Approach to Email Signature Extraction. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_44
DOI: https://doi.org/10.1007/978-3-031-21967-2_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21966-5
Online ISBN: 978-3-031-21967-2
eBook Packages: Computer Science, Computer Science (R0)