Abstract
Keyphrases capture the main content of a free-text document. Automatic keyphrase extraction (AKPE) plays a significant role in retrieving and summarizing valuable information from documents in different domains. Various techniques have been proposed for this task. However, supervised AKPE requires large amounts of annotated data and depends on the domain on which it is tested. An alternative is a domain-independent method that can be applied to several domains (such as medical and social). In this paper, we tackle keyphrase extraction from single documents with HAKE, a novel unsupervised method that mines linguistic, statistical, structural, and semantic text features simultaneously to select the most relevant keyphrases in a text. HAKE achieves higher F-scores than the unsupervised state-of-the-art systems on standard datasets and is suitable for real-time processing of large amounts of Web and text data across different domains. With HAKE, we also explicitly increase coverage and diversity among the selected keyphrases by introducing a novel technique (based on a parse tree approach, part-of-speech tagging, and filtering) for candidate keyphrase identification and extraction. This technique allows us to generate a comprehensive and meaningful list of candidate keyphrases and to reduce the size of the candidate set without increasing the computational complexity. HAKE’s effectiveness is compared with twelve state-of-the-art and recent unsupervised approaches, as well as with some supervised approaches. Experimental analysis on five of the top available benchmark corpora from different domains validates the proposed method and shows that HAKE significantly outperforms both the existing unsupervised and supervised methods. Our method does not require training on a particular set of documents, nor does it depend on external corpora, dictionaries, domain, or text size.
Our experiments confirm that HAKE’s candidate selection model and its ranking model are effective.
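The F-score comparison mentioned above can be illustrated with a minimal sketch. The abstract does not state the evaluation protocol, so the exact-match comparison and the normalization (lowercasing and whitespace collapsing) are assumptions made for illustration, and the function name `keyphrase_f_score` is hypothetical.

```python
def keyphrase_f_score(predicted, gold):
    """Exact-match precision, recall, and F1 between two keyphrase lists."""

    # Assumption: phrases are compared after lowercasing and collapsing
    # whitespace; stemming or partial matching is not applied here.
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0

    hits = len(pred & ref)            # phrases found in both sets
    precision = hits / len(pred)      # share of extracted phrases that are correct
    recall = hits / len(ref)          # share of gold phrases that were extracted
    if hits == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative phrases only, not taken from any benchmark corpus.
extracted = ["Keyphrase extraction", "parse tree", "web data", "neural network"]
reference = ["keyphrase extraction", "parse tree", "unsupervised method"]
precision, recall, f1 = keyphrase_f_score(extracted, reference)
```

Because the score is the harmonic mean of precision and recall, an extractor cannot raise it simply by returning more candidates: extra incorrect phrases lower precision even when recall improves.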
Ethics declarations
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Conflict of Interest
The authors declare no competing interests.
About this article
Cite this article
Merrouni, Z.A., Frikh, B. & Ouhbi, B. HAKE: an Unsupervised Approach to Automatic Keyphrase Extraction for Multiple Domains. Cogn Comput 14, 852–874 (2022). https://doi.org/10.1007/s12559-021-09979-7
DOI: https://doi.org/10.1007/s12559-021-09979-7