Abstract
We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection. The approach first trains word embeddings on the target document to capture semantic regularities among its words. It then uses the minimum covariance determinant (MCD) estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, which indicates their irrelevance to the semantics expressed by the dimensions of the learned vector representation. Candidate keyphrases consist only of words detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state-of-the-art and recent unsupervised keyphrase extraction methods.
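The outlier-filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes toy 2-D points standing in for word vectors, uses a single-start concentration step (the core iteration of fast minimum covariance determinant algorithms) instead of a full estimator, and flags outliers with a chi-square cutoff that is our assumption. The function names `mean_cov`, `sq_mahalanobis`, and `mcd_outliers` are hypothetical.

```python
import random
import statistics


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def sq_mahalanobis(p, mean, cov):
    """Squared Mahalanobis distance of p, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def mcd_outliers(points, support_fraction=0.75, threshold=9.21, max_steps=50):
    """Return indices of points far from the dominant distribution.

    Assumptions (not from the source): the subset size h is a fixed fraction
    of the data, and the cutoff 9.21 is roughly the 0.99 chi-square quantile
    for 2 degrees of freedom. No consistency correction is applied, so the
    covariance of the h-subset understates the spread of the bulk.
    """
    h = int(support_fraction * len(points))
    indices = range(len(points))
    # Start from the h points closest to the coordinate-wise median.
    med = (statistics.median(x for x, _ in points),
           statistics.median(y for _, y in points))
    subset = sorted(
        indices,
        key=lambda i: (points[i][0] - med[0]) ** 2 + (points[i][1] - med[1]) ** 2,
    )[:h]
    # Concentration steps: refit on the h points with the smallest distances.
    for _ in range(max_steps):
        mean, cov = mean_cov([points[i] for i in subset])
        ranked = sorted(indices, key=lambda i: sq_mahalanobis(points[i], mean, cov))
        new_subset = ranked[:h]
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    mean, cov = mean_cov([points[i] for i in subset])
    return [i for i, p in enumerate(points) if sq_mahalanobis(p, mean, cov) > threshold]


# Toy data: 40 points from one "dominant" distribution plus 3 distant points.
rng = random.Random(0)
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(40)]
points += [(8.0, 8.0), (-9.0, 7.0), (10.0, -8.0)]
flagged = mcd_outliers(points)
print(flagged)
```

In this sketch the three distant points play the role of candidate keyphrase words, while the bulk plays the role of non-keyphrase words. With real embeddings the vectors have many more dimensions, so a library estimator would be the practical choice.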
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Papagiannopoulou, E., Tsoumakas, G. (2023). Unsupervised Keyphrase Extraction from Scientific Publications. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_16
DOI: https://doi.org/10.1007/978-3-031-24337-0_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science (R0)