Abstract
Probabilistic Latent Semantic Indexing (PLSI) is a statistical technique for automatic document indexing. A novel method is proposed for updating PLSI when new documents arrive. The proposed method incrementally adds the words of each new document to the term-document matrix and derives updating equations for the probability of terms given the class (i.e., latent) variables and the probability of documents given the latent variables. The performance of the proposed method is compared to that of the folding-in algorithm, which is an inexpensive but potentially inaccurate updating method. It is demonstrated that the proposed updating algorithm outperforms the folding-in method, where performance is measured as the mean squared error between the probabilities estimated by each updating method and those estimated by the original non-adaptive PLSI algorithm.
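As a rough illustration of the folding-in baseline and the error measure mentioned in the abstract, the sketch below follows the commonly described form of folding-in: the term distributions P(w | z) from the trained model are held fixed, and only the topic mixture of the new document is estimated by EM. It is not the updating scheme proposed in the paper, whose equations are not given here. The function names, the uniform initialisation, the fixed iteration count, and the P(z | d) parameterisation are all assumptions made for the example.

```python
def fold_in(counts, p_w_given_z, n_iter=50):
    """Estimate P(z | d_new) for one new document, keeping P(w | z) fixed.

    counts      -- term counts n(d_new, w), one entry per vocabulary term
    p_w_given_z -- K rows, each a distribution over the vocabulary
    (Hypothetical helper: names and defaults are illustrative assumptions.)
    """
    num_topics = len(p_w_given_z)
    p_z_given_d = [1.0 / num_topics] * num_topics  # uniform start (assumption)
    for _ in range(n_iter):
        expected = [0.0] * num_topics
        for w, n_dw in enumerate(counts):
            if n_dw == 0:
                continue
            # E-step: posterior P(z | d_new, w) is proportional to
            # P(w | z) * P(z | d_new).
            joint = [p_w_given_z[z][w] * p_z_given_d[z]
                     for z in range(num_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # term has zero probability under every topic
            for z in range(num_topics):
                expected[z] += n_dw * joint[z] / norm
        total = sum(expected)
        if total == 0.0:
            break  # nothing in the document can be explained; keep the start
        # M-step: renormalise expected counts; P(w | z) is never updated.
        p_z_given_d = [value / total for value in expected]
    return p_z_given_d


def mean_squared_error(estimate, reference):
    """Mean squared error between two equally long lists of probabilities."""
    return sum((a - b) ** 2 for a, b in zip(estimate, reference)) / len(estimate)


# Toy model with two topics over a four-term vocabulary (made-up numbers).
toy_p_w_given_z = [[0.5, 0.5, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.5]]
topic_mix = fold_in([3, 1, 0, 0], toy_p_w_given_z)
```

Because the term distributions never change, folding-in is cheap, but words that the trained model has never seen contribute nothing to the estimate. That is one plausible reading of the "potentially inaccurate" behaviour the abstract refers to. The `mean_squared_error` helper shows the kind of comparison the abstract describes, applied to an updated estimate and a reference estimate from a full retraining.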
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kotropoulos, C., Papaioannou, A. (2006). A Novel Updating Scheme for Probabilistic Latent Semantic Indexing. In: Antoniou, G., Potamias, G., Spyropoulos, C., Plexousakis, D. (eds) Advances in Artificial Intelligence. SETN 2006. Lecture Notes in Computer Science, vol 3955. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752912_16
DOI: https://doi.org/10.1007/11752912_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34117-8
Online ISBN: 978-3-540-34118-5
eBook Packages: Computer Science, Computer Science (R0)