Abstract
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then to discern further structure among the documents found. This problem is especially important in the social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
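The idea of restricting keywords to intervals of topic assignments can be illustrated with a small sketch. The abstract does not specify the inference procedure, so the code below assumes a standard collapsed Gibbs sampler for LDA in which a keyword token may only be sampled from topics inside its interval, while all other words remain unrestricted. The function name `interval_lda_gibbs`, the `keyword_intervals` mapping, and the hyperparameter defaults are hypothetical choices made for illustration, not the authors' implementation.

```python
import random


def interval_lda_gibbs(docs, num_topics, keyword_intervals,
                       alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with interval-restricted keywords.

    docs: list of documents, each a list of word strings.
    keyword_intervals: dict mapping a keyword to (lo, hi), meaning the
        keyword may only be assigned to topics lo, lo+1, ..., hi-1.
        (Assumed encoding of the "intervals of topic assignments".)
    Returns (z, n_dk, n_kw): per-token topic assignments, document-topic
    counts, and topic-word counts.
    """
    for word, (lo, hi) in keyword_intervals.items():
        if not 0 <= lo < hi <= num_topics:
            raise ValueError(f"bad interval for {word!r}: {(lo, hi)}")

    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    def allowed(word):
        # Keywords are confined to their interval; other words may use any topic.
        lo, hi = keyword_intervals.get(word, (0, num_topics))
        return list(range(lo, hi))

    n_dk = [[0] * num_topics for _ in docs]     # document-topic counts
    n_kw = [dict() for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                      # tokens per topic
    z = []

    # Random initialisation that already respects the intervals.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.choice(allowed(w))
            z_doc.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] = n_kw[k].get(w, 0) + 1
            n_k[k] += 1
        z.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1

                # Usual collapsed Gibbs weights, computed only over the
                # topics this word is allowed to take.
                topics = allowed(w)
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t].get(w, 0) + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]

                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] = n_kw[k].get(w, 0) + 1
                n_k[k] += 1

    return z, n_dk, n_kw
```

Calling the function with, for example, `keyword_intervals={"ethnic": (0, 2)}` and `num_topics=6` keeps every occurrence of the keyword in topics 0 and 1, so documents that load heavily on those topics become candidates for closer reading, and the two topics in the interval can split the keyword-related documents into finer subgroups.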
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bodrunova, S., Koltsov, S., Koltsova, O., Nikolenko, S., Shimorina, A. (2013). Interval Semi-supervised LDA: Classifying Needles in a Haystack. In: Castro, F., Gelbukh, A., González, M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_21
DOI: https://doi.org/10.1007/978-3-642-45114-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45113-3
Online ISBN: 978-3-642-45114-0
eBook Packages: Computer Science, Computer Science (R0)