Abstract
In this brief tutorial, we provide an overview of investigating text-based user-generated content for information that is relevant in the corporate context. We structure the overall process along three stages: collection, analysis, and visualization. Corresponding to the stages we outline challenges and basic techniques to extract information of different levels of granularity.

Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
It is to note that each document might consist of multiple topics.
This means, the opinion “nice display” would consist of aspect “display” and sentiment “nice”.
References
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Brants T, Chen F, Tsochantaridis I (2002) Topic-based document segmentation with probabilistic latent semantic analysis. In: CIKM ’02. ACM, New York, pp 211–218
Cai D, Yu S, Wen J, Ma W (2004) Block-based web search. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. doi:10.1145/1008992.1009070
Card SK, Mackinlay JD, Shneiderman B (1999) Readings in information visualization: using vision to think. Morgan Kaufmann, San Mateo
Chen C, Wu K, Srinivasan V, Zhang X (2011) Battling the internet water army: detection of hidden paid posters. arXiv:1111.4297 [cs.SI]
Collins M, Singer Y (1999) Unsupervised models for named entity classification, pp 100–110
Covington MA (2001) A fundamental algorithm for dependency parsing, pp 95–102
Decker R, Trusov M (2010) Estimating aggregate consumer preferences from online product reviews. Int J Res Mark 27(4):293–307
Deng C, Shipeng Y, Ji-Rong W, Wei-Ying M (2003) VIPS: a vision-based page segmentation algorithm
Evert S (2007) StupidOS: a high-precision approach to boilerplate removal. In: Building and exploring web corpora: proceedings of the 3rd web as corpus workshop, incorporating cleaneval, vol 123, p 123
Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge
González-Ibáñez R, Muresan S, Wacholder N (2011) Identifying sarcasm in Twitter: a closer look, pp 581–586
Gupta S, Kaiser G, Neistadt D, Grimm P (2003) DOM-based content extraction of HTML documents
Hatzivassiloglou V, McKeown KR, Hatzivassiloglou V, McKeown KR (1997) Predicting the semantic orientation of adjectives. In: The 35th annual meeting, pp 174–181
Jindal N, Liu B (2007) Analyzing and detecting review spam, pp 547–552
Kamps J, Marx MJ, Mokken RJ, De Rijke M (2004) Using wordnet to measure semantic orientations of adjectives
Karlgren J, Cutting D (1994) Recognizing text genres with simple metrics using discriminant analysis. Association for Computational Linguistics, pp 1071–1075
Kaser O, Lemire D (2007) Tag-cloud drawing: algorithms for cloud visualization. Preprint, arXiv:cs/0703109 [cs.DS]
Keim D, Andrienko G, Fekete JD, Görg C, Kohlhammer J, Melançon G (2008) Visual analytics: definition, process, and challenges. Inf Vis 4950:154–175
Kessler B, Numberg G, Schütze H (1997) Automatic detection of text genre. Association for Computational Linguistics, pp 32–38
Lim EP, Nguyen VA, Jindal N, Liu B, Lauw HW (2010) Detecting product review spammers using rating behaviors, pp 939–948
Liu B, Hu M, Cheng J (2005) Opinion observer: analysing and comparing opinions on the web. Chiba, Japan, pp 342–351
Liu B (2010) Sentiment analysis and subjectivity. Handbook of natural language processing. CRC Press, Boca Raton, pp 627–666
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of EMNLP, pp 79–86
Porter MF (1980) An algorithm for suffix stripping. In: Proc ACM SIGIR conference on conference on research and development in information retrieval
Qiu G, Liu B, Bu J, Chen C (2006) Opinion word expansion and target extraction through double propagation. Association for Computational Linguistics
Qiu G, Liu B, Bu J, Chen C (2009) Expanding domain sentiment lexicon through double propagation. In: Proceedings of the 21st international jont conference on artifical intelligence, pp 1199–1204
Qiu G, Liu B, Bu J, Chen C (2011) Opinion word expansion and target extraction through double propagation. Comput Linguist 37(1):9–27
Robertson SE (1981) The methodology of information retrieval experiment. In: Information retrieval experiment, pp 9–31
Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill, New York
Schiller A (1996) Multilingual finite-state noun phrase extraction
Song R, Liu H, Wen J (2004) Learning block importance models for web pages
Steinberger J, Ebrahim M, Ehrmann M, Hurriyetoglu A, Kabadjov M, Lenkova P, Steinberger R, Tanev H, Zquez SV, Zavarella V (2012) Creating sentiment dictionaries via triangulation. Decis Support Syst 53(4):689–694
Stoyanov V, Cardie C (2006) Partially supervised coreference resolution for opinion summarization through structured rule learning
Theobald M, Siddharth J, Paepcke A (2008) SpotSigs: robust and efficient near duplicate detection in large web collections. ACM, New York, pp 563–570
Tufte ER (1990) Envisioning information. Graphics Press, Cheshire
Vickery G, Wunsch-Vincent S (2007) Participative web and user-created content: web 2.0 wikis and social networking. OECD, Paris
Wang Y, Fang B, Cheng X, Guo L, Xu H (2008) Incremental web page template detection by text segments, pp 174–180
Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis. In: CIKM ’05. ACM, New York, pp 625–631
Yu S, Cai D, Wen J, Ma W (2003) Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Proceedings of the 12th international conference on world wide web
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Egger, M., Lang, A. A Brief Tutorial on How to Extract Information from User-Generated Content (UGC). Künstl Intell 27, 53–60 (2013). https://doi.org/10.1007/s13218-012-0224-1
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13218-012-0224-1