Abstract
User profiling on social media data is normally done within a supervised setting. A typical feature of supervised models that are trained on data from a specific genre, is their limited portability to other genres. Cross-genre models were developed in the context of PAN 2016, where systems were trained on tweets, and tested on other non-tweet social media data. Did the model that achieved best results at this task got lucky or was it truly designed in a cross-genre manner, with features general enough to capture demographics beyond Twitter? We explore this question via a series of in-genre and cross-genre experiments on English and Spanish using the best performing system at PAN 2016, and discover that portability is successful to a certain extent, provided that the sub-genres involved are close enough. In such cases, it is also more beneficial to do cross-genre than in-genre modelling if the cross-genre setting can benefit from larger amounts of training data than those available in-genre.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
- 2.
Contrary to what was stated in [4], TweetTokenizer is used for blogs, not word_tokenizer.
- 3.
In this paper we do not investigate the contribution of individual features, but insights on this can be found in the detailed description of GronUP [4].
- 4.
- 5.
The choice for these languages is due to the availability of data, but the model could be trained on any language for which preprocessing tools (tokenizer, PoS-tagging, dictionary) and training data are available.
- 6.
References
Álvarez-Carmona, M.A., López-Monroy, A.P., Montes-y Gómez, M., Villaseñor-Pineda, L., Jair-Escalante, H.: INAOE’s participation at PAN’15: author profiling task. In: Proceedings of CLEF (2015)
Bird, S., Klein, E.: Natural Language Processing with Python. O’Reilly Media Inc., Sebastopol (2009)
Brants, T.: TnT: a statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 224–231. Association for Computational Linguistics (2000)
Busger op Vollenbroek, M., Carlotto, T., Kreutz, T., Medvedeva, M., Pool, C., Bjerva, J., Haagsma, H., Nissim, M.: GronUP: Groningen user profiling. In: Working Notes of CLEF, pp. 846–857. CEUR Workshop Proceedings. CEUR-WS.org (2016)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Perkins, J.: Python 3 Text Processing with NLTK 3 Cookbook. Packt Publishing Ltd. (2014)
Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF (2015)
Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) Working Notes for CLEF 2014 Conference, Sheffield, 15–18 September 2014. CEUR Workshop Proceedings, vol. 1180, pp. 898–927. CEUR-WS.org (2014)
Rangel, F., Rosso, P., Moshe Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pp. 352–365. CELCT (2013)
Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings. CLEF and CEUR-WS.org, September 2016
Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of age and gender on blogging, vol. SS-06-03, pp. 191–197. AAAI Press (2006)
Schnoebelen, T.: Do you smile with your nose? Stylistic variation in Twitter emoticons. Univ. Pa. Work. Pap. Linguist. 18(2), 117–125 (2012)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Medvedeva, M., Haagsma, H., Nissim, M. (2017). An Analysis of Cross-Genre and In-Genre Performance for Author Profiling in Social Media. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science(), vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_21
Download citation
DOI: https://doi.org/10.1007/978-3-319-65813-1_21
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65812-4
Online ISBN: 978-3-319-65813-1
eBook Packages: Computer ScienceComputer Science (R0)