Abstract
Applying text classification to find social media posts relevant to a topic of interest is the focus of a substantial amount of research. A key challenge is how to select a good training set of posts to label. This problem has traditionally been solved using active learning. However, active learning assumes access to all posts in the collection, which is often unrealistic because social networks limit the number of posts that can be retrieved through their search APIs. To address this problem, which we refer to as the training post retrieval over constrained search interfaces problem, we propose several keyword selection algorithms that, given a topic, generate an effective set of keyword queries to submit to the search API. The returned posts are labeled and used as a training dataset for post classifiers. Our experiments compare the proposed keyword selection algorithms to several baselines across various topics from three sources. The results show that the proposed methods generate superior training sets, as measured by the balanced accuracy of the trained classifiers.
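The abstract names balanced accuracy as the evaluation measure without defining it. As a minimal sketch, assuming the usual definition (the mean of per-class recall, as in Brodersen et al. 2010), the metric can be computed as follows. The function name and the toy labels are illustrative assumptions and are not taken from the paper or its code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled example is required")
    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 1 = relevant post, 0 = irrelevant post.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)

# A degenerate classifier that marks every post as irrelevant.
all_negative = balanced_accuracy(y_true, [0] * len(y_true))
```

On this imbalanced toy set, the degenerate classifier that marks every post as irrelevant reaches a plain accuracy of 0.75 but a balanced accuracy of only 0.5, which illustrates why the metric suits topics where relevant posts are rare.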





Data Availability Statement
The datasets used in our experiments were collected from DailyStrength, Reddit, and Misra (2018).
Notes
This value was used by the experiments in Wang et al. (2016).
References
Ahmad S, Asghar MZ, Alotaibi FM, Awan I (2019) Detection and classification of social media-based extremist affiliations using sentiment analysis techniques. Hum Cent Comput Inf Sci 9:24
Balsamo D, Bajardi P, Panisson A (2019) Firsthand opiates abuse on social media: monitoring geospatial patterns of interest through a digital cohort. Proc WWW 2019:2572–2579
Bissoyi S, Mishra BK, Patra MR (2016) Recommender systems in a patient centric social network—a survey. Proc SCOPES 2016:386–389
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Brodersen KH, Ong CS, Stephan KE, Buhmann JM (2010) The balanced accuracy and its posterior distribution. Proc ICPR 2010:3121–3124
Croft WB, Metzler D, Strohman T (2010) Search engines: information retrieval in practice. Addison-Wesley, Boston
de Lira VM, Macdonald C, Ounis I, Perego R, Renso C, Times VC (2019) Event attendance classification in social media. Inform Process Manag 56(3):687–703
Elkan C, Noto K (2008) Learning classifiers from only positive and unlabeled data. Proc SIGKDD 2008:213–220
Goudjil M, Koudil M, Bedda M, Ghoggali N (2018) A novel active learning method using SVM for text classification. Int J Automat Comput 15(3):290–298
Kim Y (2014) Convolutional neural networks for sentence classification. Proc EMNLP 2014:1746–1751
Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Li C, Xing J, Sun A, Ma Z (2016) Effective document labeling with very few seed words: a topic model approach. Proc CIKM 2016:85–94
Li C, Zhou W, Ji F, Duan Y, Chen H (2018) A deep relevance model for zero-shot document filtering. In: Proc 56th annu meeting ACL, pp 2300–2310
Li H, Liu B, Mukherjee A, Shao J (2014) Spotting fake reviews using positive-unlabeled learning. Comput Sist 18(3):467–475
Li R, Wang S, Cheng KC (2013) Towards social data platform: automatic topic-focused monitor for Twitter stream. Proc VLDB Endow 6(14):1966–1977
Li X, Liu B (2003) Learning to classify texts using positive and unlabeled data. Proc IJCAI 2003:587–592
Misra R (2018) News category dataset. ResearchGate. https://doi.org/10.13140/RG.2.2.20331.18729
Pearson K (1895) Note on regression and inheritance in the case of two parents. Proc R Soc Lond 58(347–352):240–242
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Pohl D, Bouchachia A, Hellwagner H (2018) Batch-based active learning: application to social media data for crisis management. Expert Syst Appl 93:232–244
Proskurnia J, Mavlyutov R, Castillo C, Aberer K, Cudre-Mauroux P (2017) Efficient document filtering using vector space topic expansion and pattern-mining: the case of event detection in microposts. Proc CIKM 2017:457–466
Rao J, Yang W, Zhang Y, Ture F, Lin J (2019) Multi-perspective relevance matching with hierarchical convnets for social media search. In: Proc 33rd AAAI conf artif intell, pp 232–240
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proc LREC 2010 workshop new challenges NLP frameworks, pp 45–50
Rivas R, Sadah SA, Guo Y, Hristidis V (2020) Classification of health-related social media posts: evaluation of post content classifier models and analysis of user demographics. JMIR Pub Health Surv 6(2):e14952
Ruiz E, Hristidis V, Ipeirotis PG (2014) Efficient filtering on hidden document streams. In: Proc ICWSM
Sadri M, Mehrotra S, Yu Y (2016) Online adaptive topic focused tweet acquisition. Proc CIKM 2016:2353–2358
Shen S, Murzintcev N, Song C, Cheng C (2017) Information retrieval of a disaster event from cross-platform social media. Inf Discov Deliv 45(4):220–226
Smailovic J, Grcar M, Lavrac N, Znidarsic M (2014) Stream-based active learning for sentiment analysis in the financial domain. Inf Sci 285:181–203
Thorndike RL (1953) Who belongs in the family? Psychometrika 18(4):267–276
Wang S, Chen Z, Liu B, Emery S (2016) Identifying search keywords for finding relevant social media posts. In: Proc 30th AAAI conf artif intell, pp 3052–3058
Zhang Y, Lease M, Wallace BC (2017) Active discriminative text representation learning. In: Proc 31st AAAI conf artif intell, pp 3386–3392
Zhang Y, Zhao P, Cao J, Ma W, Huang J, Wu Q, Tan M (2018) Online adaptive asymmetric active learning for budgeted imbalanced data. Proc SIGKDD 2018:2768–2777
Funding
This work was supported by NSF Grants IIS-1838222 and IIS-1901379.
Code availability
The code used in our experiments is available from: https://github.com/rriva002/Training-Post-Retrieval.
Additional information
Responsible editor: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Rivas, R., Hristidis, V. Effective social post classifiers on top of search interfaces. Data Min Knowl Disc 35, 1809–1829 (2021). https://doi.org/10.1007/s10618-021-00768-2
DOI: https://doi.org/10.1007/s10618-021-00768-2