Abstract
Most empirical evaluations of active learning approaches in the literature have focused on a single classifier and a single performance measure. We present an extensive empirical evaluation of common active learning baselines using two probabilistic classifiers and several performance measures on a number of large datasets. In addition to providing important practical advice, our findings highlight the importance of overlooked choices in active learning experiments in the literature. For example, one of our findings shows that model selection is as important as devising an active learning approach, and that relying on a single classifier and a single performance measure can often lead to unexpected and unwarranted conclusions. Active learning should generally improve the model's capability to distinguish between instances of different classes, but our findings show that the improvements provided by active learning for one performance measure often come at the expense of another. We present several such results, raise questions, guide users and researchers to better alternatives, caution against unforeseen side effects of active learning, and suggest future research directions.
Notes
1. In the most extremely imbalanced synthetic datasets (1% positive class distribution), the results were mixed across measures.
2. Both Weka and scikit-learn have a multinomial naïve Bayes implementation for text classification.
References
Abe N, Mamitsuka H (1998) Query learning strategies using boosting and bagging. In: Proceedings of the international conference on machine learning (ICML), pp 1–9
Ali A, Caruana R, Kapoor A (2014) Active learning with model selection. In: Proceedings of the AAAI conference on artificial intelligence, pp 1673–1679
Arora S, Nyberg E, Rose C (2009) Estimating annotation cost for active learning in a multiannotator environment. In: NAACL HLT Workshop on Active Learning for Natural Language Processing, pp 18–26
Attenberg J, Melville P, Provost F (2010) A unified approach to active dual supervision for labeling features and examples. In: Proceedings of the European conference on machine learning (ECML), pp 40–55
Balcan MF, Hanneke S, Wortman J (2008) The true sample complexity of active learning. In: Proceedings of the conference on learning theory (COLT), pp 45–56
Baldridge J, Osborne M (2004) Active learning and the total cost of annotation. In: Proceedings of the conference on empirical methods in natural language processing, pp 9–16
Bay SD, Kibler DF, Pazzani MJ, Smyth P (2000) The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explor 2:81
Bilgic M (2012) Combining active learning and dynamic dimensionality reduction. In: Proceedings of the SIAM international conference on data mining (SDM)
Bilgic M, Bennett PN (2012) Active query selection for learning rankers. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, pp 1033–1034
Bilgic M, Getoor L (2008) Effective label acquisition for collective classification. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 43–51
Bilgic M, Getoor L (2009) Reflect and correct: a misclassification prediction approach to active inference. ACM Trans Knowl Discov Data 3(4):1–32
Bilgic M, Getoor L (2010) Active inference for collective classification. In: Proceedings of the conference on artificial intelligence (AAAI NECTAR track), pp 1652–1655
Bilgic M, Getoor L (2011) Value of information lattice: Exploiting probabilistic independence for effective feature subset acquisition. J Artif Intell Res (JAIR) 41:69–95
Bilgic M, Mihalkova L, Getoor L (2010) Active learning for networked data. In: Proceedings of the international conference on machine learning (ICML)
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Chao C, Cakmak M, Thomaz AL (2010) Transparent active learning for robots. In: Proceedings of the ACM/IEEE international conference on human-robot interaction, pp 317–324
Chen D, Bilgic M, Getoor L, Jacobs D (2009) Efficient resource-constrained retrospective analysis of long video sequences. In: Proceedings of the NIPS workshop on adaptive sensing, active learning and experimental design: theory, methods and applications
Chen D, Bilgic M, Getoor L, Jacobs D (2011a) Dynamic processing allocation in video. IEEE Trans Pattern Anal Mach Intell 33:2174–2187
Chen D, Bilgic M, Getoor L, Jacobs D, Mihalkova L, Yeh T (2011b) Active inference for retrieval in camera networks. In: Proceedings of the workshop on person oriented vision
Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221
Cohn DA, Ghahramani Z, Jordan MI (1996) Active learning with statistical models. J Artif Intell Res 4:129–145
Culver M, Kun D, Scott S (2006) Active learning to maximize area under the ROC curve. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 149–158
Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Proceedings of the international conference on machine learning (ICML), pp 150–157
Dasgupta S, Monteleoni C, Hsu DJ (2007) A general agnostic active learning algorithm. In: Proceedings of the conference on advances in neural information processing systems (NIPS), pp 353–360
Donmez P, Carbonell JG (2008) Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In: Proceedings of the ACM conference on information and knowledge management (CIKM)
Donmez P, Carbonell JG, Schneider J (2009) Efficiently learning the accuracy of labeling sources for selective sampling. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 259–268
Druck G, Settles B, McCallum A (2009) Active learning by labeling features. In: Proceedings of the conference on empirical methods in natural language processing, pp 81–90
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. J Mach Learn Res 9:1871–1874
Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28(2):133–168
Frey PW, Slate DJ (1991) Letter recognition using Holland-style adaptive classifiers. Mach Learn 6:161
Fu Y, Zhu X, Li B (2012) A survey on instance selection for active learning. Knowl Inf Syst 35(2):249–283
Giarratano JC, Riley G (1998) Expert Systems, 3rd edn. PWS Publishing Co., Boston
Guo Y, Schuurmans D (2008) Discriminative batch mode active learning. In: Proceedings of the conference on advances in neural information processing systems (NIPS), pp 593–600
Guyon I, Cawley G, Dror G, Lemaire V (2011) Datasets of the active learning challenge. In: Proceedings of the JMLR workshop on active learning and experimental design, vol 16, pp 19–45
Haertel R, Seppi K, Ringger E, Carroll J (2008) Return on investment for active learning. In: NIPS workshop on cost-sensitive learning
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the international conference on machine learning (ICML), pp 353–360
Hoi SC, Jin R, Lyu MR (2006a) Large-scale text categorization by batch mode active learning. In: Proceedings of the international conference on world wide web (WWW), pp 633–642
Hoi SCH, Jin R, Zhu J, Lyu MR (2006b) Batch mode active learning and its application to medical image classification. In: Proceedings of the international conference on machine learning (ICML), pp 417–424
Jensen D, Neville J, Gallagher B (2004) Why collective inference improves relational classification. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 593–598
Kapoor A, Horvitz E, Basu S (2007) Selective supervision: Guiding supervised learning with decision-theoretic active learning. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), vol 7, pp 877–882
Komurlu C, Bilgic M (2016) Active inference and dynamic Gaussian Bayesian networks for battery optimization in wireless sensor networks. In: Proceedings of the AAAI workshop on artificial intelligence for smart grids and smart buildings
Komurlu C, Shao J, Bilgic M (2014) Dynamic Bayesian network modeling of vascularization in engineered tissues. In: Proceedings of the 11th annual conference on uncertainty in artificial intelligence—workshop on Bayesian modeling applications, pp 89–98
Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML), pp 282–289
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, pp 3–12
Long B, Chapelle O, Zhang Y, Chang Y, Zheng Z, Tseng B (2010) Active learning for ranking through expected loss optimization. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, pp 267–274
Margineantu DD (2005) Active cost-sensitive learning. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1622–1623
Melville P, Mooney RJ (2004) Diverse ensembles for active learning. In: Proceedings of the international conference on machine learning (ICML), pp 584–591
Melville P, Sindhwani V (2009) Active dual supervision: Reducing the cost of annotating examples and features. In: Proceedings of the NAACL HLT workshop on active learning for natural language processing, pp 49–57
Melville P, Saar-Tsechansky M, Provost F, Mooney R (2004) Active feature-value acquisition for classifier induction. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 483–486
Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Proceedings of the conference on advances in neural information processing systems (NIPS), vol 14, pp 841–848
Pace RK, Barry R (1997) Sparse spatial autoregressions. Stat Probab Lett 33(3):291–297
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Qian B, Wang X, Wang F, Li H, Ye J, Davidson I (2013) Active learning from relative queries. In: Proceedings of the international joint conference on artificial intelligence (IJCAI)
Raghavan H, Allan J (2007) An interactive algorithm for asking and incorporating feature feedback into support vector machines. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, pp 79–86
Ramirez-Loaiza ME (2016) Anytime active learning. PhD thesis, Illinois Institute of Technology
Ramirez-Loaiza ME, Culotta A, Bilgic M (2013) Towards anytime active learning: Interrupting experts to reduce annotation costs. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics (IDEA), pp 87–94
Ramirez-Loaiza ME, Culotta A, Bilgic M (2014) Anytime active learning. In: Proceedings of the AAAI conference on artificial intelligence, pp 2048–2054
Rattigan M, Maier M, Jensen D (2007) Exploiting network structure for active inference in collective classification. In: Proceedings of the international conference on machine learning workshop on mining graphs and complex structures, pp 429–434
Roy N, McCallum A (2001) Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the international conference on machine learning (ICML), pp 441–448
Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Mach Learn 54(2):153–178
Schein AI, Ungar LH (2007) Active learning for logistic regression: an evaluation. Mach Learn 68(3):235–265
Sculley D (2007) Online active learning methods for fast label-efficient spam filtering. In: Proceedings of the conference on email and anti-spam, vol 7, pp 173–180
Segal R, Markowitz T, Arnold W (2006) Fast uncertainty sampling for labeling large e-mail corpora. In: Proceedings of the conference on email and anti-spam
Settles B (2012) Active learning. Synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool Publishers, San Rafael
Settles B, Craven M (2008) An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the conference on empirical methods in natural language processing, pp 1070–1079
Settles B, Craven M, Friedland L (2008) Active learning with real annotation costs. In: Proceedings of the NIPS workshop on cost-sensitive learning
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the conference on learning theory (COLT), pp 287–294
Sharma M, Bilgic M (2013) Most-surely vs. least-surely uncertain. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 667–676
Sharma M, Bilgic M (2016) Evidence-based uncertainty sampling for active learning. Data Min Knowl Discov 1–39
Sharma M, Zhuang D, Bilgic M (2015) Active learning with rationales for text classification. In: Proceedings of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pp 441–451
Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 614–622
Sindhwani V, Melville P, Lawrence RD (2009) Uncertainty sampling and transductive experimental design for active dual supervision. In: Proceedings of the international conference on machine learning (ICML), pp 953–960
Small K, Wallace BC, Brodley CE, Trikalinos TA (2011) The constrained weight space SVM: learning with ranked features. In: Proceedings of the international conference on machine learning (ICML), pp 865–872
Thompson CA, Califf ME, Mooney RJ (1999) Active learning for natural language parsing and information extraction. In: Proceedings of the international conference on machine learning (ICML), pp 406–414
Tomanek K, Wermter J, Hahn U (2007) An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In: Proceedings of the joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp 486–495
Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the 9th ACM international conference on multimedia, pp 107–118
Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66
Vijayanarasimhan S, Grauman K (2009) What’s it going to cost you? predicting effort vs. informativeness for multi-label image annotations. In: Conference on computer vision and pattern recognition (CVPR), pp 2262–2269
Wallace BC, Small K, Brodley CE, Trikalinos TA (2011) Who should label what? Instance allocation in multiple expert active learning. In: Proceedings of the SIAM international conference on data mining (SDM), pp 176–187
Wang L (2009) Sufficient conditions for agnostic active learnable. In: Proceedings of the conference on advances in neural information processing systems (NIPS), pp 1999–2007
Xu Z, Yu K, Tresp V, Xu X, Wang J (2003) Representative sampling for text classification using support vector machines. Adv Inf Retr 393–407
Zaidan O, Eisner J, Piatko CD (2007) Using “annotator rationales” to improve machine learning for text categorization. In: Proceedings of the annual conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pp 260–267
Zhang C, Chen T (2002) An active learning framework for content-based information retrieval. IEEE Trans Multimed 4(2):260–268
Acknowledgments
This material is based upon work supported by the National Science Foundation CAREER award no. IIS-1350337.
Responsible editor: Johannes Fuernkranz.
Appendix: Open-source active learning library: PyAL
We performed the experiments in this article using Weka; for naïve Bayes, we used Weka's own implementation, and for logistic regression, we used Weka's interface to LibLinear (Fan et al. 2008), version 1.7. We later rewrote the code in Python, integrated it with scikit-learn (Pedregosa et al. 2011), and released it as open source under the name PyAL.
We created a dedicated website for this project at http://www.cs.iit.edu/~ml/projects/empirical-study. The website currently has:
- The Java libraries that are necessary to repeat the experiments performed in this paper
- The synthetic datasets that were used in this study
- The link to the GitHub repository for the PyAL library
- A side-by-side comparison of the results obtained using the Java version of the code versus PyAL
1.1 Similarities and differences between PyAL and Weka results
We repeated the main set of experiments using PyAL and compared the results side-by-side with those obtained using Weka. To save space, we include the figures on the project website http://www.cs.iit.edu/~ml/projects/empirical-study and here include only the t test results in Tables 8, 9, and 10.
The actual win/tie/loss counts using Weka and PyAL are not identical, but they vary very little, and hence the trends and the main results obtained using Weka also hold for PyAL. Below, we discuss some of the similarities and differences between the Weka and PyAL implementations.
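For readers reproducing the comparison, the sketch below shows one standard way to tabulate win/tie/loss counts with paired t tests. The significance level of 0.05 and the data layout are illustrative assumptions, not necessarily the exact procedure behind Tables 8, 9, and 10.

```python
# Minimal sketch: tabulating win/tie/loss counts from paired t tests.
# ASSUMPTIONS: results_a/results_b map each dataset to per-trial scores
# (e.g., AUC over repeated runs); alpha=0.05 is illustrative.
from scipy import stats

def win_tie_loss(results_a, results_b, alpha=0.05):
    wins = ties = losses = 0
    for dataset in results_a:
        t_stat, p_value = stats.ttest_rel(results_a[dataset],
                                          results_b[dataset])
        if p_value >= alpha:
            ties += 1        # no statistically significant difference
        elif t_stat > 0:
            wins += 1        # method A significantly better
        else:
            losses += 1      # method B significantly better
    return wins, ties, losses
```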
1.1.1 Logistic regression results
The experiments in this paper used Weka's interface to LibLinear (version 1.7) for logistic regression. Scikit-learn's logistic regression also uses LibLinear under the hood, and hence the logistic regression results are almost identical (modulo the random number sequences of Python vs. Java) for most datasets. The biggest visible difference occurs on the California Housing dataset; we verified that it is due to the older port of LibLinear that Weka used.
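The following is a minimal sketch of the scikit-learn side of this setup; the toy data and the regularization setting C=1.0 are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: scikit-learn logistic regression backed by LIBLINEAR,
# mirroring the Weka->LibLinear setup described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the paper's datasets.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# solver='liblinear' routes to the LIBLINEAR library; C=1.0 is illustrative.
clf = LogisticRegression(solver='liblinear', C=1.0)
clf.fit(X, y)
probs = clf.predict_proba(X)  # class probabilities used by the AL strategies
```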
1.1.2 Naïve Bayes results
For naïve Bayes, scikit-learn has two generic implementations: a Bernoulli naïve Bayes for datasets with binary features and a Gaussian naïve Bayes for datasets with continuous features. Weka, on the other hand, has a generic naïve Bayes implementation that can work with datasets that have mixed feature types: binary, continuous, and categorical (see Note 2).
For datasets that contain only binary features, scikit-learn's and Weka's naïve Bayes implementations are equivalent. All synthetic datasets and two real datasets, Hiva and Nova, have only binary features. For these datasets, the naïve Bayes results for PyAL and Weka are almost identical, except for minor differences due to the random number sequences of Python versus Java.
For datasets that contain only continuous features, both scikit-learn's and Weka's naïve Bayes implementations reduce to Gaussian naïve Bayes. Six of the 10 real datasets have features that are all continuous (Table 2). The remaining two real datasets, KDD99 and Sylva, have mixed feature types. Weka's naïve Bayes implementation can handle a mix of features, whereas scikit-learn's requires all features to be either continuous or binary, and hence these datasets need to be pre-processed to conform to one of these formats. For these datasets, there are visible differences between the PyAL and Weka learning curves, some of which are significant, though, as the t test tables show, the general conclusions (e.g., RND being competitive in AUC) still hold.
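As a concrete illustration of this split on the scikit-learn side, consider the sketch below; the random toy data and the binarization threshold are illustrative assumptions, not the paper's actual preprocessing.

```python
# Sketch of the naive Bayes split described above: BernoulliNB for
# all-binary features, GaussianNB for all-continuous ones.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)

X_binary = rng.integers(0, 2, size=(100, 10))  # all-binary (e.g., Hiva, Nova)
X_continuous = rng.normal(size=(100, 10))      # all-continuous features

BernoulliNB().fit(X_binary, y)       # matches Weka's handling of binary data
GaussianNB().fit(X_continuous, y)    # Gaussian NB on continuous data

# Mixed-type datasets (e.g., KDD99, Sylva) must first be made uniform,
# for instance by binarizing continuous features at a threshold:
X_binarized = (X_continuous > 0.0).astype(int)
BernoulliNB().fit(X_binarized, y)
```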
1.2 PyAL library
The PyAL code consists of the following components (a sketch of how they fit together follows the list):
- an active learning algorithm implementation that, given the parameters for an active learning session, such as the underlying classifier, an active learning strategy, and a budget, runs an active learning session and evaluates the classifier at each step of the active learning,
- an active learning API, which provides the base classes for choosing a bootstrap, the base classes for choosing the next instance(s) to be labeled at each step of the labeling process, and implementations of a few active learning approaches,
- a command-line interface, which reads the active learning settings from the command line, loads the dataset(s), runs the active learning code, plots the results, and saves the results to files,
- and a GUI written in Tkinter as a visual alternative to the command-line interface.
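The following is a minimal sketch of the session loop that the first component implements: bootstrap, retrain, evaluate, query, and repeat until the budget is exhausted. All names here are illustrative assumptions for exposition and do not reflect PyAL's actual API.

```python
# Minimal sketch of an active learning session loop (names illustrative).
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def active_learning_session(clf, X_pool, y_pool, X_test, y_test,
                            bootstrap, strategy, budget, batch_size=1):
    labeled = list(bootstrap(y_pool))              # initially labeled indices
    unlabeled = [i for i in range(len(y_pool)) if i not in labeled]
    scores = []
    while len(labeled) <= budget and unlabeled:
        # retrain on the labeled pool and evaluate at this step
        model = clone(clf).fit(X_pool[labeled], y_pool[labeled])
        scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
        # ask the strategy which instance(s) to label next
        picked = strategy(model, X_pool, unlabeled, batch_size)
        labeled.extend(picked)                     # oracle reveals y_pool[picked]
        unlabeled = [i for i in unlabeled if i not in picked]
    return scores
```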
The currently implemented bootstrap strategies are (i) random sampling, where the initially labeled instances are chosen completely at random, and (ii) random sampling from each class, where an equal number of random instances is chosen from each class. The code can be extended with additional bootstrap strategies by extending the bootstrap class; for example, unsupervised batch-mode active learning strategies can be used to bootstrap the active learning process.
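To make the two bootstrap strategies concrete, here is a minimal sketch of both; the function names and signatures are assumptions, not PyAL's actual interface.

```python
# Illustrative sketches of the two bootstrap strategies described above.
import numpy as np

def bootstrap_random(y_pool, size, seed=0):
    """Choose the initially labeled instances completely at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(y_pool), size=size, replace=False)

def bootstrap_per_class(y_pool, size, seed=0):
    """Choose an equal number of random instances from each class."""
    rng = np.random.default_rng(seed)
    y_pool = np.asarray(y_pool)
    classes = np.unique(y_pool)
    per_class = size // len(classes)
    picked = [rng.choice(np.where(y_pool == c)[0], size=per_class,
                         replace=False) for c in classes]
    return np.concatenate(picked)
```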
Currently implemented active learning approaches include uncertainty sampling (Lewis and Gale 1994), query-by-committee through bagging (Abe and Mamitsuka 1998), and expected error reduction (Roy and McCallum 2001); additional strategies can be implemented by extending the base strategy class.
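As an illustration of extending a strategy class, below is a minimal sketch of uncertainty sampling (Lewis and Gale 1994) that selects the instances whose top-class probability is lowest; the class and method names are assumptions for exposition, not PyAL's actual API.

```python
# Sketch of uncertainty sampling as a query strategy (names illustrative).
import numpy as np

class UncertaintySampling:
    def query(self, model, X_pool, unlabeled, batch_size=1):
        """Select the instances whose most-likely class has the lowest
        predicted probability, i.e., the least confident predictions."""
        probs = model.predict_proba(X_pool[unlabeled])
        confidence = probs.max(axis=1)            # confidence in the top class
        least_confident = np.argsort(confidence)  # most uncertain first
        return [unlabeled[i] for i in least_confident[:batch_size]]
```

With the session-loop sketch above, this could be plugged in as strategy=UncertaintySampling().query.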
A detailed documentation of the code, access to the GitHub repository, and Java executables can be found at http://www.cs.iit.edu/~ml/projects/empirical-study.