Active learning: an empirical study of common baselines | Data Mining and Knowledge Discovery

Abstract

Most of the empirical evaluations of active learning approaches in the literature have focused on a single classifier and a single performance measure. We present an extensive empirical evaluation of common active learning baselines using two probabilistic classifiers and several performance measures on a number of large datasets. In addition to providing important practical advice, our findings highlight the importance of overlooked choices in active learning experiments in the literature. For example, one of our findings shows that model selection is as important as devising an active learning approach, and choosing one classifier and one performance measure can often lead to unexpected and unwarranted conclusions. Active learning should generally improve the model’s capability to distinguish between instances of different classes, but our findings show that the improvements provided by active learning for one performance measure often came at the expense of another measure. We present several such results, raise questions, guide users and researchers to better alternatives, caution against unforeseen side effects of active learning, and suggest future research directions.

Notes

  1. In the most extremely imbalanced synthetic datasets (1% positive class distribution), the results were mixed across measures.

  2. Both Weka and scikit-learn have a multinomial naïve Bayes implementation for text classification.

References

  • Abe N, Mamitsuka H (1998) Query learning strategies using boosting and bagging. In: Proceedings of the international conference on machine learning (ICML), pp 1–9

  • Ali A, Caruana R, Kapoor A (2014) Active learning with model selection. In: Proceedings of the AAAI conference on artificial intelligence, pp 1673–1679

  • Arora S, Nyberg E, Rose C (2009) Estimating annotation cost for active learning in a multiannotator environment. In: NAACL HLT Workshop on Active Learning for Natural Language Processing, pp 18–26

  • Attenberg J, Melville P, Provost F (2010) A unified approach to active dual supervision for labeling features and examples. In: Proceedings of the European conference on machine learning (ECML), pp 40–55

  • Balcan MF, Hanneke S, Wortman J (2008) The true sample complexity of active learning. In: Proceedings of the conference on learning theory (COLT), pp 45–56

  • Baldridge J, Osborne M (2004) Active learning and the total cost of annotation. In: Proceedings of the conference on empirical methods in natural language processing, pp 9–16

  • Bay SD, Kibler DF, Pazzani MJ, Smyth P (2000) The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explor 2:81

  • Bilgic M (2012) Combining active learning and dynamic dimensionality reduction. In: Proceedings of the SIAM international conference on data mining (SDM)

  • Bilgic M, Bennett PN (2012) Active query selection for learning rankers. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, pp 1033–1034

  • Bilgic M, Getoor L (2008) Effective label acquisition for collective classification. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 43–51

  • Bilgic M, Getoor L (2009) Reflect and correct: a misclassification prediction approach to active inference. ACM Trans Knowl Discov Data 3(4):1–32

  • Bilgic M, Getoor L (2010) Active inference for collective classification. In: Proceedings of the conference on artificial intelligence (AAAI NECTAR track), pp 1652–1655

  • Bilgic M, Getoor L (2011) Value of information lattice: Exploiting probabilistic independence for effective feature subset acquisition. J Artif Intell Res (JAIR) 41:69–95

  • Bilgic M, Mihalkova L, Getoor L (2010) Active learning for networked data. In: Proceedings of the international conference on machine learning (ICML)

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  • Chao C, Cakmak M, Thomaz AL (2010) Transparent active learning for robots. In: Proceedings of the ACM/IEEE international conference on human-robot interaction, pp 317–324

  • Chen D, Bilgic M, Getoor L, Jacobs D (2009) Efficient resource-constrained retrospective analysis of long video sequences. In: Proceedings of the NIPS workshop on adaptive sensing, active learning and experimental design: theory, methods and applications

  • Chen D, Bilgic M, Getoor L, Jacobs D (2011a) Dynamic processing allocation in video. IEEE Trans Pattern Anal Mach Intell 33:2174–2187

  • Chen D, Bilgic M, Getoor L, Jacobs D, Mihalkova L, Yeh T (2011b) Active inference for retrieval in camera networks. In: Proceedings of the workshop on person oriented vision

  • Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221

  • Cohn DA, Ghahramani Z, Jordan MI (1996) Active learning with statistical models. J Artif Intell Res 4:129–145

  • Culver M, Kun D, Scott S (2006) Active learning to maximize area under the ROC curve. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 149–158

  • Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Proceedings of the international conference on machine learning (ICML), pp 150–157

  • Dasgupta S, Monteleoni C, Hsu DJ (2007) A general agnostic active learning algorithm. In: Proceedings of the conference on advances in neural information processing systems (NIPS), pp 353–360

  • Donmez P, Carbonell JG (2008) Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In: Proceedings of the ACM conference on information and knowledge management

  • Donmez P, Carbonell JG, Schneider J (2009) Efficiently learning the accuracy of labeling sources for selective sampling. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 259–268

  • Druck G, Settles B, McCallum A (2009) Active learning by labeling features. In: Proceedings of the conference on empirical methods in natural language processing, pp 81–90

  • Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. J Mach Learn Res 9:1871–1874

  • Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28(2):133–168

  • Frey PW, Slate DJ (1991) Letter recognition using Holland-style adaptive classifiers. Mach Learn 6:161

  • Fu Y, Zhu X, Li B (2012) A survey on instance selection for active learning. Knowl Inf Syst 35(2):249–283

  • Giarratano JC, Riley G (1998) Expert Systems, 3rd edn. PWS Publishing Co., Boston

  • Guo Y, Schuurmans D (2008) Discriminative batch mode active learning. In: Proceedings of the conference on advances in neural information processing systems (NIPS), pp 593–600

  • Guyon I, Cawley G, Dror G, Lemaire V (2011) Datasets of the active learning challenge. In: Proceedings of the JMLR workshop on active learning and experimental design, vol 16, pp 19–45

  • Haertel R, Seppi K, Ringger E, Carroll J (2008) Return on investment for active learning. In: NIPS workshop on cost-sensitive learning

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18

  • Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

  • Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the international conference on machine learning (ICML), pp 353–360

  • Hoi SC, Jin R, Lyu MR (2006a) Large-scale text categorization by batch mode active learning. In: Proceedings of the international conference on world wide web (WWW), pp 633–642

  • Hoi SCH, Jin R, Zhu J, Lyu MR (2006b) Batch mode active learning and its application to medical image classification. In: Proceedings of the international conference on machine learning (ICML), pp 417–424

  • Jensen D, Neville J, Gallagher B (2004) Why collective inference improves relational classification. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 593–598

  • Kapoor A, Horvitz E, Basu S (2007) Selective supervision: Guiding supervised learning with decision-theoretic active learning. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), vol 7, pp 877–882

  • Komurlu C, Bilgic M (2016) Active inference and dynamic Gaussian Bayesian networks for battery optimization in wireless sensor networks. In: Proceedings of AAAI workshop on artificial intelligence for smart grids and smart buildings

  • Komurlu C, Shao J, Bilgic M (2014) Dynamic Bayesian network modeling of vascularization in engineered tissues. In: Proceedings of the 11th annual conference on uncertainty in artificial intelligence, workshop on Bayesian modeling applications, pp 89–98

  • Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML), pp 282–289

  • Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, pp 3–12

  • Long B, Chapelle O, Zhang Y, Chang Y, Zheng Z, Tseng B (2010) Active learning for ranking through expected loss optimization. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval, pp 267–274

  • Margineantu DD (2005) Active cost-sensitive learning. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1622–1623

  • Melville P, Mooney RJ (2004) Diverse ensembles for active learning. In: Proceedings of the international conference on machine learning (ICML), pp 584–591

  • Melville P, Sindhwani V (2009) Active dual supervision: Reducing the cost of annotating examples and features. In: Proceedings of the NAACL HLT workshop on active learning for natural language processing, pp 49–57

  • Melville P, Saar-Tsechansky M, Provost F, Mooney R (2004) Active feature-value acquisition for classifier induction. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 483–486

  • Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. In: Proceedings of the conference on advances in neural information processing systems (NIPS), vol 14, pp 841–848

  • Pace RK, Barry R (1997) Sparse spatial autoregressions. Stat Probab Lett 33(3):291–297

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  • Qian B, Wang X, Wang F, Li H, Ye J, Davidson I (2013) Active learning from relative queries. In: Proceedings of the international joint conference on artificial intelligence (IJCAI)

  • Raghavan H, Allan J (2007) An interactive algorithm for asking and incorporating feature feedback into support vector machines. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, pp 79–86

  • Ramirez-Loaiza ME (2016) Anytime active learning. PhD thesis, Illinois Institute of Technology

  • Ramirez-Loaiza ME, Culotta A, Bilgic M (2013) Towards anytime active learning: Interrupting experts to reduce annotation costs. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics (IDEA), pp 87–94

  • Ramirez-Loaiza ME, Culotta A, Bilgic M (2014) Anytime active learning. In: Proceedings of the AAAI conference on artificial intelligence, pp 2048–2054

  • Rattigan M, Maier M, Jensen D (2007) Exploiting network structure for active inference in collective classification. In: Proceedings of the international conference on machine learning workshop on mining graphs and complex structures, pp 429–434

  • Roy N, McCallum A (2001) Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the international conference on machine learning (ICML), pp 441–448

  • Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Mach Learn 54(2):153–178

  • Schein AI, Ungar LH (2007) Active learning for logistic regression: an evaluation. Mach Learn 68(3):235–265

  • Sculley D (2007) Online active learning methods for fast label-efficient spam filtering. In: Proceedings of the conference on email and anti-spam, vol 7, pp 173–180

  • Segal R, Markowitz T, Arnold W (2006) Fast uncertainty sampling for labeling large e-mail corpora. In: Proceedings of the conference on email and anti-spam

  • Settles B (2012) Active learning. Synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool Publishers, San Rafael

  • Settles B, Craven M (2008) An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the conference on empirical methods in natural language processing, pp 1070–1079

  • Settles B, Craven M, Friedland L (2008) Active learning with real annotation costs. In: Proceedings of the NIPS workshop on cost-sensitive learning

  • Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the conference on learning theory (COLT), pp 287–294

  • Sharma M, Bilgic M (2013) Most-surely vs. least-surely uncertain. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 667–676

  • Sharma M, Bilgic M (2016) Evidence-based uncertainty sampling for active learning. Data Min Knowl Discov 1–39

  • Sharma M, Zhuang D, Bilgic M (2015) Active learning with rationales for text classification. In: North American chapter of the association for computational linguistics. Human Language Technologies, pp 441–451

  • Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 614–622

  • Sindhwani V, Melville P, Lawrence RD (2009) Uncertainty sampling and transductive experimental design for active dual supervision. In: Proceedings of the international conference on machine learning (ICML), pp 953–960

  • Small K, Wallace BC, Brodley CE, Trikalinos TA (2011) The constrained weight space SVM: learning with ranked features. In: Proceedings of the international conference on machine learning (ICML), pp 865–872

  • Thompson CA, Califf ME, Mooney RJ (1999) Active learning for natural language parsing and information extraction. In: Proceedings of the international conference on machine learning (ICML), pp 406–414

  • Tomanek K, Wermter J, Hahn U (2007) An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In: Proceedings of the joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp 486–495

  • Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the 9th ACM international conference on multimedia, pp 107–118

  • Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66

  • Vijayanarasimhan S, Grauman K (2009) What’s it going to cost you? predicting effort vs. informativeness for multi-label image annotations. In: Conference on computer vision and pattern recognition (CVPR), pp 2262–2269

  • Wallace BC, Small K, Brodley CE, Trikalinos TA (2011) Who should label what? Instance allocation in multiple expert active learning. In: Proceedings of the SIAM international conference on data mining (SDM), pp 176–187

  • Wang L (2009) Sufficient conditions for agnostic active learnable. In: Proceedings of the conference on advances in neural information processing systems (NIPS), pp 1999–2007

  • Xu Z, Yu K, Tresp V, Xu X, Wang J (2003) Representative sampling for text classification using support vector machines. Adv Inf Retr 393–407

  • Zaidan O, Eisner J, Piatko CD (2007) Using “annotator rationales” to improve machine learning for text categorization. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics, Human Language Technologies, pp 260–267

  • Zhang C, Chen T (2002) An active learning framework for content-based information retrieval. IEEE Trans Multimed 4(2):260–268

Acknowledgments

This material is based upon work supported by the National Science Foundation CAREER award no. IIS-1350337.

Author information

Corresponding author

Correspondence to Mustafa Bilgic.

Additional information

Responsible editor: Johannes Fuernkranz.

Appendix: Open-source active learning library: PyAL

We performed the experiments in this article using Weka: for naïve Bayes, we used Weka’s own implementation, and for logistic regression, we used Weka’s interface to LibLinear (Fan et al. 2008), version 1.7. We later rewrote the code in Python, integrated it with scikit-learn (Pedregosa et al. 2011), and released it as open source under the name PyAL.

We created a dedicated website for this project at http://www.cs.iit.edu/~ml/projects/empirical-study. The website currently has:

  • The Java libraries that are necessary to repeat the experiments performed in this paper

  • The synthetic datasets that were used in this study

  • The link to the GitHub repository for the PyAL library

  • A side-by-side comparison of the results obtained using the Java version of the code versus PyAL

1.1 Similarities and differences between PyAL and Weka results

We repeated the main set of experiments using PyAL and compared the results side by side with those we obtained using Weka. To save space, we placed the figures on the project website (http://www.cs.iit.edu/~ml/projects/empirical-study) and include here only the t test results in Tables 8, 9, and 10.

Table 8 UNC vs. RND using PyAL
Table 9 QBC vs. RND using PyAL
Table 10 QBC vs. UNC using PyAL

The actual win/tie/loss counts using Weka and PyAL are not identical, but they differ very little; hence the trends and the main results obtained using Weka also hold for PyAL. Below, we discuss some of the similarities and differences between the Weka and PyAL implementations.
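As an illustration of how a win/tie/loss entry can be derived, the sketch below applies a paired t test to per-trial scores of two strategies. It is not the authors' exact procedure: the function name `win_tie_loss`, the 0.05 significance level, and the sample AUC values are all assumptions made for this example.

```python
from scipy import stats


def win_tie_loss(scores_a, scores_b, alpha=0.05):
    """Label strategy A against strategy B using a paired t test.

    scores_a and scores_b hold one score per matched trial (same data
    split and same budget). alpha is an assumed significance level.
    """
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    # A non-significant (or undefined) p-value counts as a tie.
    if not (p_value < alpha):
        return "tie"
    return "win" if t_stat > 0 else "loss"


# Hypothetical per-trial AUC values, made up for illustration only.
unc_auc = [0.80, 0.82, 0.81, 0.83, 0.84]
rnd_auc = [0.70, 0.71, 0.69, 0.72, 0.73]
outcome = win_tie_loss(unc_auc, rnd_auc)
```

Counting these labels over all datasets, classifiers, and measures would give one row of a win/tie/loss table.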

1.1.1 Logistic regression results

The experiments in this paper used Weka’s interface to LibLinear version 1.7 for logistic regression. Scikit-learn’s logistic regression also uses LibLinear under the hood, and hence the logistic regression results are almost identical (modulo the random number sequences of Python vs. Java) for most datasets. The biggest visible difference occurs for the California Housing dataset, and we verified that this difference is due to the older version of the LibLinear port that Weka used.
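For readers who want a LibLinear-backed logistic regression in scikit-learn today, a minimal sketch follows. Recent scikit-learn releases default to a different solver, so LibLinear has to be requested explicitly with `solver="liblinear"`. The synthetic data and the regularization strength `C=1.0` are assumptions for this example and are not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Request the LibLinear solver explicitly; C=1.0 is an assumed setting.
clf = LogisticRegression(solver="liblinear", C=1.0, random_state=0)
clf.fit(X, y)

# Class-membership probabilities, which uncertainty-based strategies need.
proba = clf.predict_proba(X[:5])
```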

1.1.2 Naïve Bayes results

For naïve Bayes, scikit-learn has two generic naïve Bayes implementations: a Bernoulli naïve Bayes for datasets with binary features and a Gaussian naïve Bayes for datasets with continuous features. Weka, on the other hand, has a generic naïve Bayes implementation that can work with datasets that have mixed feature types: binary, continuous, and categorical (see Note 2).

For datasets that contain only binary features, scikit-learn’s and Weka’s naïve Bayes implementations are identical. All synthetic datasets and two real datasets, Hiva and Nova, have only binary features. For these datasets, the naïve Bayes results for PyAL and Weka are almost identical, except for minor differences caused by the random number sequences of Python vs. Java.

For datasets that contain only continuous features, both scikit-learn and Weka use Gaussian naïve Bayes. Six out of 10 real datasets have features that are all continuous (Table 2). The remaining two real datasets, KDD99 and Sylva, have mixed feature types. Weka’s implementation of naïve Bayes can handle a mix of feature types, whereas scikit-learn’s naïve Bayes implementations require all features to be either continuous or binary, and hence datasets need to be pre-processed to conform to one of these formats. For these datasets, there are visible differences between the PyAL and Weka learning curves, some of which are significant, though as the t test tables show, the general conclusions (e.g., RND being competitive in AUC) still hold.
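The pre-processing step for mixed feature types can be sketched as follows. The text does not say which conversion PyAL applies, so both options shown here (treating every column as continuous, or binarizing continuous columns at their median) are assumptions, as is the synthetic data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)

# Synthetic mixed data (assumption): 3 binary and 2 continuous columns.
X_bin = (rng.random((n, 3)) < (0.3 + 0.4 * y[:, None])).astype(float)
X_cont = rng.normal(loc=1.5 * y[:, None], scale=1.0, size=(n, 2))
X_mixed = np.hstack([X_bin, X_cont])

# Option A (assumed): treat every column as continuous.
gnb = GaussianNB().fit(X_mixed, y)

# Option B (assumed): binarize the continuous columns at their median.
X_all_bin = X_mixed.copy()
X_all_bin[:, 3:] = (X_cont > np.median(X_cont, axis=0)).astype(float)
bnb = BernoulliNB().fit(X_all_bin, y)
```

Either conversion changes the model relative to a mixed-type naïve Bayes, which is one plausible source of the differences in learning curves described above.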

1.2 PyAL library

The PyAL code consists of:

  • an active learning algorithm implementation, [figure a], that, given the parameters for an active learning session (such as the underlying classifier, an active learning strategy, and a budget), runs the session and evaluates the classifier at each step,

  • an active learning API, which provides the base classes for choosing a bootstrap, the base classes for choosing the next instance(s) to be labeled at each step of the labeling process, and implementations of a few active learning approaches,

  • a command-line interface, which reads the active learning settings from the command line, loads the dataset(s), runs the [figure b] code, plots the results, and saves the results to files,

  • and a GUI written in Tkinter as a visual alternative to the command-line interface.
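The session described in the first item can be sketched as a simple loop. This is not PyAL's code: the function name `run_session`, its parameters, the per-class bootstrap, and the use of uncertainty sampling with accuracy and AUC as the evaluated measures are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def run_session(clf, X_pool, y_pool, X_test, y_test,
                bootstrap_size=10, step=10, budget=50, seed=0):
    """Hypothetical pool-based session; assumes binary labels 0/1."""
    rng = np.random.default_rng(seed)
    # Bootstrap with an equal number of random instances per class.
    labeled = []
    for label in np.unique(y_pool):
        members = np.flatnonzero(y_pool == label)
        labeled.extend(rng.choice(members, size=bootstrap_size // 2, replace=False))
    unlabeled = np.setdiff1d(np.arange(len(y_pool)), labeled)

    history = []
    while True:
        # Retrain and evaluate at every step of the session.
        clf.fit(X_pool[labeled], y_pool[labeled])
        acc = accuracy_score(y_test, clf.predict(X_test))
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        history.append((len(labeled), acc, auc))
        if len(labeled) >= budget or len(unlabeled) == 0:
            break
        # Uncertainty sampling: query the least confident pool instances.
        k = min(step, budget - len(labeled), len(unlabeled))
        uncertainty = 1.0 - clf.predict_proba(X_pool[unlabeled]).max(axis=1)
        chosen = unlabeled[np.argsort(-uncertainty)[:k]]
        labeled.extend(chosen)
        unlabeled = np.setdiff1d(unlabeled, chosen)
    return history


X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
history = run_session(LogisticRegression(solver="liblinear"),
                      X_pool, y_pool, X_test, y_test)
```

Recording more than one measure per step matters here because, as the article argues, a gain on one measure can come at the expense of another.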

Currently implemented bootstrap strategies are (i) random sampling, where the initially labeled instances are chosen completely at random, and (ii) random sampling from each class, where an equal number of random instances is chosen from each class. The code can be extended with additional bootstrap strategies by extending the bootstrap class; for example, unsupervised batch-mode active learning strategies can be used to bootstrap the active learning process.
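The two bootstrap strategies can be sketched as a small class hierarchy. The class names `RandomBootstrap` and `PerClassBootstrap` and the `select` method are hypothetical and are not taken from PyAL.

```python
import numpy as np


class RandomBootstrap:
    """Hypothetical base class: pick the initial instances at random."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def select(self, y_pool, size):
        # Indices into the pool, drawn without replacement.
        return self.rng.choice(len(y_pool), size=size, replace=False)


class PerClassBootstrap(RandomBootstrap):
    """Pick an equal number of random instances from each class."""

    def select(self, y_pool, size):
        y_pool = np.asarray(y_pool)
        classes = np.unique(y_pool)
        per_class = size // len(classes)  # any remainder is dropped
        picks = [
            self.rng.choice(np.flatnonzero(y_pool == c), size=per_class, replace=False)
            for c in classes
        ]
        return np.concatenate(picks)


# Imbalanced toy labels (assumption): 95 negatives and 5 positives.
y_toy = np.array([0] * 95 + [1] * 5)
random_idx = RandomBootstrap(seed=1).select(y_toy, 4)
balanced_idx = PerClassBootstrap(seed=1).select(y_toy, 4)
```

On imbalanced data the per-class variant guarantees that the first classifier sees both classes, which purely random sampling does not.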

Currently implemented active learning approaches include uncertainty sampling (Lewis and Gale 1994), query-by-committee through bagging (Abe and Mamitsuka 1998), and expected error reduction (Roy and McCallum 2001). Additional active learning strategies can be implemented by extending the base strategy class.
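A base strategy class with two of these approaches might look like the sketch below. The class and method names are hypothetical rather than PyAL's, the committee is built by hand from bootstrap resamples, binary labels 0/1 are assumed, and expected error reduction is left out for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB


class Strategy:
    """Hypothetical base class: higher score means more worth labeling."""

    def scores(self, clf, X_lab, y_lab, X_pool):
        raise NotImplementedError

    def choose(self, clf, X_lab, y_lab, X_pool, k=1):
        return np.argsort(-self.scores(clf, X_lab, y_lab, X_pool))[:k]


class UncertaintySampling(Strategy):
    def scores(self, clf, X_lab, y_lab, X_pool):
        model = clone(clf).fit(X_lab, y_lab)
        # Least confident: one minus the top class probability.
        return 1.0 - model.predict_proba(X_pool).max(axis=1)


class QueryByBagging(Strategy):
    def __init__(self, n_members=10, seed=0):
        self.n_members = n_members
        self.rng = np.random.default_rng(seed)

    def scores(self, clf, X_lab, y_lab, X_pool):
        votes, members = np.zeros(len(X_pool)), 0
        for _ in range(self.n_members):
            idx = self.rng.integers(0, len(y_lab), size=len(y_lab))
            if len(np.unique(y_lab[idx])) < 2:
                continue  # skip resamples that lost a class
            votes += clone(clf).fit(X_lab[idx], y_lab[idx]).predict(X_pool)
            members += 1
        frac_positive = votes / max(members, 1)
        # 1.0 when the committee is evenly split, 0.0 when unanimous.
        return 1.0 - np.abs(2.0 * frac_positive - 1.0)


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
lab = np.concatenate([np.flatnonzero(y == 0)[:10], np.flatnonzero(y == 1)[:10]])
pool = np.setdiff1d(np.arange(len(y)), lab)
unc_pick = UncertaintySampling().choose(GaussianNB(), X[lab], y[lab], X[pool], k=5)
qbc_scores = QueryByBagging().scores(GaussianNB(), X[lab], y[lab], X[pool])
```

Keeping the selection logic in `choose` and only the scoring in subclasses means a new strategy needs to override a single method.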

Detailed documentation of the code, access to the GitHub repository, and Java executables can be found at http://www.cs.iit.edu/~ml/projects/empirical-study.

Cite this article

Ramirez-Loaiza, M.E., Sharma, M., Kumar, G. et al. Active learning: an empirical study of common baselines. Data Min Knowl Disc 31, 287–313 (2017). https://doi.org/10.1007/s10618-016-0469-7
