Abstract
Most studies on streaming data classification assume that the data can be fully labeled. In real-life applications, however, manually labeling an entire stream for training is impractical and time-consuming. In data stream environments, it is far more common that only a small set of positive examples and a large amount of unlabeled data are available. In this case, applying traditional streaming algorithms to a positive and unlabeled stream with only straightforward adaptation may perform poorly. In this paper, we propose a Dynamic Classifier Ensemble method for Positive and Unlabeled text stream (DCEPU) classification. We address the problem of classifying positive and unlabeled text streams under various kinds of concept drift by constructing an appropriate validation set and designing a novel dynamic weighting scheme for the classification phase. Experimental results on the benchmark dataset RCV1-v2 demonstrate that the proposed DCEPU method outperforms the existing LELC (Li et al. 2009b), DVS (with necessary adaptation) (Tsymbal et al. 2008), and Stacking-style ensemble-based algorithm (Zhang et al. 2008b).
References
Calvo B, Larranaga P, Lozano JA (2005) Learning bayesian classifiers from positive and unlabeled examples. Pattern Recognit Lett 28(16): 2375–2384
Cheng R, Kalashnikov D, Prabhakar S (2005) Learning from positive and unlabeled examples. Theor Comput Sci 38(1): 70–83
Didaci L, Giacinto G, Roli F, Marcialis GL (2005) A study on the performances of dynamic classifier selection based on local accuracy estimation. Pattern Recognit 38(11): 2188–2191
Dietterich TG (2002) Ensemble methods in machine learning. In: Proceedings of the first international workshop on multiple classifier systems, pp 1–15
Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’00). Boston, pp 71–80
Fan W (2004) Systematic data selection to mine concept-drifting data streams. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), ACM Press, pp 128–137
Fan W, Huang YA, Wang H, Yu PS (2004a) Active mining of data streams. In: Proceedings of the fourth SIAM international conference on data mining (SDM’04), pp 457–461
Fan W, Huang YA, Yu PS (2004b) Decision tree evolution using limited number of labeled data items from drifting data streams. In: Proceedings of the fourth IEEE international conference on data mining (ICDM’04), pp 379–382
Fung GPC, Yu JX, Lu H, Yu PS (2006) Text classification without negative examples revisit. IEEE Trans Knowl Data Eng 18(1): 6–20
Grossi V, Turini F (2010) Stream mining: a novel architecture for ensemble-based classification. Knowl Inf Syst: 1–35. doi:10.1007/s10115-011-0378-4
Huang S, Dong Y (2007) An active learning system for mining time-changing data streams. Intell Data Anal 11(4): 401–419
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 97–106
Klinkenberg R, Joachims T (2000) Detecting concept drift with support vector machines. In: Proceedings of the seventeenth international conference on machine learning (ICML’00), pp 487–494
Ko A, Sabourin R, Britto A Jr (2008) From dynamic classifier selection to dynamic ensemble selection. Pattern Recognit 41(5): 1718–1731
Kolter JZ, Maloof MA (2003) Dynamic weighted majority: a new ensemble method for tracking concept drift. In: Proceedings of the third international conference on data mining (ICDM’03), pp 123–130
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5: 361–397
Li C, Zhang Y, Li X (2009a) OcVFDT: one-class very fast decision tree for one-class classification of data streams. In: Proceedings of the third international workshop on knowledge discovery from sensor data. Paris, pp 79–86
Li X, Liu B (2003) Learning to classify texts using positive and unlabeled data. In: International joint conference on artificial intelligence (IJCAI’03), pp 587–594
Li X, Liu B (2005) Learning from positive and unlabeled examples with different data distributions. In: Proceedings of European conference on machine learning (ECML’05), pp 218–229
Li XL, Yu PS, Liu B, Ng SK (2009b) Positive unlabeled learning for data stream classification. In: Proceedings of the ninth SIAM international conference on data mining (SDM’09), pp 257–268
Liu B, Lee WS, Yu PS, Li X (2002) Partially supervised classification of text documents. In: Proceedings of the nineteenth international conference on machine learning (ICML’02)
Liu B, Dai Y, Li X, Lee WS, Yu PS (2003) Building text classifiers using positive and unlabeled examples. In: Proceedings of the third IEEE international conference on data mining (ICDM’03), pp 179–186
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523
Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7): 1443–1471
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1): 1–47
Street W, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the seventh international conference on knowledge discovery and data mining (KDD’01), pp 377–382
Tsymbal A, Pechenizkiy M, Cunningham P, Puuronen S (2008) Dynamic integration of classifiers for handling concept drift. Inf Fusion 9(1): 56–68
Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth international conference on knowledge discovery and data mining (KDD’03), pp 226–235
Widmer G, Kubat M (1993) Effective learning in dynamic environments by explicit context tracking. In: European conference on machine learning (ECML’93). Springer, Berlin, pp 227–243
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1): 69–101
Widyantoro D, Yen J (2005) Relevant data expansion for learning concept drift from sparsely labeled data. IEEE Trans Knowl Data Eng 17(3): 401–412
Woods K, Kegelmeyer WP Jr, Bowyer K (1997) Combination of multiple classifiers using local accuracy estimates. IEEE Trans Pattern Anal Mach Intell 19(4): 405–410
Wozniak M (2010) A hybrid decision tree training method using data streams. Knowl Inf Syst: 1–13. doi:10.1007/s10115-010-0345-5
Wu S, Yang C, Zhou J (2006) Clustering-training for data stream mining. In: Proceedings of the sixth IEEE international conference on data mining workshops (ICDMW’06), pp 653–656
Yu H, Han J, Chang KCC (2004) PEBL: web page classification without negative examples. IEEE Trans Knowl Data Eng 16(1):70–81
Zhang B, Zuo W (2008) Learning from positive and unlabeled examples: a survey. In: International symposiums on information processing, IEEE Computer Society, Los Alamitos, pp 650–654
Zhang P, Zhu X, Shi Y (2008a) Categorizing and mining concept drifting data streams. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08). Las Vegas, pp 812–820
Zhang Y, Jin X (2006) An automatic construction and organization strategy for ensemble learning on data streams. ACM SIGMOD Rec 35(3): 28–33
Zhang Y, Li X, Orlowska M (2008b) One-class classification of text streams with concept drift. In: Proceedings of the 2008 IEEE international conference on data mining workshops (ICDMW’08), pp 116–125
Zhou ZH, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artif Intell 137(1–2): 239–263
Zhu X, Wu X, Yang Y (2006) Effective classification of noisy data streams with attribute-oriented dynamic classifier selection. Knowl Inf Syst 9(3): 339–363
Zhu X, Zhang P, Lin X, Shi Y (2007) Active learning from data streams. In: Proceedings of the seventh international conference on data mining (ICDM’07), pp 757–762
Zhu X, Ding W, Yu P, Zhang C (2010) One-class learning and concept summarization for data streams. Knowl Inf Syst: 1–31. http://dx.doi.org/10.1007/s10115-010-0331-y
Cite this article
Pan, S., Zhang, Y. & Li, X. Dynamic classifier ensemble for positive unlabeled text stream classification. Knowl Inf Syst 33, 267–287 (2012). https://doi.org/10.1007/s10115-011-0469-2
DOI: https://doi.org/10.1007/s10115-011-0469-2