Abstract
The problem of Text Classification (TC) has been studied for decades. It is particularly interesting because the features are derived from syntactic or semantic indicators, while the classification itself is based on statistical Pattern Recognition (PR) strategies. Thus, all the recorded TC schemes work within the same fundamental paradigm: once the statistical features have been inferred from the syntactic/semantic indicators, the classifiers themselves are well-established ones such as the Bayesian, the Naïve Bayesian and the SVM, as well as those that are neural or fuzzy. In this paper, we demonstrate that, by virtue of the skewed distributions of the features, one can advantageously work with the information latent in certain “non-central” quantiles (i.e., those distant from the mean) of the distributions. We demonstrate that such classifiers exist and are attainable, and show that their design and implementation fall within the recently introduced paradigm of Quantile Statistics (QS)-based classifiers. (The foundational properties of CMQS, for generic and some straightforward distributions, were initially described in [17]. Their properties for uni-dimensional distributions of the exponential family are included in [9], and for multi-dimensional distributions in [18]. The authors of [17], [9] and [18] had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were, rather, based on their Quantile Statistics.) These classifiers, referred to as Classification by Moments of Quantile Statistics (CMQS), are essentially “Anti”-Bayesian in their modus operandi. To achieve our goal, we demonstrate the power and potential of CMQS to describe the very high-dimensional TC-related vector spaces in terms of a limited number of “outlier-based” statistics.
Thereafter, the PR task in classification invokes the CMQS classifier for the underlying multi-class problem by using a linear number of pair-wise CMQS-based classifiers. Through rigorous testing on the standard 20-Newsgroups corpus, we show that CMQS-based TC attains an accuracy comparable to that of the best-reported classifiers. We also discuss the potential of fusing the results of a CMQS-based methodology with those obtained from a more traditional scheme.
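The idea of classifying by "non-central" quantiles rather than by central moments can be illustrated with a minimal one-dimensional, two-class sketch. It is only an illustration under stated assumptions, not the scheme evaluated in the chapter: the quantile level q = 1/3, the nearest-point decision rule, and the helper names quantile, fit_quantile_points and classify are all hypothetical choices made here. Each class is represented by the quantile point that faces the other class, and a test value is assigned to the class whose point is nearer.

```python
from typing import Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical q-quantile (0 <= q <= 1) using linear interpolation."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def fit_quantile_points(
    class_a: Sequence[float], class_b: Sequence[float], q: float = 1 / 3
) -> Tuple[float, float]:
    """Pick one non-central quantile per class (assumed level q).

    The class with the smaller mean is represented by its (1 - q) quantile
    and the other class by its q quantile, so that both representative
    points lie on the sides of the distributions that face each other.
    """
    mean_a = sum(class_a) / len(class_a)
    mean_b = sum(class_b) / len(class_b)
    if mean_a <= mean_b:
        return quantile(class_a, 1 - q), quantile(class_b, q)
    return quantile(class_a, q), quantile(class_b, 1 - q)


def classify(x: float, point_a: float, point_b: float) -> str:
    """Assign x to the class whose quantile point is nearer."""
    return "A" if abs(x - point_a) <= abs(x - point_b) else "B"


# Toy feature values for two classes (invented for illustration only).
train_a = [0, 1, 2, 3, 4, 5, 6]
train_b = [10, 11, 12, 13, 14, 15, 16]
point_a, point_b = fit_quantile_points(train_a, train_b)
label_low = classify(7, point_a, point_b)
label_high = classify(9, point_a, point_b)
```

For a multi-class problem, the abstract describes combining pair-wise classifiers of this kind; how the pair-wise decisions are combined is not specified here, so the sketch stops at the two-class case.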
The authors are grateful for the partial support provided by NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this paper was presented at ICCCI’15, the 2015 International Conference on Computational Collective Intelligence Technologies and Applications, in Madrid, Spain, in September 2015. The paper was a Plenary/Keynote Talk at the conference. The first author is also an Adjunct Professor with the University of Agder in Grimstad, Norway.
Notes
- 1.
SMART is an abbreviation for Salton’s Magic Automatic Retriever of Text.
- 2.
The formal definitions for the TF and the TFIDF are given in Sect. 4.3.
- 3.
Since the static TFIDF weighting scheme presented above becomes inefficient when the system has documents that are continuously arriving, for example, systems used for online detection, the literature also reports the use of the Adaptive TFIDF. The Adaptive IDF can be efficiently used for document retrieval after a sufficient number of “past” documents have been processed. The initial IDF values are calculated using a retrospective corpus of documents, and these IDF values are then updated incrementally. The literature also reports other metrics of comparison, such as the Jaccard similarity, but since this is not the primary concern of this paper, we will not elaborate on these here.
- 4.
- 5.
As mentioned earlier, the authors of [17], [9] and [18] (cited in their chronological order) had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were, rather, based on their Quantile Statistics.
- 6.
- 7.
In all the cases, they worked with the assumption that the a priori distributions were identical.
- 8.
The documents used in this test were very short, which explains why the histograms are heavily skewed in favour of lower word frequencies.
- 9.
Given that these extreme points give better results in the next experiment when we classify using the TFIDF criteria (instead of merely the TF criteria), we hypothesize that this poor behavior is probably due to noise from non-significant words that is somehow amplified in the extreme CMQS points. But this issue is still unresolved.
References
Alahmadi, A., Joorabchi, A., Mahdi, A.E.: A new text representation scheme combining bag-of-words and bag-of-concepts approaches for automatic text classification. In: Proceedings of the 7th IEEE GCC Conference and Exhibition, Doha, Qatar, pp. 108–113, November 2014
Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Proceedings of the 18th ACM Symposium on Applied Computing, Melbourne, USA, pp. 784–788, March 2003
Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. A Wiley Interscience Publication, New York (2006)
Dumoulin, J.: Smoothing of n-gram language models of human chats. In: Proceedings of the Joint 6th International Conference on Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS), Kobe, Japan, pp. 1–4, November 2012
Lu, L., Liu, Y.-S.: Research of English text classification methods based on semantic meaning. In: Proceedings of the ITI 3rd International Conference on Information and Communications Technology, Cairo, Egypt, pp. 689–700, December 2005
Madsen, R.E., Sigurdsson, S., Hansen, L.K., Larsen, J.: Pruning the vocabulary for better context recognition. In: Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, vol. 2, pp. 483–488, August 2004
Menon, R., Keerthi, S.S., Loh, H.T., Brombacher, A.C.: On the effectiveness of latent semantic analysis for the categorization of call centre records. In: Proceedings of the IEEE International Engineering Management Conference, Singapore, vol. 2, pp. 545–550 (2004)
Ning, Y., Zhu, T., Wang, Y.: Affective-word based Chinese text sentiment classification. In: Proceedings of the 5th International Conference on Pervasive Computing and Applications (ICPCA), Maribor, Slovenia, pp. 111–115, December 2010
Oommen, B.J., Thomas, A.: Optimal order statistics-based “Anti-Bayesian” parametric pattern classification for the exponential family. Pattern Recogn. 47, 40–55 (2014)
Ouamour, S., Sayoud, H.: Authorship attribution of ancient texts written by ten Arabic travelers using character N-Grams. In: Proceedings of the 2013 International Conference on Computer, Information and Telecommunication Systems (CITS), Piraeus-Athens, Greece, pp. 1–5, May 2013
Qiang, G.: An effective algorithm for improving the performance of Naïve Bayes for text classification. In: Proceedings of the Second International Conference on Computer Research and Development, Kuala Lumpur, Malaysia, pp. 699–701, May 2010
Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York (1983)
Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18, 613–620 (1975)
Salton, G., Yang, C.S., Yu, C.: A theory of term importance in automatic text analysis. Technical report, Ithaca, NY, USA (1974)
Salton, G., Yang, C.S., Yu, C.: Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA (1987)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
Thomas, A., Oommen, B.J.: The fundamental theory of optimal “Anti-Bayesian” parametric pattern classification using order statistics criteria. Pattern Recogn. 46, 376–388 (2013)
Thomas, A., Oommen, B.J.: Order statistics-based parametric classification for multi-dimensional distributions. Pattern Recogn. 46, 3472–3482 (2013)
Thomas, A., Oommen, B.J.: Corrigendum to three papers that deal with “Anti”-Bayesian pattern recognition. Pattern Recogn. 47, 2301–2302 (2014)
Thomas, A., Oommen, B.J.: A novel border identification algorithm based on an “Anti-Bayesian” paradigm. In: Proceedings of CAIP’13, the 2013 International Conference on Computer Analysis of Images and Patterns, York, UK, pp. 196–203, August 2013
Thomas, A., Oommen, B.J.: Ultimate order statistics-based prototype reduction schemes. In: Proceedings of AI 2013, The 2013 Australasian Joint Conference on Artificial Intelligence, Dunedin, New Zealand, pp. 421–433, December 2013
Wu, G., Liu, K.: Research on text classification algorithm by combining statistical and ontology methods. In: Proceedings of the International Conference on Computational Intelligence and Software Engineering, Wuhan, China, pp. 1–4, December 2009
Copyright information
© 2016 Springer-Verlag GmbH Germany
About this chapter
Cite this chapter
Oommen, B.J., Khoury, R., Schmidt, A. (2016). Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers. In: Nguyen, N., Kowalczyk, R., Orłowski, C., Ziółkowski, A. (eds) Transactions on Computational Collective Intelligence XXV. Lecture Notes in Computer Science, vol 9990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53580-6_7
DOI: https://doi.org/10.1007/978-3-662-53580-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53579-0
Online ISBN: 978-3-662-53580-6
eBook Packages: Computer Science (R0)