Abstract
An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have recently been considered, and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers that acknowledge feature dependencies, either through graphical models or through sparser alternatives defined as stratified graphical models. We show through experiments with both synthetic and real data that predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and predictive classifiers based solely on conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.
Acknowledgments
The authors would like to thank the editor and the anonymous reviewers for their constructive comments and suggestions on the original version of this paper. H.N. and J.P. were supported by the Foundation of Åbo Akademi University, as part of the grant for the Center of Excellence in Optimization and Systems Engineering. J.P. was also supported by the Magnus Ehrnrooth Foundation. J.X. and J.C. were supported by the ERC Grant No. 239784 and Academy of Finland Grant No. 251170. J.X. was also supported by the FDPSS graduate school.
Appendices
Appendix A: Proof of Theorem 1
To prove Theorem 1 it suffices to consider a single class \(k\) and a single maximal clique in \(G_L^k\). If the scores for the marginal and simultaneous classifiers are asymptotically equivalent for an arbitrary maximal clique and class, it automatically follows that the scores for the whole system are asymptotically equivalent. We start by considering the simultaneous classifier. The training data \({\mathbf {X}}^R\) and test data \({\mathbf {X}}^T\) are now assumed to cover only one maximal clique of an SG in one class. Expanding \(\log S_{\text {sim}}({\mathbf {X}}^T \mid {\mathbf {X}}^R)\) using (2) we get
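As a generic, hedged point of reference (the hyperparameters \(\alpha _{jil}\) and the exact arrangement of terms are illustrative and need not coincide with (2)), Dirichlet–multinomial marginal likelihoods of the kind underlying these predictive scores consist of Gamma-function ratios such as
\[
\prod _{j}\prod _{l}\frac{\Gamma \bigl (\sum _{i}\alpha _{jil}\bigr )}{\Gamma \bigl (\sum _{i}\alpha _{jil}+n(\pi _{j}^{l})\bigr )}\prod _{i}\frac{\Gamma \bigl (\alpha _{jil}+n(x_{j}^{i}\mid \pi _{j}^{l})\bigr )}{\Gamma \bigl (\alpha _{jil}\bigr )},
\]
evaluated at the relevant training or combined training and test counts; it is Gamma-function ratios of this type that the Stirling approximation below is applied to.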
Using Stirling’s approximation, \(\log \varGamma (x) \approx (x - 0.5) \log (x) - x\), this equals
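For instance (the symbols \(a\) and \(n\) are generic and introduced here only for illustration), applying the approximation to a single Gamma-function ratio gives
\[
\log \frac{\Gamma (a+n)}{\Gamma (a)}\;\approx \;\Bigl (a+n-\tfrac{1}{2}\Bigr )\log (a+n)-\Bigl (a-\tfrac{1}{2}\Bigr )\log a-n ,
\]
which is the elementary step applied term by term throughout the proof.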
When looking at the marginal classifier we need to sum over the single observations \({\mathbf {X}}_h^T\). We use \(h(\pi _{j}^{l})\) to indicate whether the outcome of the parents of variable \(X_j\) belongs to group \(l\), and \(h(x_{j}^{i} \mid \pi _{j}^{l})\) to indicate whether the outcome of \(X_j\) is \(i\) given that the observed outcome of the parents belongs to \(l\). Observing that \(h(\pi _{j}^{l})\) and \(h(x_{j}^{i} \mid \pi _{j}^{l})\) are either 0 or 1, we get the result
Considering the difference \(\log S_{\text {sim}}({\mathbf {X}}^T \mid {\mathbf {X}}^R) - \log S_{\text {mar}}({\mathbf {X}}^T \mid {\mathbf {X}}^R)\) results in
We now make the assumption that all limits of relative frequencies of feature values are strictly positive under an infinitely exchangeable sampling process of the training data, so that all hyperparameters \(\beta _{jil} \rightarrow \infty \) as the size of the training data \(m \rightarrow \infty \). Using the standard limit \(\lim _{y \rightarrow \infty } (1+x/y)^y = e^x\) results in
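As a sketch of how this limit enters (the symbols \(\beta \), \(n\) and \(c\) are generic and introduced here only for illustration), a typical term arising from the Stirling expansion behaves as
\[
(\beta +n)\log \frac{\beta +n+c}{\beta +n}=\log \Bigl (1+\frac{c}{\beta +n}\Bigr )^{\beta +n}\longrightarrow c\quad \text {as } \beta \rightarrow \infty ,
\]
so that terms of this type converge to finite constants which cancel in the difference of the simultaneous and marginal scores.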
Appendix B: Proof of Theorem 2
This proof follows largely the same structure as the proof of Theorem 1 and covers the simultaneous score. It is assumed that the underlying graph of the SGM coincides with the GM; this is a fair assumption, since as the size of the training data goes to infinity this property holds for the SGM and the GM maximizing the marginal likelihood. Again we consider only a single class \(k\) and a single maximal clique in \(G_L^k\), using the same reasoning as in the proof above. Additionally, it suffices to consider the score for the last variable \(X_d\) in the ordering, the variable corresponding to the node associated with all of the stratified edges, and a specific parent configuration \(l\) of the parents \(\Pi _d\) of \(X_d\). The equation for calculating the score for variables \(X_1, \ldots , X_{d-1}\) is identical using either the GM or the SGM. If the asymptotic equivalence holds for an arbitrary parent configuration, it automatically holds for all parent configurations. Under this setting we start by looking at the score for the SGM
which, using Stirling’s approximation and the same techniques as in the previous proof, equals
When studying the GM score we need to consider each outcome in the parent configuration \(l\) separately. Let \(h\) denote such an outcome in \(l\), and let \(q_l\) denote the total number of outcomes in \(l\). We then get the following score for the GM,
which, using calculations identical to those before, equals
Considering the difference \(\log S_{\text {SGM}}({\mathbf {X}}^T \mid {\mathbf {X}}^R) - \log S_{\text {GM}}({\mathbf {X}}^T \mid {\mathbf {X}}^R)\) we get
Under the assumption that \(\beta _{jil} \rightarrow \infty \) as \(m \rightarrow \infty \), the terms on the first and fourth rows sum to 0 in the limit. The remaining terms can be written
Noting that \(n(\pi _j^l) = \sum _{h=1}^{q_l} n(\pi _j^h)\) and \(n(x_j^i \mid \pi _j^l) = \sum _{h=1}^{q_l} n(x_j^i \mid \pi _j^h)\) we get
By investigating the definition of the \(\beta \) parameters in (3), in combination with the fact that the probability of observing the value \(i\) for variable \(X_j\) is identical for any parent outcome \(h\) comprising the group \(l\), we get the limits
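A hedged sketch of why such limits exist (the symbol \(p_{ji}\) is introduced here only for illustration and denotes the common conditional probability shared by all outcomes \(h\) in the group \(l\)): under the exchangeable sampling process of the training data, the relative frequencies obey
\[
\frac{n(x_{j}^{i}\mid \pi _{j}^{h})}{n(\pi _{j}^{h})}\longrightarrow p_{ji}
\quad \text {and}\quad
\frac{n(x_{j}^{i}\mid \pi _{j}^{l})}{n(\pi _{j}^{l})}=\frac{\sum _{h=1}^{q_l}n(x_{j}^{i}\mid \pi _{j}^{h})}{\sum _{h=1}^{q_l}n(\pi _{j}^{h})}\longrightarrow p_{ji},
\]
so the grouped and ungrouped relative frequencies share the same limit.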
Subsequently, as \(m \rightarrow \infty \), the difference \(\log S_{\text {SGM}}({\mathbf {X}}^T \mid {\mathbf {X}}^R) - \log S_{\text {GM}}({\mathbf {X}}^T \mid {\mathbf {X}}^R) \rightarrow 0\), which completes the proof.
Appendix C: List of political parties in the Finnish parliament
Table 4 lists the political parties elected to the Finnish parliament in the 2011 parliamentary election.
Cite this article
Nyman, H., Xiong, J., Pensar, J. et al. Marginal and simultaneous predictive classification using stratified graphical models. Adv Data Anal Classif 10, 305–326 (2016). https://doi.org/10.1007/s11634-015-0199-5