Abstract
This paper discusses how to merge Gaussian mixture components when a Gaussian mixture has been fitted but its components are not separated enough from each other to be interpreted as “clusters”. The problem of merging Gaussian mixture components is not statistically identifiable, so merging algorithms have to be based on subjective cluster concepts. Two kinds of cluster concept are distinguished: those based on unimodality and those based on misclassification probabilities (“patterns”). Several hierarchical merging methods are proposed for the different cluster concepts, based on the ridgeline analysis of the modality of Gaussian mixtures, the dip test, the Bhattacharyya dissimilarity, a direct estimator of misclassification, and the strength of predicting pairwise cluster memberships. The methods are compared in a simulation study and in applications to two real datasets. A new method for visualising the separation of Gaussian mixture components, the ordered posterior plot, is also introduced.
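To make the idea of hierarchical merging concrete, the following is a minimal sketch of merging driven by the Bhattacharyya dissimilarity, restricted to univariate components for brevity. It is an illustration under stated assumptions, not the procedure from the paper: the function names, the representation of a component as a `(weight, mean, variance)` tuple, the moment-matching rule for a merged pair, and the cutoff value `0.1` are all hypothetical choices made for this example. The quantity `exp(-d)` is the Bhattacharyya coefficient, which is related to an upper bound on the misclassification probability between two Gaussians.

```python
import math


def bhattacharyya_distance(m1, v1, m2, v2):
    """Bhattacharyya distance between N(m1, v1) and N(m2, v2); v is a variance."""
    v_bar = 0.5 * (v1 + v2)
    mean_term = (m1 - m2) ** 2 / (8.0 * v_bar)
    var_term = 0.5 * math.log(v_bar / math.sqrt(v1 * v2))
    return mean_term + var_term


def merge_moments(c1, c2):
    """Single Gaussian matching the mean and variance of a two-component mixture.

    Assumption for this sketch: a merged pair is summarised by its moments.
    """
    (w1, m1, v1), (w2, m2, v2) = c1, c2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return (w, m, v)


def merge_components(components, cutoff=0.1):
    """Repeatedly merge the pair with the largest exp(-distance).

    Stops when the largest value falls below `cutoff` (a hypothetical
    tuning constant) or when only one component is left.
    """
    comps = list(components)
    while len(comps) > 1:
        best = None
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                d = bhattacharyya_distance(
                    comps[i][1], comps[i][2], comps[j][1], comps[j][2]
                )
                affinity = math.exp(-d)
                if best is None or affinity > best[0]:
                    best = (affinity, i, j)
        affinity, i, j = best
        if affinity < cutoff:
            break
        merged = merge_moments(comps[i], comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    return comps


# Two heavily overlapping components near 0 and one well-separated component.
fitted = [(0.3, 0.0, 1.0), (0.3, 0.5, 1.0), (0.4, 10.0, 1.0)]
clusters = merge_components(fitted)
```

In this toy input the two components near zero are merged into one cluster, while the component centred at 10 stays separate. A multivariate version would replace the variances by covariance matrices and the scalar formula by its matrix form.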
Cite this article
Hennig, C. Methods for merging Gaussian mixture components. Adv Data Anal Classif 4, 3–34 (2010). https://doi.org/10.1007/s11634-010-0058-3
DOI: https://doi.org/10.1007/s11634-010-0058-3