Abstract
A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
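As a rough illustration of the generative idea in the abstract, the Python sketch below draws a latent vector from a mixture of Gaussians and maps it to observed variables of different types. It is not the authors' implementation. The function name `simulate_mixed`, the thresholds, and the choice of one continuous, one binary, and one ordinal variable are assumptions made for illustration. Nominal variables and the six covariance structures are omitted.

```python
import numpy as np


def simulate_mixed(n, weights, means, covs, thresholds, seed=0):
    """Draw mixed-type data from a latent Gaussian mixture (illustrative only).

    Latent dimension 0 is observed directly (continuous), dimension 1 is
    cut at zero (binary), and dimension 2 is cut at `thresholds` (ordinal).
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_clusters = len(weights)

    # Cluster membership for each observation.
    labels = rng.choice(n_clusters, size=n, p=weights)

    # Latent Gaussian draw, conditional on cluster membership.
    latent = np.empty((n, len(means[0])))
    for g in range(n_clusters):
        idx = labels == g
        latent[idx] = rng.multivariate_normal(means[g], covs[g], size=int(idx.sum()))

    continuous = latent[:, 0]                              # observed as is
    binary = (latent[:, 1] > 0).astype(int)                # 0/1 by sign
    ordinal = np.digitize(latent[:, 2], thresholds) + 1    # levels 1..len(thresholds)+1
    return labels, continuous, binary, ordinal


# Hypothetical two-cluster example with well separated latent means.
labels, cont, binr, ordl = simulate_mixed(
    n=500,
    weights=[0.5, 0.5],
    means=[[-2.0, -2.0, -2.0], [2.0, 2.0, 2.0]],
    covs=[np.eye(3), np.eye(3)],
    thresholds=[-1.0, 1.0],
)
```

Clustering then amounts to recovering the mixture parameters and memberships from the observed variables alone, which is where the EM algorithm mentioned in the abstract comes in.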
Acknowledgments
The authors wish to thank the coordinating editor and reviewers for their comments, which greatly improved this work. The authors would also like to thank the members of the Working Group in Model Based Clustering and the members of the Working Group in Statistical Learning for helpful discussions. This work is supported by Science Foundation Ireland under the Research Frontiers Programme (09/RFP/MTH2367) and the Insight Research Centre (SFI/12/RC/2289).
Cite this article
McParland, D., Gormley, I.C. Model based clustering for mixed data: clustMD. Adv Data Anal Classif 10, 155–169 (2016). https://doi.org/10.1007/s11634-016-0238-x
DOI: https://doi.org/10.1007/s11634-016-0238-x