Model based clustering for mixed data: clustMD

  • Regular Article
Advances in Data Analysis and Classification

Abstract

A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
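
The generative idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the clustMD implementation: it assumes a two-cluster mixture, a diagonal latent covariance, a zero threshold for the binary variable and hand-picked cut points for the ordinal variable, none of which come from the article. The function name `simulate_mixed_data` and all parameter values are hypothetical. Nominal variables, which the abstract says require a Monte Carlo EM step, are left out.

```python
import random


def simulate_mixed_data(n, weights, means, sds, thresholds, seed=0):
    """Draw n rows of (cluster, continuous, binary, ordinal).

    Each cluster g has a 3-dimensional latent Gaussian with independent
    components (an illustrative simplification, assumed for this sketch).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Pick a mixture component according to the mixing weights.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        # Draw the latent vector from that component's Gaussian.
        z = [rng.gauss(m, s) for m, s in zip(means[g], sds[g])]
        continuous = z[0]                 # observed directly
        binary = int(z[1] > 0.0)          # latent value thresholded at 0
        # Ordinal level = number of cut points the latent value exceeds.
        ordinal = sum(z[2] > t for t in thresholds)
        rows.append((g, continuous, binary, ordinal))
    return rows


rows = simulate_mixed_data(
    n=500,
    weights=[0.6, 0.4],                       # assumed mixing proportions
    means=[[-2.0, -1.0, -1.0], [2.0, 1.0, 1.0]],
    sds=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    thresholds=[-0.5, 0.5],                   # assumed ordinal cut points
)
```

Clustering reverses this process: only the continuous, binary and ordinal columns are observed, and the cluster label and latent values must be inferred, which is where an EM algorithm comes in.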

Acknowledgments

The authors wish to thank the coordinating editor and reviewers for their comments, which greatly improved this work. The authors would also like to thank the members of the Working Group in Model Based Clustering and the members of the Working Group in Statistical Learning for helpful discussions. This work is supported by Science Foundation Ireland under the Research Frontiers Programme (09/RFP/MTH2367) and the Insight Research Centre (SFI/12/RC/2289).

Author information

Corresponding author

Correspondence to Isobel Claire Gormley.

Electronic supplementary material

Supplementary material 1 (pdf 684 KB)

About this article

Cite this article

McParland, D., Gormley, I.C. Model based clustering for mixed data: clustMD. Adv Data Anal Classif 10, 155–169 (2016). https://doi.org/10.1007/s11634-016-0238-x
