Abstract
Cluster-weighted models are a convenient approach to model-based clustering, especially when the covariates contribute to defining the cluster structure of the data. However, their applicability may be limited when the number of covariates is high, and their performance may be affected by noise and outliers. To overcome these problems, common/uncommon \(t\)-factor analyzers for the covariates, and a \(t\)-distribution for the response variable, are here assumed in each mixture component. A family of twenty parsimonious variants of this model is also presented, and the alternating expectation-conditional maximization (AECM) algorithm for maximum likelihood estimation of the parameters of all models in the family is described. Artificial and real data show that these models have very good clustering performance and that the algorithm recovers the parameters very well.



References
Airoldi J, Hoffmann R (1984) Age variation in voles (Microtus californicus, M. ochrogaster) and its significance for systematic studies. Occasional papers of the Museum of Natural History 111, University of Kansas, Lawrence, KS
Aitken AC (1926) On Bernoulli’s numerical solution of algebraic equations. In: Proceedings of the Royal Society of Edinburgh, vol 46, pp 289–305
Andrews JL, McNicholas PD (2011) Extending mixtures of multivariate \(t\)-factor analyzers. Stat Comput 21(3):361–373
Baek J, McLachlan GJ (2011) Mixtures of common \(t\)-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27(9):1269–1276
Baek J, McLachlan GJ, Flack LK (2010) Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of high-dimensional data. IEEE Trans Pattern Anal Mach Intell 32(7):1298–1309
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725
Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay B (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann Inst Stat Math 46(2):373–388
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38
DeSarbo WS, Cron WL (1988) A maximum likelihood methodology for clusterwise linear regression. J Classif 5(2):249–282
Ehrlich I (1973) Participation in illegitimate activities: a theoretical and empirical investigation. J Polit Econ 81(3):521–565
Flury B (1997) A first course in multivariate statistics. Springer, New York
Gershenfeld N (1997) Nonlinear inference and cluster-weighted modeling. Ann NY Acad Sci 808(1):18–24
Grün B, Leisch F (2008) FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters. J Stat Softw 28(4):1–35
Hennig C (2000) Identifiability of models for clusterwise linear regression. J Classif 17(2):273–296
Ingrassia S, Minotti SC, Vittadini G (2012) Local statistical modeling via the cluster-weighted approach with elliptical distributions. J Classif 29(3):363–401
Ingrassia S, Minotti SC, Punzo A (2014) Model-based clustering via linear cluster-weighted models. Comput Stat Data Anal 71:159–182
Ingrassia S, Punzo A, Vittadini G (2015) The generalized linear mixed cluster-weighted model. J Classif 32 (in press)
Leisch F (2004) FlexMix: a general framework for finite mixture models and latent class regression in \({\sf R}\). J Stat Softw 11(8):1–18
Lindsay BG (1995) Mixture models: theory, geometry and applications. In: NSF-CBMS regional conference series in probability and statistics, vol 5. Institute of Mathematical Statistics, Hayward
Lo Y (2008) A likelihood ratio test of a homoscedastic normal mixture against a heteroscedastic normal mixture. Stat Comput 18(3):233–240
McLachlan GJ (1987) On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. J R Stat Soc Ser C (Appl Stat) 36(3):318–324
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
McLachlan GJ, Bean RW, Ben-Tovim Jones L (2007) Extension of the mixture of factor analyzers model to incorporate the multivariate \(t\)-distribution. Comput Stat Data Anal 51(11):5327–5338
Meng XL, van Dyk D (1997) The EM algorithm—an old folk-song sung to a fast new tune. J R Stat Soc Ser B (Stat Methodol) 59(3):511–567
Punzo A (2014) Flexible mixture modeling with the polynomial Gaussian cluster-weighted model. Stat Model 14(3):257–291
Punzo A, Ingrassia S (2015) Parsimonious generalized linear Gaussian cluster-weighted models. In: Morlini I, Minerva T, Palumbo F (eds) Advances in statistical models for data analysis, studies in classification, data analysis and knowledge organization. Springer, Switzerland
Punzo A, Browne RP, McNicholas PD (2014) Hypothesis testing for parsimonious Gaussian mixture models. http://arxiv.org/abs/1405.0377
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Sakamoto Y, Ishiguro M, Kitagawa G (1983) Akaike information criterion statistics. Reidel, Boston
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Subedi S, Punzo A, Ingrassia S, McNicholas PD (2013) Clustering and classification via cluster-weighted factor analyzers. Adv Data Anal Classif 7(1):5–40
Vandaele W (1987) Participation in illegitimate activities: Ehrlich revisited. Report, U.S. Department of Justice, National Institute of Justice
Appendices
Appendix 1: Estimation of the parameters for the linear \(t\) CWM with \(t\)-factor analyzers using the AECM algorithm
Let \({\mathcal {S}}=\{({\varvec{x}}_i',y_i)';i=1,\ldots ,n\}\) be a sample of size \(n\). In the EM framework (Dempster et al. 1977), the generic observation \(({\varvec{x}}_i',y_i)'\) is viewed as being incomplete and its complete counterpart is given by \(({\varvec{x}}_i',y_i,{\varvec{u}}_{ig}',w_{ig},\varvec{z}_i')',\) where \(\varvec{z}_i\) is the component-label vector such that \(z_{ig}=1\) if \(({\varvec{x}}_i',y_i)'\) comes from component \(g\) and \(z_{ig}=0\) otherwise. The idea of the AECM algorithm (Meng and van Dyk 1997) is to partition \({\varvec{\theta }}\), say \({\varvec{\theta }}=({\varvec{\theta }}'_1,{\varvec{\theta }}'_2)'\), in such a way that the likelihood is easy to maximize for \({\varvec{\theta }}_1\) given \({\varvec{\theta }}_2\) and vice versa. The AECM algorithm consists of two cycles, with an E-step and a CM-step for each cycle. The two CM-steps correspond to the partition of \({\varvec{\theta }}\) into \({\varvec{\theta }}_1\) and \({\varvec{\theta }}_2\). The algorithm is iterated until convergence.
1.1 First cycle
Here, \({\varvec{\theta }}_1=\{\pi _g,{\varvec{\mu }}_g,{\varvec{\beta }}_g,\sigma ^2_g,\nu _g;g=1, \ldots , G\}\), where the missing data are the unobserved group labels \(\varvec{z}_i\), \(i=1,\ldots ,n\), and \(w_{ig}\). The complete-data likelihood is
The E-step on the first cycle of the \((k+1)\)th iteration requires the calculation of \(Q_1({\varvec{\theta }}_1; {\varvec{\theta }}^{(k)}) = {\mathbb {E}}_{{\varvec{\theta }}^{(k)}}\left[ l_{c1} ({\varvec{\theta }}_1)|{\mathcal {S}}\right] \), the expected complete-data log-likelihood given the observed data and \({\varvec{\theta }}^{(k)}\). In practice, it requires calculating \({\mathbb {E}}_{{\varvec{\theta }}^{(k)}}\left[ Z_{ig}|{\mathcal {S}}\right] \) and \({\mathbb {E}}_{{\varvec{\theta }}^{(k)}}\left[ W_{ig}|{\mathcal {S}}\right] \); this is achieved by replacing, respectively, each \(z_{ig}\) by \(z_{ig}^{(k+1)} = \frac{\tau _{ig}}{\sum _{j=1}^G\tau _{ij}},\) where \(\tau _{ig}\) is the product of \(\pi _g^{(k)}\) and the \(g\)th component joint density of \(({\varvec{x}}_i',y_i)'\) evaluated at \({\varvec{\theta }}^{(k)}\),
and each \(w_{ig}\) by
where \({\varvec{\varSigma }}^{(k)}_g={\varvec{\varLambda }}_g^{(k)}{\varvec{\varLambda }}_g^{(k)'}+{\varvec{\varPsi }}_g^{(k)}\). For the M-step, the maximization of \(L_{c1}({\varvec{\theta }}_1)\) yields the updates \(\pi _g^{(k+1)}\), \({\varvec{\mu }}_g^{(k+1)}\), \({\varvec{\beta }}_g^{(k+1)}\), and \(\sigma ^{2(k+1)}_g\).
The update for the degrees of freedom \(\nu _g\) is not available in closed form. McLachlan et al. (2007) show that \(\nu _g^{(k+1)}\) is the solution of the equation
where \(\psi (\cdot )\) denotes the digamma function and \(n_g^{(k+1)}=\sum _{i=1}^nz_{ig}^{(k+1)}\). Following the notation in McLachlan and Peel (2000), we set \({\varvec{\theta }}^{(k+1/2)}=\{{\varvec{\theta }}_1^{(k+1)},{\varvec{\theta }}_2^{(k)}\}\).
1.2 Second cycle
Here, \({\varvec{\theta }}_2=\left\{ {\varvec{\varSigma }}_g ; g=1, \ldots , G \right\} = \left\{ {\varvec{\varLambda }}_g,{\varvec{\varPsi }}_g;g=1, \ldots , G\right\} \), where the missing data are the unobserved group labels \(\varvec{z}_i\) and the latent factors \(\varvec{u}_{ig}\), \(i=1,\ldots ,n\) and \(g=1,\ldots ,G\). Therefore, the complete-data likelihood is
where \(Y\) is conditionally independent of \({\varvec{U}}\) given \({\varvec{X}}={\varvec{x}}\).
The E-step on the second cycle of the \((k+1)\)th iteration requires the calculation of \(Q_2({\varvec{\theta }}_2; {\varvec{\theta }}^{(k+1/2)}) = {\mathbb {E}}_{{\varvec{\theta }}^{(k+1/2)}} \left[ l_{c2} ({\varvec{\theta }}_2)|{\mathcal {S}}\right] \). This involves calculating the following conditional expectations: \({\mathbb {E}}_{{\varvec{\theta }}^{(k+1/2)}}(Z_{ig}|{\mathcal {S}})\), \({\mathbb {E}}_{{\varvec{\theta }}^{(k+1/2)}}(W_{ig}|{\mathcal {S}})\), \({\mathbb {E}}_{{\varvec{\theta }}^{(k+1/2)}}(Z_{ig} W_{ig}{\varvec{U}}_{ig} | {\mathcal {S}})\), and \({\mathbb {E}}_{{\varvec{\theta }}^{(k+1/2)}} (Z_{ig}W_{ig}{\varvec{U}}_{ig} {\varvec{U}}'_{ig} |{\mathcal {S}})\). We have
and
where
Hence, the \(g\)th term of the expected complete-data log-likelihood \(Q_2({\varvec{\theta }}_2; {\varvec{\theta }}^{(k+1/2)})\) is
where \(\text {C}({\varvec{\theta }}_1^{(k+1)})\) includes the terms in the complete-data log-likelihood that do not depend on \({\varvec{\theta }}_2\). Maximization of (8) with respect to \({\varvec{\varLambda }}_g\) and \({\varvec{\varPsi }}_g\) then requires that they satisfy
Therefore,
From (9), we get \({\varvec{\varLambda }}_g^{(k+1)} = {\varvec{S}}^{(k+1)}_g {\varvec{\gamma }}^{(k)'}_g\left( {\varvec{\varTheta }}_g^{(k)}\right) ^{-1}\) and, substituting in (10), we obtain
which yields \( {\varvec{\varPsi }}_g^{(k+1)} =\text {diag} ( {\varvec{S}}^{(k+1)}_g- {\varvec{\varLambda }}_g^{(k+1)} {\varvec{\gamma }}^{(k)}_g{\varvec{S}}^{(k+1)}_g ). \)
Appendix 2: Estimation of the parameters for the linear \(t\) CWM with common \(t\)-factor analyzers using the AECM algorithm
As in “Appendix 1”, the AECM algorithm consists of two cycles, each with an E-step and a CM-step, iterated until convergence.
1.1 First cycle
Here, \({\varvec{\theta }}_1=\left\{ \pi _g,{\varvec{\xi }}_g,{\varvec{\beta }}_g,\sigma ^2_g,\nu _g;g=1, \ldots , G\right\} \), where the missing data are the unobserved group labels \(\varvec{z}_i\), \(i=1,\ldots ,n\), and the characteristic weights \(w_{ig}\). The complete-data likelihood is
where \({\varvec{\varSigma }}_g={\varvec{\varLambda }}{\varvec{\varOmega }}_g{\varvec{\varLambda }}'+{\varvec{\varPsi }}\).
As in the “First cycle” section of Appendix 1, \({\mathbb {E}}_{{\varvec{\theta }}^{(k)}}\left[ Z_{ig}|{\mathcal {S}}\right] \) is calculated by replacing each \(z_{ig}\) by \(z_{ig}^{(k+1)}=\frac{\tau _{ig}}{\sum _{j=1}^G\tau _{ij}}\), where
and \({\mathbb {E}}_{{\varvec{\theta }}^{(k)}}\left[ W_{ig}|{\mathcal {S}}\right] \) is computed by replacing each \(w_{ig}\) by
where \({\varvec{\varSigma }}_g^{(k)}={\varvec{\varLambda }}^{(k)}{\varvec{\varOmega }}_g^{(k)}{\varvec{\varLambda }}^{(k)'}+{\varvec{\varPsi }}^{(k)}\). For the M-step, the maximization of this complete-data log-likelihood yields
As in the “First cycle” section of Appendix 1, the degrees of freedom \(\nu _g\) are updated according to (6), and we set \({\varvec{\theta }}^{(k+1/2)}=\left\{ {\varvec{\theta }}_1^{(k+1)},{\varvec{\theta }}_2^{(k)}\right\} \).
1.2 Second cycle
Here, \({\varvec{\theta }}_2=\left\{ {\varvec{\varSigma }}_g ; g=1, \ldots , G \right\} = \left\{ {\varvec{\varLambda }},{\varvec{\varOmega }}_g,{\varvec{\varPsi }};g=1, \ldots , G\right\} \), where the missing data are the unobserved group labels \(\varvec{z}_i\) and the latent factors \(\varvec{u}_{ig}\), \(i=1,\ldots ,n\) and \(g=1,\ldots ,G\). Therefore, the complete-data likelihood is
because \(Y\) is conditionally independent of \({\varvec{U}}\) given \({\varvec{X}}={\varvec{x}}\). The E-step on the second cycle of the \((k+1)\)th iteration requires the calculation of \(Q_2({\varvec{\theta }}_2; {\varvec{\theta }}^{(k+1/2)})={\mathbb {E}}_{{\varvec{\theta }}^{(k+1/2)}}\left[ l_{c2}({\varvec{\theta }}_2)|{\mathcal {S}}\right] \). Based on (5), the expectations involved in calculating \(Q_2({\varvec{\theta }}_2; {\varvec{\theta }}^{(k+1/2)})\) are given by
where
and
Alternatively, the expectation in (11)–(13) can be written as
Then, the \(g\)th term of the expected complete-data log-likelihood \(Q_2({\varvec{\theta }}_2; {\varvec{\theta }}^{(k+1/2)})\) becomes
where \(\text {C}({\varvec{\theta }}_1^{(k+1)})\) denotes the terms in the complete-data log-likelihood that do not depend on \({\varvec{\theta }}_2\). The expected complete-data log-likelihood from the second cycle is maximized with respect to \({\varvec{\varOmega }}_g\) by solving
which yields
where
The expected complete-data log-likelihood from the second cycle is maximized with respect to \({\varvec{\varPsi }}\), using (16), by solving
This yields
The expected complete-data log-likelihood from the second cycle is maximized with respect to \({\varvec{\varLambda }}\), using (11)–(13), by solving
which yields