Abstract
We propose a parsimonious extension of the classical latent class model to cluster categorical data by relaxing the conditional independence assumption. Under this new mixture model, named conditional modes model (CMM), variables are grouped into conditionally independent blocks. Each block follows a parsimonious multinomial distribution in which the few free parameters model the probabilities of the most likely levels, while the remaining probability mass is spread uniformly over the other levels of the block. Thus, when the conditional independence assumption holds, this model defines parsimonious versions of the standard latent class model. Moreover, when this assumption is violated, the proposed model brings out the main intra-class dependencies between variables, thus summarizing each class with relatively few characteristic levels. Model selection is carried out by a hybrid MCMC algorithm that does not require preliminary parameter estimation. Maximum likelihood estimation is then performed via an EM algorithm for the best model only. The model properties are illustrated on simulated data and on three real data sets using the associated R package CoModes. The results show that this model reduces the biases induced by the conditional independence assumption while providing meaningful parameters.
Similar content being viewed by others
Notes
Note that the repartition of the variables into blocks is identical across classes. This choice is motivated by identifiability and interpretation reasons that we detail later.
References
Agresti A (2002) Categorical data analysis, vol 359. Wiley, New York
Allman E, Matias C, Rhodes J (2009) Identifiability of parameters in latent structure models with many observed variables. Ann Stat 37(6A):3099–3132
Bartholomew D, Knott M, Moustaki I (2011) Latent variable models and factor analysis: a unified approach, vol 899. Wiley, New York
Biernacki C, Celeux G, Govaert G (2010) Exact and Monte Carlo calculations of integrated likelihoods for the latent class model. J Stat Plan Inference 140(11):2991–3002
Bretagnolle V (2007) Personal communication. Source: Museum
Celeux G, Govaert G (1991) Clustering criteria for discrete data and latent class models. J Classif 8(2):157–176
Chavent M, Kuentz V, Saracco J (2010) A partitioning method for the clustering of categorical variables. In: Classification as a tool for research. Springer, Berlin, Heidelberg, pp 91–99
Choirat C, Seri R (2012) Estimation in discrete parameter models. Stat Sci 27(2):278–293
Czerniak J, Zarzycki H (2003) Application of rough sets in the presumptive diagnosis of urinary system diseases. In: Artificial intelligence and security in computing systems, ACS'2002 9th international conference proceedings, pp 41–51
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38
Espeland M, Handelman S (1989) Using Latent class models to characterize and assess relative error in discrete measurements. Biometrics 45(2):587–599
Gollini I, Murphy T (2014) Mixture of latent trait analyzers for model-based clustering of categorical data. Stat Comput 24(4):569–588
Goodman L (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61(2):215–231
Govaert G (2010) Data analysis, vol 136. Wiley, New York
Hagenaars J (1988) Latent structure models with direct effects between indicators: local dependence models. Sociol Methods Res 16(3):379–405
Hand D, Yu K (2001) Idiot's Bayes: not so stupid after all? Int Stat Rev 69(3):385–398
Huang J, Ng M, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(5):657–668
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Jajuga K, Sokołowski A, Bock H (2002) Classification, clustering and data analysis: recent advances and applications. Springer, Berlin
Jorgensen M, Hunt L (1996) Mixture model clustering of data sets with categorical and continuous variables. In: Proc Conf ISIS 96, pp 375–384
Kruskal J (1976) More factors than subjects, tests and treatments: An indeterminacy theorem for canonical decomposition and individual differences scaling. Psychometrika 41(3):281–293
Kruskal J (1977) Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl 18(2):95–138
Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2014) Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. J Stat Softw (in press)
McLachlan G, Krishnan T (1997) The EM algorithm. Wiley series in probability and statistics: applied probability and statistics. Wiley, New York
McLachlan G, Peel D (2000) Finite mixture models. Wiley series in probability and statistics: applied probability and statistics. Wiley, New York
Moran M, Walsh C, Lynch A, Coen R, Coakley D, Lawlor B (2004) Syndromes of behavioural and psychological symptoms in mild Alzheimer's disease. Int J Geriatr Psychiatry 19(4):359–364
Qu Y, Tan M, Kutner M (1996) Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 52(3):797–810
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Van Hattum P, Hoijtink H (2009) Market segmentation using brand strategy research: Bayesian inference with respect to mixtures of log-linear models. J Classif 26(3):297–328
Vermunt J (2003) Multilevel latent class models. Sociol Methodol 33(1):213–239
Additional information
The R package CoModes is downloadable at https://r-forge.r-project.org/R/?group_id=1809.
Appendices
Appendix: Generic identifiability of CMM
When \(d\ge 3\), CMM is generically identifiable (i.e., the subset of the parameter space where the model is not identifiable has Lebesgue measure zero). The proof is based on results of Allman et al. (2009), which rely on Kruskal's theorem (Kruskal 1976, 1977). It proceeds in three steps: a reminder of Kruskal's results for three-way tables, the proof of generic identifiability when \(d=3\), and its extension to models with more than three blocks.
Kruskal results For a matrix M, the Kruskal rank of M, denoted by \(\text {rank}_K\; M\), is the largest number I such that every set of I rows of M is linearly independent.
Theorem 1
(Kruskal 1976, 1977). Let \(M_1\), \(M_2\) and \(M_3\) be three matrices with the same number g of rows and let \(I_j=\text {rank}_K\; M_j\). If
$$I_1 + I_2 + I_3 \ge 2g + 2,$$
then the tensor \([M_1, M_2, M_3]\) uniquely determines the \(M_j\), up to simultaneous permutation and rescaling of the rows.
Generic identifiability of CMM with three blocks Let \(k_0=\underset{k}{\text {argmin }} u_{kj}\) and let \(M_j\) denote the matrix whose row k is the conditional distribution of block j given class k.
By denoting \(\xi _j=\underset{k}{\min }\; u_{kj} +1\), generically, we have
$$\text {rank}_K\; M_j = \min (g, \xi _j).$$
Corollary 1
The parameters of CMM with three blocks are generically identifiable, up to label swapping, provided that
$$\min (g,\xi _1)+\min (g,\xi _2)+\min (g,\xi _3) \ge 2g+2.$$
Generic identifiability of CMM with more than three blocks Following Allman et al. (2009), we generalize the result to d blocks by observing that d blocks of categorical variables can be combined into three composite categorical variables, so that Kruskal's theorem applies.
Corollary 2
We consider a CMM with d blocks where \(d\ge 3\). If there exists a tri-partition of the set \(\{1,\ldots ,d\}\) into three disjoint nonempty subsets \(S_1\), \(S_2\) and \(S_3\) such that, with \(\gamma _i=\prod _{j\in S_i}\xi _j\),
$$\min (g,\gamma _1)+\min (g,\gamma _2)+\min (g,\gamma _3) \ge 2g+2,$$
then the model parameters are generically identifiable up to label swapping.
Approximation of \(I(u_{kj})\)
First, we define a new parametrization of the block distribution that facilitates the computation of the integrated complete-data likelihood, together with the prior distribution related to this new block parametrization. Second, we highlight the relationship between the embedded models. We conclude with the computation of the integrated complete-data likelihood, which is the target result.
1.1 New parametrization of the block distribution
Without loss of generality, we assume that the elements of \(\varvec{\delta }_{kj}\) are ordered by decreasing probability mass, and we introduce the new parametrization of \(\varvec{\alpha }_{kj}\), denoted by \(\varvec{\varepsilon }_{kj}=(\varepsilon _{kjh};h=1,\ldots ,u_{kj})\), where \(\varvec{\varepsilon }_{kj} \in \mathcal {E}_{kj}=\left[ \frac{1}{m_j};1\right] \times \ldots \times \left[ \frac{1}{m_j - u_{kj}+1};1\right] \) and where \(\varepsilon _{kjh}\) is defined by
Each \(\varepsilon _{kjh}\) follows a beta distribution truncated to the interval \(\left[ \frac{1}{m_j-h+1},1\right] \), so
The conditional probability of \(\mathbf x ^j=(\mathbf x _i^j;i=1,\ldots ,n)\) is
1.2 Relation between embedded models
Consider the model with \(u_{kj}^{\ominus }\) modes and parameters \((\varvec{\delta }_{kj}^{\star \ominus },\varvec{\varepsilon }_{kj}^{\ominus })\), and the model with \(u_{kj}\) modes and parameters \((\varvec{\delta }^{\star }_{kj},\varvec{\varepsilon }_{kj})\), such that \(u_{kj}^{\ominus }=u_{kj}-1\) and such that the \(u_{kj}^{\ominus }\) modes with the largest probabilities have the same locations (\(\forall h\in \varvec{\delta }_{kj}^{\star \ominus },\; h\in \varvec{\delta }_{kj}^{\star }\)) and the same probability masses \((\varepsilon _{kjh}^{\ominus }=\varepsilon _{kjh},\; h<u_{kj})\). These embedded models satisfy the following relation
1.3 Integrated complete-data likelihood
Finally, the integrated complete-data likelihood is approximated by neglecting the sum over the discrete mode-location parameters and by performing the exact computation over the continuous parameters:
Proof
If, for the model with \(u_{kj}-1\) modes, the best mode locations are known and given by \(\varvec{\delta }_{kj}^{\star \ominus }\), then the conditional probability of \(\mathbf x ^j\) for a model with \(u_{kj}\) modes is
where \(\Delta _{kj}=\{\varvec{\delta }_{kj}: \varvec{\delta }_{kj}^{\star \ominus } \subset \varvec{\delta }_{kj} \text { and } \text {card}(\varvec{\delta }_{kj})= u_{kj} \}\). Thus, by approximating this sum by its maximum element, we obtain that
As \(p(\mathbf x ^{j}|\mathbf z ,u_{kj}=0)=(m_j)^{-n_k}\), by applying (26) recursively, we obtain
\(\square \)
Details on the model sampling
The sampling of \((\varvec{s}^{[s+1]},\varvec{u}^{[s+1]})\) from (10) is performed in two steps. First, a new repartition of the variables into blocks and the mode numbers of the modified blocks, respectively denoted by \( \varvec{s}^{[s+1]}\) and \(\varvec{u}^{[s+1/2]}\), are sampled by one iteration of a Metropolis–Hastings algorithm. Second, the mode number of each block is sampled by one MCMC iteration. Thus, the sampling of \(\varvec{m}^{[s+1]}\) is decomposed into the following two steps
Thus, this chain has \(p(\varvec{s},\varvec{u}|g,\mathbf x ,\mathbf z ^{[s+1]})\) as stationary distribution.
1.1 Metropolis-Hastings algorithm to sample from (31)
This sampling is performed by one iteration of the Metropolis–Hastings algorithm, divided into two steps. First, the proposal distribution \(q(.;\varvec{m}^{[s]})\) generates a candidate \(\varvec{m}^{\star }=(g,\varvec{s}^{\star },\varvec{u}^{\star })\). Second, \(\varvec{m}^{[s+1/2]}\) is sampled according to the acceptance probability \(\mu ^{[s]}\) defined by
Note that the computation of \(\mu ^{[s]}\) requires computing the integrated complete-data likelihood defined by (17). The sampling of \(\varvec{m}^{[s+1/2]}\) is written as
The proposal distribution \( q(.;\varvec{m}^{[s]})\) samples \(\varvec{m}^{\star }\) in two steps. The first step changes the block assignment of one variable: in practice, \(\varvec{s}^{\star }\) is uniformly sampled in \(V(\varvec{s}^{[s]})=\{\varvec{s}: \exists ! b \text { such that } b\in \varvec{s}_j^{[s]} \text { and } b \notin \varvec{s}_j\}\). The second step uniformly samples the mode numbers of the modified blocks among all their possible values, while \(u^{\star }_{kj}=u^{[s]}_{kj}\) for the non-modified blocks (i.e., the blocks j such that \(\varvec{s}_j^{[s]}=\varvec{s}_j^{\star }\)). A schematic implementation of this move is sketched below.
1.2 MCMC algorithm to sample \(\varvec{u}^{[s+1]}\)
This step increases or decreases the mode number of each block by one at each iteration. Thus, \(u_{kj}^{[s+1]}\) is sampled according to \(p(u_{kj} | \varvec{m}^{[s+1/2]},\mathbf x ,\mathbf z ^{[s+1]})\), defined by
Keywords
- Categorical data
- Clustering
- Integrated complete-data likelihood
- MCMC algorithm
- Mixture models
- Model selection