Abstract
The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation, and the dissimilarity between any two subpopulations is defined as the likelihood ratio statistic comparing the hypothesis that the two subpopulations differ in the parameters of their distributions against the hypothesis that they do not. A general algorithm for constructing a hierarchical classification is described, which has the important property of producing no inversions in the dendrogram. The essential elements of the algorithm are specified for well-known distributions (normal, multinomial and Poisson), and the general parametric case is also outlined. Several applications are discussed, the main one being a novel two-step approach to massive data: after clustering the data into a reasonable number of ‘bins’ by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality is assumed for the means calculated on each bin; this is justified by the central limit theorem together with the assumption that each bin contains a large number of units, which is generally the case for the truly massive data sets encountered in modern data analysis. However, no assumption is made about the data-generating distribution.
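The two-step scheme can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the binning uses plain Lloyd's k-Means, and the dissimilarity is a plausible likelihood-ratio statistic for the "equal means" hypothesis between groups of approximately normal bin means, with each cluster summarized by an information-weighted mean and an accumulated information matrix (the names `kmeans_bins`, `lr_dissimilarity`, etc., are ours).

```python
import numpy as np

def kmeans_bins(X, k, iters=20, seed=0):
    """Step 1: compress the data into k 'bins' with plain Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def bin_summaries(X, labels):
    """Per-bin mean and covariance of the mean (V_r = S_r / n_r)."""
    d = X.shape[1]
    means, covs = [], []
    for j in np.unique(labels):
        Z = X[labels == j]
        S = np.cov(Z.T) if len(Z) >= 2 else np.eye(d)
        means.append(Z.mean(axis=0))
        covs.append(S / len(Z) + 1e-9 * np.eye(d))  # small ridge for stability
    return means, covs

def lr_dissimilarity(mu_a, U_a, mu_b, U_b):
    """LR-type statistic for 'equal means' between two clusters of normal
    means, where U is the information (sum of inverse covariances)."""
    diff = mu_a - mu_b
    W = np.linalg.inv(np.linalg.inv(U_a) + np.linalg.inv(U_b))
    return float(diff @ W @ diff)

def hierarchical_lr(means, covs):
    """Step 2: greedily agglomerate bins; on a merge, informations add
    (U_{A∪B} = U_A + U_B) and the mean is information-weighted."""
    clusters = [(m, np.linalg.inv(V)) for m, V in zip(means, covs)]
    merge_heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = lr_dissimilarity(clusters[a][0], clusters[a][1],
                                        clusters[b][0], clusters[b][1])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        mu_a, U_a = clusters[a]
        mu_b, U_b = clusters[b]
        U = U_a + U_b
        mu = np.linalg.solve(U, U_a @ mu_a + U_b @ mu_b)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((mu, U))
        merge_heights.append(dist)
    return merge_heights
```

On two well-separated groups, the within-group bins merge at small heights and the final cross-group merge dominates; note that this naive greedy sketch does not by itself enforce the no-inversion property the paper establishes for its algorithm.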

















Notes
The proofs of this and the other propositions of this paper are given in Sect. 7.
Acknowledgements
We wish to thank Dr. F. Clavel for having shared the EPIC French data with us and Dr. E. Riboli for general assistance in becoming familiar with EPIC. The authors gratefully acknowledge the hospitality of the members of the Department of Epidemiology and Statistics during M. Castejón's and A. González's visits to McGill University. M. Castejón also thanks the members of INRIA for their hospitality during his visit. We gratefully acknowledge support from the Ministerio de Educación y Ciencia de España, Dirección General de Investigación, by means of the DPI2006-14784 and DPI2007-61090 research projects.
Appendix: Proofs
Proof of Proposition 1
Since \(d(A_i, A_j) > 0\) we have that \(f(A_r) > 0\), \(r = 1, 2, \ldots, M\). Thus \(f(A_i \cup A_j) > f(A_i) + f(A_j)\) and obviously \(f(A_i \cup A_j) > \max\{f(A_i), f(A_j)\}\).
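The inequality chain can be made explicit. Assuming, as the proof suggests (this relation is inferred from the argument, not stated in this excerpt), that the dissimilarity and the merge criterion are related by \(d(A_i, A_j) = f(A_i \cup A_j) - f(A_i) - f(A_j)\), we have:

```latex
f(A_i \cup A_j) = f(A_i) + f(A_j) + d(A_i, A_j)
                > f(A_i) + f(A_j)
                \geq \max\{f(A_i), f(A_j)\},
```

where the last inequality holds because \(f(A_r) > 0\) for all \(r\).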
Proof of Proposition 2
Now:
Therefore:
where we have used
Notice also that we can write
owing to the commutativity of symmetric matrices.
Proof of Proposition 3
Let us develop the first term within the curly bracket above:
Now notice that the last term is zero. This follows from:
and
since by definition
We now have,
and we can proceed as in Proposition 2. To avoid confusion, we write
whence
and therefore
and similarly
Substituting in the expression for \(d(A_i, A_j)\) above (Eq. 54, p. 36):
where we have used the commutativity of symmetric matrices, the definition \({{\mathcal{U}}}_{A_i}=\sum_{r \in A_i} V_r^{-1},\) and \({{\mathcal{U}}}_{A_i \cup A_j}={{\mathcal{U}}}_{A_i} + {{\mathcal{U}}}_{A_j}.\) Finally noting that:
we have
About this article
Cite this article
Ciampi, A., Lechevallier, Y., Limas, M.C. et al. Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets. Pattern Anal Applic 11, 199–220 (2008). https://doi.org/10.1007/s10044-007-0088-4