Factor probabilistic distance clustering (FPDC): a new clustering method

Tortora, Cristina; Summa, Mireille Gettler; Marino, Marina; Palumbo, Francesco

doi:10.1007/s11634-015-0219-5

Regular Article
Published: 26 October 2015

Volume 10, pages 441–464, (2016)
Advances in Data Analysis and Classification

Abstract

Factor clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of the data and a clustering of the transformed data, optimizing a common criterion. Probabilistic distance (PD)-clustering is an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering (FPDC) is based on PD-clustering and involves a linear transformation of the original variables into a reduced number of orthogonal ones, using a criterion shared with PD-clustering. This paper demonstrates that Tucker3 decomposition can be used to accomplish this transformation. Factor PD-clustering alternates between Tucker3 decomposition and PD-clustering on the transformed data until convergence is achieved. This method can significantly improve the performance of the PD-clustering algorithm; large data sets can thus be partitioned into clusters with greater stability and robustness of the results. Real and simulated data sets are used to compare FPDC with its main competitors: it performs equally well when clusters are elliptically shaped, but outperforms its competitors with non-Gaussian shaped clusters or noisy data.
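To make the iterative, distribution-free nature of PD-clustering more concrete, here is a minimal sketch of the plain PD-clustering step only, not of FPDC or its Tucker3 stage. It assumes the commonly stated formulation in which, for each point, the product of membership probability and distance to a center is constant across clusters, so probabilities are inversely proportional to distances and centers are updated as weighted means with weights p²/d. The function name `pd_clustering`, its parameters, and the small `eps` guard are illustrative assumptions, not taken from the article or from any package.

```python
import math


def pd_clustering(points, centers, max_iter=100, tol=1e-6):
    """Illustrative PD-clustering sketch (assumed formulation, see lead-in).

    points  : list of equal-length numeric tuples
    centers : initial centers, one per cluster (hypothetical starting values)
    Returns the final centers and the membership probabilities computed
    in the last iteration.
    """
    centers = [list(c) for c in centers]
    eps = 1e-12  # assumption: guards against a point sitting exactly on a center
    probs = []
    for _ in range(max_iter):
        # Distance from every point to every center.
        dist = [[max(math.dist(x, c), eps) for c in centers] for x in points]

        # Membership probabilities: inversely proportional to the distances,
        # normalized so that each row sums to one.
        probs = []
        for row in dist:
            inv = [1.0 / d for d in row]
            total = sum(inv)
            probs.append([v / total for v in inv])

        # Center update: weighted mean with weights u_ik = p_ik^2 / d_ik.
        new_centers = []
        for k in range(len(centers)):
            weights = [probs[i][k] ** 2 / dist[i][k] for i in range(len(points))]
            wsum = sum(weights)
            new_centers.append([
                sum(w * x[j] for w, x in zip(weights, points)) / wsum
                for j in range(len(points[0]))
            ])

        # Stop once no center moves by more than the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, probs
```

A call such as `pd_clustering([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], [(1.0, 1.0), (9.0, 9.0)])` returns one center per cluster and one probability row per point. In FPDC, as the abstract describes it, a step of this kind would be run on the Tucker3-transformed data rather than on the raw variables. Starting centers that coincide with data points are best avoided in this sketch, since the `eps` guard then lets that single point dominate the update.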

Notes

Computations were run on Mac OS X Snow Leopard with 4 GB of 1067 MHz DDR3 RAM and a 2.26 GHz Intel Core 2 Duo processor.

Acknowledgments

The authors are grateful to an associate editor and anonymous reviewers for their very helpful comments and suggestions, the cumulative effect of which has been a stronger manuscript.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, McMaster University, Hamilton, Canada
Cristina Tortora
CEREMADE, Université Paris Dauphine, Paris, France
Mireille Gettler Summa
Dipartimento di Scienze Sociali, University of Naples Federico II, Naples, Italy
Marina Marino
Dipartimento di Scienze Politiche, University of Naples Federico II, Naples, Italy
Francesco Palumbo

Corresponding author

Correspondence to Francesco Palumbo.

Appendix 1

Correlation matrix of wine data set (Table 2), values equal to or higher than 0.5 in bold.

Table 2 Correlation matrix of wine data set

About this article

Cite this article

Tortora, C., Summa, M.G., Marino, M. et al. Factor probabilistic distance clustering (FPDC): a new clustering method. Adv Data Anal Classif 10, 441–464 (2016). https://doi.org/10.1007/s11634-015-0219-5

Received: 17 April 2014
Revised: 15 September 2015
Accepted: 28 September 2015
Published: 26 October 2015
Issue Date: December 2016
DOI: https://doi.org/10.1007/s11634-015-0219-5
