Abstract
k-means is a benchmark algorithm in cluster analysis. It belongs to the large family of heuristics based on location-allocation steps, which alternately locate cluster centers and allocate data points to them until no further improvement is possible. Such heuristics are known to suffer from a phenomenon called degeneracy, in which some of the clusters are empty. In this paper, we compare and propose a series of strategies to circumvent degenerate solutions during a k-means execution. Our computational experiments show that these strategies are effective, leading to better clustering solutions in the vast majority of the cases in which degeneracy appears in k-means. Moreover, we compare the use of our fixing strategies within k-means against the use of two initialization methods from the literature. These results demonstrate how useful the proposed strategies can be, especially inside memory-based clustering algorithms.
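The location-allocation loop and the empty-cluster repair described above can be made concrete with a short sketch. The following Python/NumPy code is a minimal illustration under one assumed repair rule (re-seed an empty cluster with the point farthest from its currently assigned center); the function name and this rule are hypothetical choices for exposition and are not the specific strategies evaluated in the paper.

```python
import numpy as np

def kmeans_with_empty_cluster_repair(X, k, n_iter=100, seed=0):
    """Lloyd-style k-means with a simple repair for empty (degenerate) clusters.

    Illustrative sketch only: when a cluster is empty after the allocation
    step, it is re-seeded with the point farthest from its assigned center.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers

    for _ in range(n_iter):
        # Allocation step: squared Euclidean distances (n x k), nearest center per point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Degeneracy check: repair each empty cluster before the location step.
        own_dist = dists[np.arange(len(X)), labels]
        for j in range(k):
            if not np.any(labels == j):
                farthest = own_dist.argmax()   # point farthest from its own center
                labels[farthest] = j           # hand it to the empty cluster
                own_dist[farthest] = 0.0       # avoid moving the same point twice

        # Location step: move each center to the mean of its cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break                              # converged: centers stopped moving
        centers = new_centers

    return centers, labels
```

As a usage note, running this sketch with k close to the number of data points makes empty clusters likely after the first allocation step, which exercises the repair branch.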
Cite this article
Aloise, D., Damasceno, N.C., Mladenović, N. et al. On Strategies to Fix Degenerate k-means Solutions. J Classif 34, 165–190 (2017). https://doi.org/10.1007/s00357-017-9231-0