Abstract
The paper discusses a new approach for incorporating hard constraints into the K-means algorithm for semi-supervised clustering. An analytic modification of the objective function of K-means is proposed that has not been previously considered in the literature.
References
Barbier, G., Zafarani, R., Gao, H., Fung, G., Liu, H. (2012). Maximizing benefits from crowdsourced data. Computational and Mathematical Organization Theory, 18, 257–279.
Basu, S., Banerjee, A., Mooney, R. (2002). Semi-supervised clustering by seeding. In Proceedings of the 19th international conference on machine learning (pp. 19–26).
Basu, S., Banerjee, A., Mooney, R. (2004). Active semi-supervision for pairwise constrained clustering. In Proceedings of the SIAM international conference on data mining.
Basu, S., Davidson, I., Wagstaff, K. (2008). Constrained clustering: advances in algorithms, theory, and applications. Boca Raton: CRC Press.
Bilenko, M., & Mooney, J.R. (2003). Adaptive duplicate detection using learnable string similarity measures. In International conference on knowledge discovery and data mining (pp. 39–48).
Celeux, G., & Govaert, G. (1993). Comparison of the mixture and the classification maximum likelihood in cluster analysis. Journal of Statistical Computation and Simulation, 47, 127–146.
Covões, T.F., Hruschka, E.R., Ghosh, J. (2013). A study of k-means-based algorithms for constrained clustering. Intelligent Data Analysis, 17, 485–505.
Davidson, I., & Ravi, S. (2005). Clustering with constraints: feasibility issues and the k-means algorithm. In Proceedings of the 2005 SIAM international conference on data mining (pp. 138–149): SIAM.
Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood for incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.
DeSarbo, W.S., & Mahajan, V. (1984). Constrained classification: the use of a priori information in cluster analysis. Psychometrika, 49, 187–215.
Dinler, D., & Tural, M.K. (2016). A survey of constrained clustering. In Unsupervised learning algorithms (pp. 207–235): Springer.
Fatehi, K., Bozorgi, A., Zahedi, M.S., Asgarian, E. (2014). Improving semi-supervised constrained k-means clustering method using user feedback. Journal of Computing and Security, 1, 273–261.
Gu, L., & Lu, X. (2012). Semi-supervised subtractive clustering by seeding. In 2012 9th international conference on fuzzy systems and knowledge discovery (pp. 738–741): IEEE.
Hennig, C., Meila, M., Murtagh, F., Rocci, R. (2015). Handbook of cluster analysis. Boca Raton: CRC Press.
Liu, H., & Fu, Y. (2015). Clustering with partition level side information. In 2015 IEEE international conference on data mining (pp. 877–882): IEEE.
Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19, 354–376.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Melnykov, V., Chen, W.-C., Maitra, R. (2012). Mixsim: an R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51, 1–25.
Melnykov, V., Melnykov, I., Michael, S. (2016). Semi-supervised model-based clustering with positive and negative constraints. Advances in data analysis and classification, 10, 327–349.
Nimmo, D.W.R., Herrmann, S.J., Sublette, J.E., Melnykov, I.V., Helland, L.K., Romine, J.A., Carsella, J.S., Herrmann-Hoesing, L.M., Turner, J.A., Vanden Heuvel, B.D. (2018). Occurrence of Chironomid species (Diptera: Chironomidae) in the high Se-78 concentrations and high pH of Fountain Creek Watershed, Colorado, USA. Western North American Naturalist, 78, 39–64–26.
Ruiz, C., Spiliopoulou, M., Menasalvas, E. (2010). Density-based semi-supervised clustering. Data Mining and Knowledge Discovery, 21, 345–370.
Śmieja, M., & Wiercioch, M. (2017). Constrained clustering with a complex cluster structure. Advances in Data Analysis and Classification, 11, 493–518.
Steinley, D., & Brusco, M.J. (2011). Evaluating mixture modeling for clustering: recommendations and cautions. Psychological Methods, 16, 63.
Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S. (2001). Constrained K-means clustering with background knowledge. In Proceedings of the eighteenth international conference on machine learning (ICML-2001) (pp. 577–584).
Wang, X., Wang, C., Shen, J. (2011). Semi–supervised K-means clustering by optimizing initial cluster centers. In International conference on web information systems and mining (pp. 178–187): Springer.
Yu, Z., Luo, P., You, J., Wong, H.-S., Leung, H., Wu, S., Zhang, J., Han, G. (2015). Incremental semi-supervised clustering ensemble for high dimensional data clustering. IEEE Transactions on Knowledge and Data Engineering, 28, 701–714.
Zhigang, C., Xuan, L., Fan, Y. (2013). Constrained k-means with external information. In 2013 8th International conference on computer science & education (pp. 490–493): IEEE.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Melnykov, I., Melnykov, V. A Note on the Formal Implementation of the K-means Algorithm with Hard Positive and Negative Constraints. J Classif 37, 789–809 (2020). https://doi.org/10.1007/s00357-019-09349-x
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00357-019-09349-x