Abstract
Clustering algorithms help identify homogeneous subgroups from data. In some cases, additional information about the relationship among some subsets of the data exists. When using a semi-supervised clustering algorithm, an expert may provide additional information to constrain the solution based on that knowledge and, in doing so, guide the algorithm to a more useful and meaningful solution. Such additional information often takes the form of a cannot-link constraint (i.e., two data points cannot be part of the same cluster) or a must-link constraint (i.e., two data points must be part of the same cluster). A key challenge for users of such constraints in semi-supervised learning algorithms, however, is that the addition of inaccurate or conflicting constraints can decrease accuracy and little is known about how to detect whether expert-imposed constraints are likely incorrect. In the present work, we introduce a method to score each must-link and cannot-link pairwise constraint as likely incorrect. Using synthetic experimental examples and real data, we show that the resulting impact score can successfully identify individual constraints that should be removed or revised.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
We focus on discrete labels (e.g., classes) for simplicity of exposition, although there are numerous unsupervised (e.g., latent trait models) and supervised models (e.g., regression) which focus on continuous outcomes.
References
Aggarwal CC (2015) Data mining. Springer, Berlin. https://doi.org/10.1007/978-3-319-14142-8
Aloise D, Deshpande A, Hansen P, Popat P (2009) Np-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248
Aloise D, Hansen P, Liberti L (2010) An improved column generation algorithm for minimum sum-of-squares clustering. Math Program 131(1–2):195–220. https://doi.org/10.1007/s10107-010-0349-7
Anil J, Rong J, Radha C (2015) Semi-supervised clustering. Book section semi-supervised clustering. CRC Press, Boca Raton. https://doi.org/10.1201/b19706-26
Ares ME, Parapar J, Barreiro A (2012) An experimental study of constrained clustering effectiveness in presence of erroneous constraints. Inf Process Manag 48(3):537–551. https://doi.org/10.1016/j.ipm.2011.08.006
Avella P, Sassano A, Vasil’ev I (2007) Computational study of large-scale p-median problems. Math Program 109(1):89–114. https://doi.org/10.1007/s10107-005-0700-6
Basu S, Banerjee A, Mooney RJ (2002) Semi-supervised clustering by seeding, In: Proceedings of the nineteenth international conference on machine learning, vol 656012. Morgan Kaufmann Publishers Inc., pp 27–34
Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. Soc Ind Appl Math. https://doi.org/10.1137/1.9781611972740.31
Basu S, Bilenko M, Banerjee A, Mooney RJ (2006) Probabilistic semi-supervised clustering with constraints. In: Semi-supervised learning. pp 71–98
Bertsimas D, Tsitsiklis J (1997) Introduction to linear optimization, 1st edn. Athena Scientific, Belmont
Bezdek J (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Blanchard SJ, Aloise D, DeSarbo WS (2012) The heterogeneous p-median problem for categorization based clustering. Psychometrika 77(4):741–762. https://doi.org/10.1007/s11336-012-9283-3
Brucker P (1978) On the complexity of clustering problems. In: Henn R, Korte B, Oettli W (eds) Optimization and operations research. Springer, Berlin, pp 45–54. https://doi.org/10.1007/978-3-642-95322-4_5
Campello RJ, Moulavi D, Zimek A, Sander J (2013) A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min Knowl Discov 27(3):344–371
Christou IT (2011) Coordination of cluster ensembles via exact methods. IEEE Trans Pattern Anal Mach Intell 33(2):279–93. https://doi.org/10.1109/TPAMI.2010.85
Costa LR, Aloise D, Mladenović N (2017) Less is more: basic variable neighborhood search heuristic for balanced minimum sum-of-squares clustering. Inf Sci 415:247–253
Davidson I (2012) Two approaches to understanding when constraints help clustering. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining—KDD ’12. ACM, 2339734, pp 1312–1320. https://doi.org/10.1145/2339530.2339734
Davidson I, Ravi SS (2005) Clustering with constraints: feasibility issues and the k-means algorithm. https://doi.org/10.1137/1.9781611972757.13
Davidson I, Ravi SS (2006)Identifying and generating easy sets of constraints for clustering. In: Proceedings of the 21st national conference on artificial intelligence—Volume 1, vol 1597593. AAAI Press, pp 336–341
Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Knowledge discovery in databases: PKDD 2006. Springer, Berlin, pp 115–126
Delattre M, Hansen P (1980) Bicriterion cluster analysis. IEEE Trans Pattern Anal Mach Intell PAMI 2(4):277–291. https://doi.org/10.1109/TPAMI.1980.4767027
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Edwards AWF, Cavalli-Sforza LL (1965) A method for cluster analysis. Biometrics 21(2):362–375
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188
Fisher ML (1981) The Lagrangian relaxation method for solving integer programming problems. Manag Sci 27(1):1–18. https://doi.org/10.1287/mnsc.27.1.1
García S, Labbé M, Marín A (2011) Solving large p-median problems with a radius formulation. INFORMS J Comput 23(4):546–556. https://doi.org/10.1287/ijoc.1100.0418
Grossi V, Romei A, Turini F (2017a) Survey on using constraints in data mining. Data Min Knowl Discov 31(2):424–464. https://doi.org/10.1007/s10618-016-0480-z
Grossi V, Romei A, Turini F (2017b) Survey on using constraints in data mining. Data Min Knowl Discov 31(2):424–464
Hansen P, Jaumard B (1997) Cluster analysis and mathematical programming. Math Program 79(1–3):191–215. https://doi.org/10.1007/bf02614317
Hansen P, Brimberg J, Urošević D, Mladenović N (2009) Solving large p-median clustering problems by primal-dual variable neighborhood search. Data Min Knowl Discov 19(3):351–375
Held M, Wolfe P, Crowder HP (1974) Validation of subgradient optimization. Math Program 6(1):62–88. https://doi.org/10.1007/bf01580223
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/BF01908075
Kariv O, Hakimi SL (1979) An algorithmic approach to network location problems. II: the p-medians. SIAM J Appl Math 37:539–560. https://doi.org/10.1137/0137041
Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, Hoboken
Kim S, Blanchard SJ, DeSarbo WS, Fong DK (2013) Implementing managerial constraints in model-based segmentation: extensions of Kim, Fong, and DeSarbo (2012) with an application to heterogeneous perceptions of service quality. J Mark Res 50(5):664–673
Kochetov Y, Ivanenko D (2005) Computationally difficult instances for the uncapacitated facility location problem. Operations research/computer science interfaces series. Springer, Boston, pp 351–367
Mallapragada PK, Jin R, Jain AK (2008) Active query selection for semi-supervised clustering. In: 2008 19th international conference on pattern recognition. IEEE, pp 1–4. https://doi.org/10.1109/ICPR.2008.4761792
Pinheiro DN, Aloise D, Blanchard SJ (2020) Convex fuzzy k-medoids clustering. Fuzzy Sets Syst 389:66–92
Randel R, Aloise D, Mladenović N, Hansen P (2019) On the k-medoids model for semi-supervised clustering. In: Sifaleras A, Salhi S, Brimberg J (eds) Variable neighborhood search. Springer, Cham, pp 13–27
Resende MGC, Werneck RF (2007) A fast swap-based local search procedure for location problems. Ann Oper Res 150(1):205–230. https://doi.org/10.1007/s10479-006-0154-0
Santi É, Aloise D, Blanchard SJ (2016) A model for clustering data from heterogeneous dissimilarities. Eur J Oper Res 253(3):659–672
Shor NZ, Kiwiel KC, Ruszcaynski A (1985) Minimization methods for nondifferentiable functions. Springer, Berlin
Wagstaff KL (2007) Value, cost, and sharing: open issues in constrained clustering. In: Džeroski S, Struyf J (eds) Knowledge discovery in inductive databases. Springer, Berlin, pp 1–10
Wagstaff K, Cardie C, Rogers S, Schrödl S (2001). Constrained k-means clustering with background knowledge, vol ICML ’01. In: Proceedings of the eighteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 577–584
Xiong S, Azimi J, Fern XZ (2014) Active learning of constraints for semi-supervised clustering. IEEE Trans Knowl Data Eng 26(1):43–54. https://doi.org/10.1109/tkde.2013.22
Xiong C, Johnson DM, Corso JJ (2017) Active clustering with model-based uncertainty reduction. IEEE Trans Pattern Anal Mach Intell 39(1):5–17. https://doi.org/10.1109/TPAMI.2016.2539965
Zhu X, Goldberg AB, Brachman R, Dietterich T (2009) Introduction to semi-supervised learning. Morgan and Claypool Publishers, San Rafael
Acknowledgements
The authors would like to thank three anonymous referees for their valuable and constructive suggestions. This research was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca) and Compute Canada (www.computecanada.ca). D. Aloise and A. Hertz also thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for its financial support under grants No. 2017-05617 and 2017-05688.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Ian Davidson.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Randel, R., Aloise, D., Blanchard, S.J. et al. A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering. Data Min Knowl Disc 35, 2341–2368 (2021). https://doi.org/10.1007/s10618-021-00794-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-021-00794-0