A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering | Data Mining and Knowledge Discovery Skip to main content
Log in

A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Clustering algorithms help identify homogeneous subgroups from data. In some cases, additional information about the relationship among some subsets of the data exists. When using a semi-supervised clustering algorithm, an expert may provide additional information to constrain the solution based on that knowledge and, in doing so, guide the algorithm to a more useful and meaningful solution. Such additional information often takes the form of a cannot-link constraint (i.e., two data points cannot be part of the same cluster) or a must-link constraint (i.e., two data points must be part of the same cluster). A key challenge for users of such constraints in semi-supervised learning algorithms, however, is that the addition of inaccurate or conflicting constraints can decrease accuracy and little is known about how to detect whether expert-imposed constraints are likely incorrect. In the present work, we introduce a method to score each must-link and cannot-link pairwise constraint as likely incorrect. Using synthetic experimental examples and real data, we show that the resulting impact score can successfully identify individual constraints that should be removed or revised.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Notes

  1. We focus on discrete labels (e.g., classes) for simplicity of exposition, although there are numerous unsupervised (e.g., latent trait models) and supervised models (e.g., regression) which focus on continuous outcomes.

References

Download references

Acknowledgements

The authors would like to thank three anonymous referees for their valuable and constructive suggestions. This research was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca) and Compute Canada (www.computecanada.ca). D. Aloise and A. Hertz also thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for its financial support under grants No. 2017-05617 and 2017-05688.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rodrigo Randel.

Additional information

Responsible editor: Ian Davidson.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Randel, R., Aloise, D., Blanchard, S.J. et al. A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering. Data Min Knowl Disc 35, 2341–2368 (2021). https://doi.org/10.1007/s10618-021-00794-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-021-00794-0

Keywords

Navigation