Abstract
Clustering with constraints is a powerful method that allows users to specify background knowledge and the expected cluster properties. Significant work has explored the incorporation of instance-level constraints into non-hierarchical clustering but not into hierarchical clustering algorithms. In this paper we present a formal complexity analysis of the problem and show that constraints can be used to not only improve the quality of the resultant dendrogram but also the efficiency of the algorithms. This is particularly important since many agglomerative style algorithms have running times that are quadratic (or faster growing) functions of the number of instances to be clustered. We present several bounds on the improvement in the running times of algorithms obtainable using constraints.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Bae E, Bailey J (2006) COALA: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: Proceedings of the 6th IEEE international conference on data mining (ICDM 2006), Hong Kong, Dec 2006, pp 53–62
Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the 19th international conference on machine learning (ICML 2002), Sydney, Australia, Jul 2002, pp 27–34
Basu S, Bilenko M, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the 4th SIAM international conference on data mining (SDM 2004), Lake Buena Vista, FL, Apr 2004, pp 333–344
Basu S, Davidson I, Wagstaff K (2008) Advances in clustering with constraints: algorithms, theory and practice. Chapman & Hall, CRC Press (to appear)
Cormen T, Leiserson CE, Rivest R and Stein C (2001). Introduction to algorithms. MIT Press, Cambridge
Davidson I, Ravi SS (2005a) Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the SIAM international conference on data mining (SDM 2005), Newport Beach, CA, Apr 2005, pp 138–149
Davidson I, Ravi SS (2005b) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Proceedings of the 15th european conference on principles and practice of knowledge discovery in databases (PKDD 2005), Porto, Portugal, Oct 2005, pp 59–70
Davidson I, Wagstaff K, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Proceedings of the 16th european conference on principles and practice of knowledge discovery in databases (PKDD 2006), Berlin, Germany, Sept 2006, pp 115–126
Davidson I, Ravi SS (2006) Identifying and generating easy sets of constraints for clustering. In: Proceedings of the 21st national conference on artificial intelligence (AAAI 2006), Boston, MA, Jul 2006
Davidson I and Ravi SS (2007a). The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min Know Disc 14(1): 25–61
Davidson I, Ravi SS (2007b) Intractability and clustering with constraints. In: Proceedings of the international conference on machine learning (ICML 2007), Portland, OR, June 2007, pp 201–208
Davidson I, Ester M, Ravi SS (2007) Efficient incremental clustering with constraints. In: Proceedings of the ACM conference of knowledge discovery and data mining (KDD 2007), San Jose, CA, Aug 2007, pp 240–249
Dragomirescu L and Postelnicu T (2007). A natural agglomerative clustering method for biology. Biometrical J 33(7): 841–849
Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the international conference on machine learning (ICML 2003), Washington, DC, Aug 2003, pp 147–153
Graham RL, Knuth DE and Patashnik O (1989). Concrete mathematics: a foundation for computer science. Addison-Wesley, Reading
Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the international conference on machine learning (ICML 2002), Sydney, Australia, Jul 2002, pp 307–314
Mitzenmacher M and Upfal E (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, New York
Mohanta P, Mukherjee D, Acton S (2002) Agglomerative clustering for image segmentation. In: Proceedings of the international conference on pattern recognition (ICPR 2002), Qubec City, Canada, Aug 2002, pp 664–667
Nanni M (2005) Speeding-up hierarchical agglomerative clustering in presence of expensive metrics. In: Proceedings of the 9th pacific asia conference on knowledge discovery and data mining (PAKDD 2005), Hanoi, Vietnam, May 2005, pp 378–387
Schaefer TJ (1978) The complexity of satisfiability problems. In: Proceedings of the 10th ACM international symposium on theory of computing (STOC 1978), San Diego, CA, May 1978, pp 216–226
Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML 2000), Stanford, CA, Jun–Jul 2000, pp 1103–1110
Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained K-means clustering with background knowledge. In: Proceedings of the 18th international conference on machine learning (ICML 2001), Williamstown, MA, Jun–Jul 2001, pp 577–584
Xing E, Ng A, Jordan M, Russell S (2002) Distance metric learning with application to clustering with side-information. In: Advances in neural information processing systems (NIPS 2002), Dec 2002, vol 15. Vancouver, Canada, pp 505–512
Zho Y and Karypis G (2005). Hierarchical clustering algorithms for document datasets. Data Min Know Disc 10(2): 141–168
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Charles Elkan.
A preliminary version of this paper appeared as Davidson and Ravi (2005b).
Rights and permissions
About this article
Cite this article
Davidson, I., Ravi, S.S. Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18, 257–282 (2009). https://doi.org/10.1007/s10618-008-0103-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-008-0103-4