Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results | Data Mining and Knowledge Discovery Skip to main content
Log in

Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Clustering with constraints is a powerful method that allows users to specify background knowledge and the expected cluster properties. Significant work has explored the incorporation of instance-level constraints into non-hierarchical clustering but not into hierarchical clustering algorithms. In this paper we present a formal complexity analysis of the problem and show that constraints can be used to not only improve the quality of the resultant dendrogram but also the efficiency of the algorithms. This is particularly important since many agglomerative style algorithms have running times that are quadratic (or faster growing) functions of the number of instances to be clustered. We present several bounds on the improvement in the running times of algorithms obtainable using constraints.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  • Bae E, Bailey J (2006) COALA: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: Proceedings of the 6th IEEE international conference on data mining (ICDM 2006), Hong Kong, Dec 2006, pp 53–62

  • Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the 19th international conference on machine learning (ICML 2002), Sydney, Australia, Jul 2002, pp 27–34

  • Basu S, Bilenko M, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the 4th SIAM international conference on data mining (SDM 2004), Lake Buena Vista, FL, Apr 2004, pp 333–344

  • Basu S, Davidson I, Wagstaff K (2008) Advances in clustering with constraints: algorithms, theory and practice. Chapman & Hall, CRC Press (to appear)

  • Cormen T, Leiserson CE, Rivest R and Stein C (2001). Introduction to algorithms. MIT Press, Cambridge

    MATH  Google Scholar 

  • Davidson I, Ravi SS (2005a) Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the SIAM international conference on data mining (SDM 2005), Newport Beach, CA, Apr 2005, pp 138–149

  • Davidson I, Ravi SS (2005b) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Proceedings of the 15th european conference on principles and practice of knowledge discovery in databases (PKDD 2005), Porto, Portugal, Oct 2005, pp 59–70

  • Davidson I, Wagstaff K, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Proceedings of the 16th european conference on principles and practice of knowledge discovery in databases (PKDD 2006), Berlin, Germany, Sept 2006, pp 115–126

  • Davidson I, Ravi SS (2006) Identifying and generating easy sets of constraints for clustering. In: Proceedings of the 21st national conference on artificial intelligence (AAAI 2006), Boston, MA, Jul 2006

  • Davidson I and Ravi SS (2007a). The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min Know Disc 14(1): 25–61

    Article  MathSciNet  Google Scholar 

  • Davidson I, Ravi SS (2007b) Intractability and clustering with constraints. In: Proceedings of the international conference on machine learning (ICML 2007), Portland, OR, June 2007, pp 201–208

  • Davidson I, Ester M, Ravi SS (2007) Efficient incremental clustering with constraints. In: Proceedings of the ACM conference of knowledge discovery and data mining (KDD 2007), San Jose, CA, Aug 2007, pp 240–249

  • Dragomirescu L and Postelnicu T (2007). A natural agglomerative clustering method for biology. Biometrical J 33(7): 841–849

    Article  Google Scholar 

  • Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the international conference on machine learning (ICML 2003), Washington, DC, Aug 2003, pp 147–153

  • Graham RL, Knuth DE and Patashnik O (1989). Concrete mathematics: a foundation for computer science. Addison-Wesley, Reading

    MATH  Google Scholar 

  • Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the international conference on machine learning (ICML 2002), Sydney, Australia, Jul 2002, pp 307–314

  • Mitzenmacher M and Upfal E (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, New York

    MATH  Google Scholar 

  • Mohanta P, Mukherjee D, Acton S (2002) Agglomerative clustering for image segmentation. In: Proceedings of the international conference on pattern recognition (ICPR 2002), Qubec City, Canada, Aug 2002, pp 664–667

  • Nanni M (2005) Speeding-up hierarchical agglomerative clustering in presence of expensive metrics. In: Proceedings of the 9th pacific asia conference on knowledge discovery and data mining (PAKDD 2005), Hanoi, Vietnam, May 2005, pp 378–387

  • Schaefer TJ (1978) The complexity of satisfiability problems. In: Proceedings of the 10th ACM international symposium on theory of computing (STOC 1978), San Diego, CA, May 1978, pp 216–226

  • Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML 2000), Stanford, CA, Jun–Jul 2000, pp 1103–1110

  • Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained K-means clustering with background knowledge. In: Proceedings of the 18th international conference on machine learning (ICML 2001), Williamstown, MA, Jun–Jul 2001, pp 577–584

  • Xing E, Ng A, Jordan M, Russell S (2002) Distance metric learning with application to clustering with side-information. In: Advances in neural information processing systems (NIPS 2002), Dec 2002, vol 15. Vancouver, Canada, pp 505–512

  • Zho Y and Karypis G (2005). Hierarchical clustering algorithms for document datasets. Data Min Know Disc 10(2): 141–168

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ian Davidson.

Additional information

Responsible editor: Charles Elkan.

A preliminary version of this paper appeared as Davidson and Ravi (2005b).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Davidson, I., Ravi, S.S. Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18, 257–282 (2009). https://doi.org/10.1007/s10618-008-0103-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-008-0103-4

Keywords