Abstract
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.
An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms both for minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
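To make the objective concrete, the sketch below counts disagreements (and agreements) for a given clustering of a ±-labeled complete graph. It is an illustration only, not code from the paper; the function names, the label encoding, and the dict-based representation are all choices made here for exposition.

```python
def disagreements(labels, clustering):
    """Count disagreements: '-' edges inside clusters plus '+' edges across.

    labels: dict mapping frozenset({u, v}) -> '+' or '-' for every pair
    clustering: dict mapping each vertex -> a cluster identifier
    (Hypothetical representation; the paper defines the objective, not an API.)
    """
    bad = 0
    for pair, sign in labels.items():
        u, v = tuple(pair)
        same = clustering[u] == clustering[v]
        if (sign == '-' and same) or (sign == '+' and not same):
            bad += 1
    return bad

def agreements(labels, clustering):
    # Agreements and disagreements partition all pairs, so one determines
    # the other: agreements = (total pairs) - disagreements.
    return len(labels) - disagreements(labels, clustering)

# Example: a triangle where u and v are similar but both differ from w.
labels = {frozenset({'u', 'v'}): '+',
          frozenset({'u', 'w'}): '-',
          frozenset({'v', 'w'}): '-'}
print(disagreements(labels, {'u': 0, 'v': 0, 'w': 1}))  # 0: perfect clustering
print(disagreements(labels, {'u': 0, 'v': 1, 'w': 1}))  # 2
```

Since agreements and disagreements sum to the total number of pairs, maximizing one is equivalent to minimizing the other at optimality; the two objectives nevertheless behave very differently for approximation, which is why the paper treats them separately.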
References
Alon, N., Fischer, E., Krivelevich, M., & Szegedy, M. (2000). Efficient testing of large graphs. Combinatorica, 20:4, 451–476.
Alon, N., & Spencer, J. H. (1992). The probabilistic method. John Wiley and Sons.
Arora, S., Frieze, A., & Kaplan, H. (2002). A new rounding procedure for the assignment problem with applications to dense graph arrangements. Mathematical Programming, 92:1, 1–36.
Arora, S., Karger, D., & Karpinski, M. (1999). Polynomial time approximation schemes for dense instances of NP-hard problems. JCSS, 58:1, 193–210.
Ben-David, S., Long, P. M., & Mansour, Y. (2001). Agnostic boosting. In Proceedings of the 2001 Conference on Computational Learning Theory (pp. 507–516).
Blum, A., & Karger, D. (1997). An Õ(n^{3/14})-coloring algorithm for 3-colorable graphs. Information Processing Letters, 61.
Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science (pp. 524–533).
Cohen, W., & McCallum, A. (2001). Personal communication.
Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. In ACM SIGIR'01 Workshop on Mathematical/Formal Methods in IR.
Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
Condon, A., & Karp, R. (2001). Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18:2, 116–140.
de la Vega, F. (1996). MAX-CUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms, 8:3, 187–198.
Demaine, E., & Immorlica, N. (2003). Correlation clustering with partial information. In Proceedings of APPROX.
Emanuel, D., & Fiat, A. (2003). Correlation clustering—Minimizing disagreements on arbitrary weighted graphs. In Proceedings of ESA.
Garey, M., & Johnson, D. (2000). Computers and intractability: A guide to the theory of NP-completeness. W. H. Freeman and Company.
Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. JACM, 45:4, 653–750.
Hochbaum, D., & Shmoys, D. (1986). A unified approach to approximation algorithms for bottleneck problems. JACM, 33, 533–550.
Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:2, 274–296.
Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. JACM, 45:6, 983–1006.
Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17:2/3, 115–142.
McSherry, F. (2001). Spectral partitioning of random graphs. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (pp. 529–537).
Parnas, M., & Ron, D. (2002). Testing the diameter of graphs. Random Structures and Algorithms, 20:2, 165–183.
Schulman, L. (2000). Clustering for edge-cost minimization. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (pp. 547–555).
Shamir, R., & Tsur, D. (2002). Improved algorithms for the random cluster graph model. In Proceedings of the Scandinavian Workshop on Algorithmic Theory (pp. 230–239).
Swamy, C. (2004). Correlation clustering: Maximizing agreements via semidefinite programming. In Proceedings of the Symposium on Discrete Algorithms.
Cite this article
Bansal, N., Blum, A. & Chawla, S. Correlation Clustering. Machine Learning 56, 89–113 (2004). https://doi.org/10.1023/B:MACH.0000033116.57574.95