Abstract
Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard; moreover, in high-dimensional spaces the classic metrics fail to identify the real similarities between objects. Second, the observed phenomena evolve, so the datasets keep accumulating over time. In this paper we propose a solution to both problems. To tackle high dimensionality, we apply a co-clustering approach to the dataset that stores the occurrences of features in the observed objects: co-clustering computes a partition of the objects and a partition of the features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters hierarchically, producing two hierarchies: one on the objects and one on the features. The two hierarchies are coupled, because the clusters at a given level of one hierarchy are paired with the clusters at the same level of the other hierarchy to form the co-clusters. Each cluster of one hierarchy thus provides insight into the clusters of the other. A further novelty is that the number of clusters is not bounded a priori; nevertheless, the produced hierarchies remain compact, and therefore more readable, because our method allows multiple splits of a cluster at the lower level. As regards the second challenge, the accumulation of data makes the datasets intractably large over time. In this case an incremental solution relieves the issue because it partitions the problem. In this paper we introduce an incremental version of our hierarchical co-clustering algorithm: it starts from an intermediate solution computed on the previous version of the data and updates the co-clustering result considering only the newly added block of data. This speeds up the computation with respect to the original approach, which would recompute the result on the whole dataset, and it guarantees approximately the same answer as the original version at a much lower computational cost. We validate the incremental approach on several high-dimensional datasets and compare it thoroughly with both the original version of our algorithm and state-of-the-art competitors. The obtained results open the way to a novel usage of co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally, thus “incorporating” data gradually into an ongoing co-clustering solution.
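The abstract describes two ideas: co-clustering an object-feature occurrence matrix into coupled partitions, and folding newly arrived blocks of objects into an existing solution instead of re-clustering the whole dataset. The following minimal Python sketch illustrates only the incremental idea under simplified assumptions: it uses a flat (non-hierarchical) co-clustering and a cosine-similarity assignment rule, so it is not the authors' algorithm, and all function names are hypothetical.

```python
import numpy as np

# Simplified illustration (not the authors' method): keep a flat co-clustering
# (row clusters x column clusters) of a binary object-feature matrix, and fold
# a new block of objects into the existing solution instead of re-clustering.

def cocluster_profiles(X, row_labels, col_labels):
    """Mean occurrence of each (row-cluster, column-cluster) block."""
    r, c = row_labels.max() + 1, col_labels.max() + 1
    P = np.zeros((r, c))
    for i in range(r):
        for j in range(c):
            block = X[np.ix_(row_labels == i, col_labels == j)]
            P[i, j] = block.mean() if block.size else 0.0
    return P

def assign_new_objects(X_new, col_labels, profiles):
    """Assign each new object to the existing row cluster whose
    column-cluster profile it matches best (cosine similarity).
    The assignment rule is an assumption made for illustration only."""
    c = col_labels.max() + 1
    # Summarize each new object over the existing column clusters.
    S = np.stack([X_new[:, col_labels == j].mean(axis=1) for j in range(c)], axis=1)
    S_norm = S / (np.linalg.norm(S, axis=1, keepdims=True) + 1e-12)
    P_norm = profiles / (np.linalg.norm(profiles, axis=1, keepdims=True) + 1e-12)
    return np.argmax(S_norm @ P_norm.T, axis=1)

# Usage: a 6x5 occurrence matrix already split into two row and two column
# clusters, then a new block of two objects incorporated incrementally.
X = np.array([[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 1, 1]])
rows = np.array([0, 0, 0, 1, 1, 1])
cols = np.array([0, 0, 0, 1, 1])
P = cocluster_profiles(X, rows, cols)
X_new = np.array([[1, 1, 0, 0, 0], [0, 0, 1, 1, 1]])
print(assign_new_objects(X_new, cols, P))   # expected: [0 1]
```

In the paper's setting the row and column clusters would additionally be organized into coupled hierarchies and the update would revise both partitions; the sketch above only conveys why processing the added block alone is cheaper than recomputing on the full matrix.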
Additional information
Responsible editor: Ian Davidson.
Cite this article
Pensa, R.G., Ienco, D. & Meo, R. Hierarchical co-clustering: off-line and incremental approaches. Data Min Knowl Disc 28, 31–64 (2014). https://doi.org/10.1007/s10618-012-0292-8