Abstract
The last two decades have witnessed extensive research on multi-task learning algorithms in diverse domains such as bioinformatics, text mining, natural language processing, and image and video content analysis. However, all existing multi-task learning methods require either domain-specific knowledge to extract features or a careful setting of many input parameters. Both requirements have serious drawbacks: most obviously, we may discover a wrong or non-existent pattern because of poorly extracted features or incorrectly set parameters. In this work, we propose a feature-free and parameter-light multi-task clustering framework to overcome these disadvantages. Our proposal is motivated by the recent successes of Kolmogorov complexity-based methods in various applications. However, such methods are defined only for single-task problems because they lack a mechanism for sharing knowledge between different tasks. To address this problem, we create a novel dictionary-based compression dissimilarity measure that allows us to share knowledge across different tasks effectively. Experimental results with extensive comparisons demonstrate the generality and the effectiveness of our proposal.
References
Agarwal D (2007) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst (KAIS) 11(1):29–44
Allison L, Stern L, Edgoose T, Dix TI (2000) Sequence complexity for biological sequence analysis. Comput Chem 24(1):43–55
Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(4):461–486
Ando RK, Zhang T, Bartlett P (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. In: NIPS, pp 41–48
Baralis E, Bruno G, Fiori A (2011) Measuring gene similarity by means of the classification distance. Knowl Inf Syst (KAIS) 29(1):81–101
Baxter J (1995) Learning internal representations. In: COLT: proceedings of the workshop on computational learning theory, pp 311–320
Baxter J (2000) A model of inductive bias learning. J Artif Intell Res 12:149–198
Benedetto D, Caglioti E, Loreto V (2002) Language trees and zipping. Phys Rev Lett 88(4):2–5
Bhattacharya I, Godbole S, Joshi S, Verma A (2009) Cross-guided clustering: transfer of relevant supervision across domains for improved clustering. In: ICDM, pp 41–50
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: EMNLP, pp 120–128
Campana BJL, Keogh E (2010) A compression based distance measure for texture. In: SDM, pp 850–861
Caruana R (1997) Multitask learning. Mach Learn 28:41–75
Chapelle O, Shivaswamy PK, Vadrevu S, Weinberger KQ, Zhang Y, Tseng BL (2010) Multi-task learning for boosting with application to web search ranking. In: KDD, pp 1189–1198
Chen X, Kwong S, Li M (2000) A compression algorithm for DNA sequences and its applications in genome comparison. In: RECOMB, pp 52–61
Cilibrasi R, Vitanyi P (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545
Cilibrasi R, Vitányi P, Wolf RD (2004) Algorithmic clustering of music based on string compression. Comput Music J 28:49–67
Cilibrasi R, Vitányi PMB (2007) The google similarity distance. IEEE Trans Knowl Data Eng 19(3):370–383
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York
Dai W, Yang Q, Xue GR, Yu Y (2007) Boosting for transfer learning. In: ICML, pp 193–200
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: KDD, pp 269–274
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI (2009) An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst (KAIS) 19(2):249–264
Fodor IK (2002) A survey of dimension reduction techniques. Technical Report, US Department of Energy
Gu Q, Li Z, Han J (2011) Learning a kernel for multi-task clustering. In: AAAI, pp 368–373
Gu Q, Zhou J (2009) Learning the shared subspace for multi-task clustering and transductive transfer classification. In: ICDM, pp 159–168
Huy TN, Shao H, Tong B, Suzuki E. Website for this paper. http://sites.google.com/site/kaishuy/
Jing L, Ng MK, Huang JZ (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst (KAIS) 25(1):35–55
Jolliffe IT (2002) Principal component analysis, series: Springer series in statistics, 2nd edn. Springer, NY
Juba B (2006) Estimating relatedness via data compression. In: ICML, pp 441–448
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20:359–392
Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: KDD, pp 206–215
Lawrence ND, Platt JC (2004) Learning to learn with the informative vector machine. In: ICML, pp 65–72
Li M, Chen X, Li X, Ma B, Vitanyi P (2003) The similarity metric. In: ACM-SIAM symposium on discrete algorithms, pp 863–872
Liu Q, Liao X, Carin HL, Stack JR, Carin L (2009) Semisupervised multitask learning. IEEE Trans PAMI 31:1074–1086
Liu Q, Xu Q, Zheng VW, Xue H, Cao Z, Yang Q (2010) Multi-task learning for cross-platform siRNA efficacy prediction: an In-silico study. BMC Bioinform 11:181
Lovász L, Plummer MD (1986) Matching theory. North-Holland, Amsterdam
Lowenstern D, Hirsh H, Noordiwier M, Yianilos P (1995) DNA sequence classification using compression-based induction. Technical report, Center for Discrete Mathematics and Theoretical Computer Science
Mahmud MM (2007) On universal transfer learning. In: ALT, pp 135–149
Mahmud MM, Ray SR (2008) Transfer learning using Kolmogorov complexity: basic theory and empirical evaluations. In: NIPS, pp 985–992
Li M, Vitányi PMB (1997) An introduction to Kolmogorov complexity and its applications. Springer, New York
Molina LC, Belanche L, Nebot A (2002) Feature selection algorithms: a survey and experimental evaluation. In: ICDM, pp 306–313
Mount DW (2004) Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press, New York
Ozawa S, Roy A, Roussinov D (2009) A multitask learning model for online pattern recognition. IEEE Trans Neural Netw 20(3):430–445
Pham TD, Zuegg J (2004) A probabilistic measure for alignment-free sequence comparison. Bioinformatics 20(18):3455–3461
Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804
Schiffman SS, Reynolds ML, Young FW (1981) Introduction to multidimensional scaling: theory, methods, and applications. Erlbaum Associates, New York
Schwaighofer A, Tresp V, Yu K (2004) Learning Gaussian process kernels via hierarchical Bayes. In: NIPS, pp 1209–1216
Shi Y, Lan Z, Liu W, Bi W (2009) Extending semi-supervised learning methods for inductive transfer learning. In: ICDM, pp 483–492
Skibinski P, Grabowski S, Deorowicz S (2005) Revisiting dictionary-based compression. Softw Pract Exp 35:1455–1476
Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: SIGIR, pp 208–215
Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, pp 25–36
Thrun S, Pratt LY (eds) (1998) Learning to learn. Kluwer, Boston
Vinga S, Almeida JS (2003) Alignment-free sequence comparison-a review. Bioinformatics 19(4):513–523
Vitanyi PMB, Balbach FJ, Cilibrasi R, Li M (2008) Normalized information distance. In: Information theory and statistical learning, pp 45–82
Welch TA (1984) A technique for high-performance data compression. Computer 17(6):8–19
Xing Z, Pei J, Keogh EJ (2010) A brief survey on sequence classification. SIGKDD Explor 12(1):40–48
Yu K, Schwaighofer A, Tresp V (2003) Collaborative ensemble learning: combining collaborative and content-based information filtering via hierarchical Bayes. In: UAI, pp 616–623
Zhang J, Ghahramani Z, Yang Y (2006) Learning multiple related tasks using latent independent component analysis. In: NIPS, pp 1585–1592
Zhang J, Zhang C (2010) Multitask bregman clustering. In: AAAI, pp 655–660
Zheng VW, Pan SJ, Yang Q, Pan JJ (2008) Transferring multi-device localization models using latent multi-task learning. In: AAAI, pp 1427–1432
Acknowledgments
Part of this research was supported by the Strategic International Cooperative Program funded by the Japan Science and Technology Agency (JST) and by Grant-in-Aid for Scientific Research (B) 21300053 from the Japanese Ministry of Education, Culture, Sports, Science and Technology. Thach Nguyen Huy is sponsored by the JICA Scholarship.
Appendix
1.1 An example of parsing a string by the LZW algorithm
Given a string x = ‘abcabcabc’, the LZW parsing is given in Table 6. In the initial step, the dictionary \( \Omega \) is set to the empty set, the string \( w \) is set to the empty string, and \( Comp(x) = 0 \). In the first iteration, the first character ‘a’ is read and stored in the variable C = ‘a’. Since ‘\(wC\)’ = ‘\(a\)’ is not in the dictionary, ‘\(a\)’ is added to it, so \(\Omega \) = {‘a’}, \(Comp(x) = 1\), and \( w \) remains the empty string. In the second iteration, C = ‘\(b\)’ and ‘\(wC\)’ = ‘\(b\)’ is also not in the dictionary, so ‘\(b\)’ is added, giving \(\Omega \) = {‘\(a\)’, ‘\(b\)’}. This process continues until the last character of ‘\(x\)’, and we obtain \(\Omega (\mathbf{x}) = \{\mathbf{a, b, c, ab, ca, bc}\} \) and \(Comp({\mathbf{x}}) = 6 \), as shown in Table 6.
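For concreteness, the following is a minimal Python sketch of this parsing scheme (the function name lzw_dictionary_size is ours, not from the paper); under the convention traced above, it reproduces \( Comp(\mathbf{x}) = 6 \) for x = ‘abcabcabc’.

```python
def lzw_dictionary_size(x):
    """Parse x with the LZW-style scheme described above and return Comp(x),
    the number of dictionary entries created. The dictionary starts empty and
    the working string w is reset whenever a new entry is added."""
    dictionary = set()
    w = ''
    for c in x:                  # read the next character C
        wc = w + c
        if wc in dictionary:     # wC already known: extend the current phrase
            w = wc
        else:                    # wC is new: store it and start a new phrase
            dictionary.add(wc)
            w = ''
    return len(dictionary)

# 'abcabcabc' yields the dictionary {a, b, c, ab, ca, bc}, i.e., Comp(x) = 6
print(lzw_dictionary_size('abcabcabc'))
```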
1.2 Summary of related theorems
In this section, we summarize theorems from other works that are used in this paper.
1. Definition 3.2 of [16]: Given a compressor \( Comp \) and two strings \( \mathbf{x} \) and \( \mathbf{y} \), the conditional compressed information, \( Comp(\mathbf{y}|\mathbf{x}) \), is defined as follows:
$$\begin{aligned} Comp(\mathbf{y}|\mathbf{x}) = Comp(\mathbf{xy}) - Comp(\mathbf{x}) \end{aligned}$$ (9)
\( Comp(\mathbf{y}|\mathbf{x}) \) is interpreted as the number of bits of information in \( \mathbf{y} \) relative to \( \mathbf{x} \); a compressor-based sketch of this quantity, together with the NCD, is given after this list.
2. Theorem 3.3 of [55]:
$$\begin{aligned} K(\mathbf{xy}) \le K(\mathbf{x}) + K(\mathbf{y}) + \mathcal{O}(1) \end{aligned}$$ (10)
where \( K \) is the Kolmogorov complexity, and \( K(\mathbf{x}) \), \( K(\mathbf{y}) \), and \( K(\mathbf{xy}) \) are the lengths of the shortest programs that output \( \mathbf{x} \), \( \mathbf{y} \), and \( \mathbf{x} \) concatenated with \( \mathbf{y} \), respectively. Proof: refer to Theorem 3.3, page 43 of [55].
3. Remark II.1 of [34]: Given a string \( \mathbf{x} \), \( K(\mathbf{x}) \) is the length of the shortest binary program \( x^* \) that computes \( \mathbf{x} \) on an appropriate universal computer, such as the universal Turing machine. We require that \( \mathbf{x} \) can be decompressed from its compressed version \( x^* \) by a general decompressor program, but we do not require that \( \mathbf{x} \) can be compressed to \( x^* \) by a general compressor program. In fact, it is easy to prove that no such compressor program exists, since \( K(\mathbf{x}) \) is a non-computable function. Thus, \( K(\mathbf{x}) \) serves as the ultimate lower bound on what a real-world compressor can possibly achieve.
4. Theorem VI.2 of [34]: The normalized information distance \( d_k(\mathbf{x}, \mathbf{y}) \) is a lower bound of every normalized compression distance \( NCD(\mathbf{x}, \mathbf{y}) \), in the sense that \( d_k(\mathbf{x}, \mathbf{y}) \le NCD(\mathbf{x}, \mathbf{y}) + O(1/K) \), where \( K = \min\{K(\mathbf{x}), K(\mathbf{y})\} \). Proof: refer to Theorem VI.2 on page 6 of [34]. Note that in [34], the authors use \( d(x, y) \) and \( f(x, y) \) to refer to \( d_k(x, y) \) and \( NCD(x, y) \), respectively.
1.3 Pseudocode of the k-medoids algorithm
The pseudocode of the k-medoids algorithm is given in Algorithm 3. It consists of three steps. First, each data instance is assigned to the closest medoid according to the dissimilarity values in the distance matrix \(D\) (Lines 4–5). Then, the algorithm replaces the current medoid of each cluster with the instance that minimizes the total distance to the other instances in that cluster (Lines 9–11). This process is repeated until the medoids no longer change.
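As a rough illustration, the following Python sketch implements the three steps above on a precomputed dissimilarity matrix \(D\); the NumPy-based function k_medoids and its random initialization are our assumptions, not the authors' Algorithm 3.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Cluster n instances, given an n x n dissimilarity matrix D, into k clusters."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)   # initial medoids (our assumption)
    for _ in range(max_iter):
        # Step 1: assign each instance to its closest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Step 2: replace each medoid by the cluster member with the
        # smallest total distance to the other members of its cluster.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        # Step 3: stop when the medoids no longer change.
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)        # final assignment
    return labels, medoids
```

Such a routine can be run directly on the distance matrix produced by any of the compression-based dissimilarities above, so no feature extraction is needed.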