Abstract
The last two decades have witnessed extensive research on multi-task learning algorithms in diverse domains such as bioinformatics, text mining, natural language processing, and image and video content analysis. However, all existing multi-task learning methods require either domain-specific knowledge to extract features or a careful setting of many input parameters. Both requirements have serious drawbacks: most obviously, we may discover a wrong or non-existent pattern because of poorly extracted features or incorrectly set parameters. In this work, we propose a feature-free and parameter-light multi-task clustering framework to overcome these disadvantages. Our proposal is motivated by the recent successes of Kolmogorov complexity-based methods in various applications. However, such methods are defined only for single-task problems because they lack a mechanism for sharing knowledge between different tasks. To address this problem, we create a novel dictionary-based compression dissimilarity measure that allows us to share knowledge across different tasks effectively. Experimental results with extensive comparisons demonstrate the generality and the effectiveness of our proposal.
References
Agarwal D (2007) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst (KAIS) 11(1):29–44
Allison L, Stern L, Edgoose T, Dix TI (2000) Sequence complexity for biological sequence analysis. Comput Chem 24(1):43–55
Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(4):461–486
Ando RK, Zhang T, Bartlett P (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. In: NIPS, pp 41–48
Baralis E, Bruno G, Fiori A (2011) Measuring gene similarity by means of the classification distance. Knowl Inf Syst (KAIS) 29(1):81–101
Baxter J (1995) Learning internal representations. In: COLT: proceedings of the workshop on computational learning theory, pp 311–320
Baxter J (2000) A model of inductive bias learning. J Artif Intell Res 12:149–198
Benedetto D, Caglioti E, Loreto V (2002) Language trees and zipping. Phys Rev Lett 88(4):2–5
Bhattacharya I, Godbole S, Joshi S, Verma A (2009) Cross-guided clustering: transfer of relevant supervision across domains for improved clustering. In: ICDM, pp 41–50
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: EMNLP, pp 120–128
Campana BJL, Keogh E (2010) A compression based distance measure for texture. In: SDM, pp 850–861
Caruana R (1997) Multitask learning. Mach Learn 28:41–75
Chapelle O, Shivaswamy PK, Vadrevu S, Weinberger KQ, Zhang Y, Tseng BL (2010) Multi-task learning for boosting with application to web search ranking. In: KDD, pp 1189–1198
Chen X, Kwong S, Li M (2000) A compression algorithm for DNA sequences and its applications in genome comparison. In: RECOMB, pp 52–61
Cilibrasi R, Vitanyi P (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545
Cilibrasi R, Vitányi P, Wolf RD (2004) Algorithmic clustering of music based on string compression. Comput Music J 28:49–67
Cilibrasi R, Vitányi PMB (2007) The google similarity distance. IEEE Trans Knowl Data Eng 19(3):370–383
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York
Dai W, Yang Q, Xue GR, Yu Y (2007) Boosting for transfer learning. In: ICML, pp 193–200
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: KDD, pp 269–274
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI (2009) An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst (KAIS) 19(2):249–264
Fodor IK (2002) A survey of dimension reduction techniques. Technical Report, US Department of Energy
Gu Q, Li Z, Han J (2011) Learning a kernel for multi-task clustering. In: AAAI, pp 368–373
Gu Q, Zhou J (2009) Learning the shared subspace for multi-task clustering and transductive transfer classification. In: ICDM, pp 159–168
Huy TN, Shao H, Tong B, Suzuki E. Website for this paper. http://sites.google.com/site/kaishuy/
Jing L, Ng MK, Huang JZ (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst (KAIS) 25(1):35–55
Jolliffe IT (2002) Principal component analysis, series: Springer series in statistics, 2nd edn. Springer, NY
Juba B (2006) Estimating relatedness via data compression. In: ICML, pp 441–448
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20:359–392
Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: KDD, pp 206–215
Lawrence ND, Platt JC (2004) Learning to learn with the informative vector machine. In: ICML, pp 65–72
Li M, Chen X, Li X, Ma B, Vitanyi P (2003) The similarity metric. In: ACM-SIAM symposium on discrete algorithms, pp 863–872
Liu Q, Liao X, Carin HL, Stack JR, Carin L (2009) Semisupervised multitask learning. IEEE Trans PAMI 31:1074–1086
Liu Q, Xu Q, Zheng VW, Xue H, Cao Z, Yang Q (2010) Multi-task learning for cross-platform siRNA efficacy prediction: an In-silico study. BMC Bioinform 11:181
Lovász L, Plummer MD (1986) Matching theory. North-Holland, Amsterdam
Lowenstern D, Hirsh H, Noordiwier M, Yianilos P (1995) DNA sequence classification using compression-based induction. Technical report, Center for Discrete Mathematics and Theoretical Computer Science
Mahmud MM (2007) On universal transfer learning. In: ALT, pp 135–149
Mahmud MM, Ray SR (2008) Transfer learning using Kolmogorov complexity: basic theory and empirical evaluations. In: NIPS, pp 985–992
Li M, Vitányi PMB (1997) An introduction to Kolmogorov complexity and its applications. Springer, New York
Molina LC, Belanche L, Nebot A (2002) Feature selection algorithms: a survey and experimental evaluation. In: ICDM, pp 306–313
Mount DW (2004) Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press, New York
Ozawa S, Roy A, Roussinov D (2009) A multitask learning model for online pattern recognition. IEEE Trans Neural Netw 20(3):430–445
Pham TD, Zuegg J (2004) A probabilistic measure for alignment-free sequence comparison. Bioinformatics 20(18):3455–3461
Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804
Schiffman SS, Reynolds ML, Young FW (1981) Introduction to multidimensional scaling: theory, methods, and applications. Erlbaum Associates, New York
Schwaighofer A, Tresp V, Yu K (2004) Learning Gaussian process kernels via hierarchical Bayes. In: NIPS, pp 1209–1216
Shi Y, Lan Z, Liu W, Bi W (2009) Extending semi-supervised learning methods for inductive transfer learning. In: ICDM, pp 483–492
Skibinski P, Grabowski S, Deorowicz S (2005) Revisiting dictionary-based compression. Softw Pract Exp 35:1455–1476
Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: SIGIR, pp 208–215
Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, pp 25–36
Thrun S, Pratt LY (eds) (1998) Learning to learn. Kluwer, Boston
Vinga S, Almeida JS (2003) Alignment-free sequence comparison-a review. Bioinformatics 19(4):513–523
Vitanyi PMB, Balbach FJ, Cilibrasi R, Li M (2008) Normalized information distance. In: Information theory and statistical learning, pp 45–82
Welch TA (1984) A technique for high-performance data compression. Computer 17(6):8–19
Xing Z, Pei J, Keogh EJ (2010) A brief survey on sequence classification. SIGKDD Explor 12(1):40–48
Yu K, Schwaighofer A, Tresp V (2003) Collaborative ensemble learning: combining collaborative and content-based information filtering via hierarchical Bayes. In: UAI, pp 616–623
Zhang J, Ghahramani Z, Yang Y (2006) Learning multiple related tasks using latent independent component analysis. In: NIPS, pp 1585–1592
Zhang J, Zhang C (2010) Multitask bregman clustering. In: AAAI, pp 655–660
Zheng VW, Pan SJ, Yang Q, Pan JJ (2008) Transferring multi-device localization models using latent multi-task learning. In: AAAI, pp 1427–1432
Acknowledgments
Part of this research was supported by the Strategic International Cooperative Program funded by the Japan Science and Technology Agency (JST) and by Grant-in-Aid for Scientific Research (B) 21300053 from the Japanese Ministry of Education, Culture, Sports, Science and Technology. Thach Nguyen Huy is sponsored by the JICA Scholarship.
Appendix
1.1 An example of parsing a string by the LZW algorithm
Given a string x = ‘abcabcabc’, the LZW parsing is given in Table 6. In the initial step, the dictionary \( \Omega \) is set to the empty set, the string \( w \) is set to the empty string, and \( Comp(x) = 0 \). In the first iteration, the first character ‘a’ is read and stored in the variable C = ‘a’. Since ‘\(wC\)’ = ‘\(a\)’ is not in the dictionary, ‘\(a\)’ is added to it, so \(\Omega \) = {‘a’}, \(Comp(x) = 1\), and \( w \) remains the empty string. In the second iteration, C = ‘\(b\)’ and ‘\(wC\)’ = ‘\(b\)’ is also not in the dictionary, so ‘\(b\)’ is added, giving \(\Omega \) = {‘\(a\)’, ‘\(b\)’}. This process continues until the last character of ‘\(x\)’, and we obtain \(\Omega (\mathbf{x}) = \{\mathbf{a, b, c, ab, ca, bc}\} \) and \(Comp({\mathbf{x}}) = 6 \), as shown in Table 6.
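For concreteness, the following is a minimal Python sketch of this parsing scheme (the function name lzw_dictionary_size is ours, not from the paper); under the convention traced above, it reproduces \( Comp(\mathbf{x}) = 6 \) for x = ‘abcabcabc’.

```python
def lzw_dictionary_size(x):
    """Parse x with the LZW-style scheme described above and return Comp(x),
    the number of dictionary entries created. The dictionary starts empty and
    the working string w is reset whenever a new entry is added."""
    dictionary = set()
    w = ''
    for c in x:                  # read the next character C
        wc = w + c
        if wc in dictionary:     # wC already known: extend the current phrase
            w = wc
        else:                    # wC is new: store it and start a new phrase
            dictionary.add(wc)
            w = ''
    return len(dictionary)

# 'abcabcabc' yields the dictionary {a, b, c, ab, ca, bc}, i.e., Comp(x) = 6
print(lzw_dictionary_size('abcabcabc'))
```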
1.2 Summary of related theorems
In this section, we summarize theorems from other works that are used in this paper.
1. Definition 3.2 of [16]: Given a compressor \( Comp \) and two strings \( \mathbf{x} \) and \( \mathbf{y} \), the conditional compressed information, \( Comp(\mathbf{y}|\mathbf{x}) \), is defined as follows:
$$\begin{aligned} Comp(\mathbf{y}|\mathbf{x}) = Comp(\mathbf{xy}) - Comp(\mathbf{x}) \end{aligned}$$ (9)
\( Comp(\mathbf{y}|\mathbf{x}) \) is interpreted as the number of bits of information in \( \mathbf{y} \) relative to \( \mathbf{x} \); a compressor-based sketch of this quantity, together with the NCD, is given after this list.
2. Theorem 3.3 of [55]:
$$\begin{aligned} K(\mathbf{xy}) \le K(\mathbf{x}) + K(\mathbf{y}) + \mathcal{O}(1) \end{aligned}$$ (10)
where \( K \) is the Kolmogorov complexity, and \( K(\mathbf{x}) \), \( K(\mathbf{y}) \), and \( K(\mathbf{xy}) \) are the lengths of the shortest programs that output \( \mathbf{x} \), \( \mathbf{y} \), and \( \mathbf{x} \) concatenated with \( \mathbf{y} \), respectively. Proof: refer to Theorem 3.3, page 43 of [55].
3. Remark II.1 of [34]: Given a string \( \mathbf{x} \), \( K(\mathbf{x}) \) is the length of the shortest binary program \( x^* \) that computes \( \mathbf{x} \) on an appropriate universal computer, such as the universal Turing machine. We require that \( \mathbf{x} \) can be decompressed from its compressed version \( x^* \) by a general decompressor program, but we do not require that \( \mathbf{x} \) can be compressed to \( x^* \) by a general compressor program. In fact, it is easy to prove that no such compressor program exists, since \( K(\mathbf{x}) \) is a non-computable function. Thus, \( K(\mathbf{x}) \) serves as the ultimate lower bound on what a real-world compressor can possibly achieve.
4. Theorem VI.2 of [34]: The normalized information distance \( d_k(\mathbf{x}, \mathbf{y}) \) is a lower bound of every normalized compression distance \( NCD(\mathbf{x}, \mathbf{y}) \), in the sense that \( d_k(\mathbf{x}, \mathbf{y}) \le NCD(\mathbf{x}, \mathbf{y}) + O(1/K) \), where \( K = \min\{K(\mathbf{x}), K(\mathbf{y})\} \). Proof: refer to Theorem VI.2 on page 6 of [34]. Note that in [34], the authors use \( d(x, y) \) and \( f(x, y) \) to refer to \( d_k(x, y) \) and \( NCD(x, y) \), respectively.
1.3 Pseudocode of the k-medoids algorithm
The pseudocode of the k-medoids algorithm is given in Algorithm 3. It consists of three steps. First, each data instance is assigned to the closest medoid according to the dissimilarity values in the distance matrix \(D\) (Lines 4–5). Then, the algorithm replaces the current medoid of each cluster with the instance that minimizes the total distance to the other instances in that cluster (Lines 9–11). This process is repeated until the medoids no longer change.
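As a rough illustration, the following Python sketch implements the three steps above on a precomputed dissimilarity matrix \(D\); the NumPy-based function k_medoids and its random initialization are our assumptions, not the authors' Algorithm 3.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Cluster n instances, given an n x n dissimilarity matrix D, into k clusters."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)   # initial medoids (our assumption)
    for _ in range(max_iter):
        # Step 1: assign each instance to its closest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Step 2: replace each medoid by the cluster member with the
        # smallest total distance to the other members of its cluster.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        # Step 3: stop when the medoids no longer change.
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)        # final assignment
    return labels, medoids
```

Such a routine can be run directly on the distance matrix produced by any of the compression-based dissimilarities above, so no feature extraction is needed.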