A feature-free and parameter-light multi-task clustering framework | Knowledge and Information Systems

A feature-free and parameter-light multi-task clustering framework

  • Regular Paper
  • Published in Knowledge and Information Systems


Abstract

The last two decades have witnessed extensive research on multi-task learning algorithms in diverse domains such as bioinformatics, text mining, natural language processing, and image and video content analysis. However, all existing multi-task learning methods require either domain-specific knowledge to extract features or a careful setting of many input parameters. There are many disadvantages associated with requiring prior knowledge for feature extraction or with parameter-laden approaches. One of the most obvious is that we may report an incorrect or non-existent pattern because of poorly extracted features or incorrectly set parameters. In this work, we propose a feature-free and parameter-light multi-task clustering framework to overcome these disadvantages. Our proposal is motivated by the recent successes of Kolmogorov complexity-based methods in various applications. However, such methods are only defined for single-task problems because they lack a mechanism for sharing knowledge between different tasks. To address this problem, we create a novel dictionary-based compression dissimilarity measure that allows us to share knowledge across different tasks effectively. Experimental results with extensive comparisons demonstrate the generality and the effectiveness of our proposal.


Notes

  1. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  2. Referenced theorems used in this work are summarized in Appendix 1.2.

  3. More details of this definition are given in Appendix 1.2.

  4. http://www.un.org/en/documents/udhr/.

  5. http://www.freelang.net/families/.

  6. http://www.ncbi.nlm.nih.gov/genbank/.

  7. http://www.wikipedia.org/.

  8. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  9. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.

  10. A dendrogram lists all of the instances and indicates at what level of dissimilarity two clusters were joined.

References

  1. Agarwal D (2007) Detecting Anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst (KAIS) 11(1):29–44

  2. Allison L, Stern L, Edgoose T, Dix TI (2000) Sequence complexity for biological sequence analysis. Comput Chem 24(1):43–55

  3. Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(4):461–486

  4. Ando RK, Zhang T, Bartlett P (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853

  5. Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. In: NIPS, pp 41–48

  6. Baralis E, Bruno G, Fiori A (2011) Measuring gene similarity by means of the classification distance. Knowl Inf Syst (KAIS) 29(1):81–101

  7. Baxter J (1995) Learning internal representations. In: COLT: proceedings of the workshop on computational learning theory, pp 311–320

  8. Baxter J (2000) A model of inductive bias learning. J Artif Intell Res 12:149–198

  9. Benedetto D, Caglioti E, Loreto V (2002) Language trees and zipping. Phys Rev Lett 88(4):2–5

  10. Bhattacharya I, Godbole S, Joshi S, Verma A (2009) Cross-guided clustering: transfer of relevant supervision across domains for improved clustering. In: ICDM, pp 41–50

  11. Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: EMNLP, pp 120–128

  12. Campana BJL, Keogh E (2010) A compression based distance measure for texture. In: SDM, pp 850–861

  13. Caruana R (1997) Multitask learning. Mach Learn 28:41–75

  14. Chapelle O, Shivaswamy PK, Vadrevu S, Weinberger KQ, Zhang Y, Tseng BL (2010) Multi-task learning for boosting with application to web search ranking. In: KDD, pp 1189–1198

  15. Chen X, Kwong S, Li M (2000) A compression algorithm for DNA sequences and its applications in genome comparison. In: RECOMB, pp 52–61

  16. Cilibrasi R, Vitanyi P (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545

  17. Cilibrasi R, Vitányi P, Wolf RD (2004) Algorithmic clustering of music based on string compression. Comput Music J 28:49–67

  18. Cilibrasi R, Vitányi PMB (2007) The Google similarity distance. IEEE Trans Knowl Data Eng 19(3):370–383

  19. Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York

  20. Dai W, Yang Q, Xue GR, Yu Y (2007) Boosting for transfer learning. In: ICML, pp 193–200

  21. Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: KDD, pp 269–274

  22. Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637

  23. Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI (2009) An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst (KAIS) 19(2):249–264

  24. Fodor IK (2002) A survey of dimension reduction techniques. Technical Report, US Department of Energy

  25. Gu Q, Li Z, Han J (2011) Learning a kernel for multi-task clustering. In: AAAI, pp 368–373

  26. Gu Q, Zhou J (2009) Learning the shared subspace for multi-task clustering and transductive transfer classification. In: ICDM, pp 159–168

  27. Huy TN, Shao H, Tong B, Suzuki E. Website for this paper. http://sites.google.com/site/kaishuy/

  28. Jing L, Ng MK, Huang JZ (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst (KAIS) 25(1):35–55

  29. Jolliffe IT (2002) Principal component analysis, series: Springer series in statistics, 2nd edn. Springer, NY

  30. Juba B (2006) Estimating relatedness via data compression. In: ICML, pp 441–448

  31. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20:359–392

  32. Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: KDD, pp 206–215

  33. Lawrence ND, Platt JC (2004) Learning to learn with the informative vector machine. In: ICML, pp 65–72

  34. Li M, Chen X, Li X, Ma B, Vitanyi P (2003) The similarity metric. In: ACM-SIAM symposium on discrete algorithms, pp 863–872

  35. Liu Q, Liao X, Carin HL, Stack JR, Carin L (2009) Semisupervised multitask learning. IEEE Trans PAMI 31:1074–1086

  36. Liu Q, Xu Q, Zheng VW, Xue H, Cao Z, Yang Q (2010) Multi-task learning for cross-platform siRNA efficacy prediction: an In-silico study. BMC Bioinform 11:181

  37. Lovász L, Plummer MD (1986) Matching theory. North-Holland, Amsterdam

  38. Lowenstern D, Hirsh H, Noordiwier M, Yianilos P (1995) DNA sequence classification using compression-based induction. Technical report, Center for Discrete Mathematics and Theoretical Computer Science

  39. Mahmud MM (2007) On universal transfer learning. In: ALT, pp 135–149

  40. Mahmud MM, Ray SR (2008) Transfer learning using Kolmogorov complexity: basic theory and empirical evaluations. In: NIPS, pp 985–992

  41. Li M, Vitányi P (1997) An introduction to Kolmogorov complexity and its applications. Springer, New York

  42. Molina LC, Belanche L, Nebot A (2002) Feature selection algorithms: a survey and experimental evaluation. In: ICDM, pp 306–313

  43. Mount DW (2004) Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press, New York

  44. Ozawa S, Roy A, Roussinov D (2009) A multitask learning model for online pattern recognition. IEEE Trans Neural Netw 20(3):430–445

  45. Pham TD, Zuegg J (2004) A probabilistic measure for alignment-free sequence comparison. Bioinformatics 20(18):3455–3461

  46. Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804

  47. Schiffman SS, Reynolds ML, Young FW (1981) Introduction to multidimensional scaling: theory, methods, and applications. Erlbaum Associates, New York

  48. Schwaighofer A, Tresp V, Yu K (2004) Learning Gaussian process kernels via hierarchical Bayes. In: NIPS, pp 1209–1216

  49. Shi Y, Lan Z, Liu W, Bi W (2009) Extending semi-supervised learning methods for inductive transfer learning. In: ICDM, pp 483–492

  50. Skibinski P, Grabowski S, Deorowicz S (2005) Revisiting dictionary-based compression. Softw Pract Exp 35:1455–1476

  51. Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: SIGIR, pp 208–215

  52. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, pp 25–36

  53. Thrun S, Pratt LY (eds) (1998) Learning to learn. Kluwer, Boston

  54. Vinga S, Almeida JS (2003) Alignment-free sequence comparison-a review. Bioinformatics 19(4):513–523

  55. Vitanyi PMB, Balbach FJ, Cilibrasi R, Li M (2008) Normalized information distance. In: Information theory and statistical learning, pp 45–82

  56. Welch TA (1984) A technique for high-performance data compression. Computer 17(6):8–19

  57. Xing Z, Pei J, Keogh EJ (2010) A brief survey on sequence classification. SIGKDD Explor 12(1):40–48

  58. Yu K, Schwaighofer A, Tresp V (2003) Collaborative ensemble learning: combining collaborative and content-based information filtering via hierarchical Bayes. In: UAI, pp 616–623

  59. Zhang J, Ghahramani Z, Yang Y (2006) Learning multiple related tasks using latent independent component analysis. In: NIPS, pp 1585–1592

  60. Zhang J, Zhang C (2010) Multitask bregman clustering. In: AAAI, pp 655–660

  61. Zheng VW, Pan SJ, Yang Q, Pan JJ (2008) Transferring multi-device localization models using latent multi-task learning. In: AAAI, pp 1427–1432

Acknowledgments

Part of this research was supported by the Strategic International Cooperative Program funded by the Japan Science and Technology Agency (JST) and by the Grant-in-Aid for Scientific Research (B) 21300053 from the Japanese Ministry of Education, Culture, Sports, Science and Technology. Thach Nguyen Huy is sponsored by a JICA Scholarship.

Author information

Corresponding author: Thach Nguyen Huy.

Appendix

1.1 An example of parsing a string with the LZW algorithm

Given a string x = ‘abcabcabc’, the LZW parsing is shown in Table 6. In the initial step, the dictionary \( \Omega \) is set to the empty set, the string \( w \) is set to the empty string, and \( Comp(x) = 0 \). In the first iteration, the first character ‘a’ is read and stored in the variable \( C \). Since ‘\(wC\)’ = ‘a’ is not in the dictionary, ‘a’ is added to it, giving \( \Omega \) = {‘a’} and \( Comp(x) = 1 \), while \( w \) remains the empty string. In the second iteration, \( C \) = ‘b’; ‘\(wC\)’ = ‘b’ is also not in the dictionary, so ‘b’ is added and \( \Omega \) = {‘a’, ‘b’}. This process continues until the last character of \( x \) is read, and we obtain \( \Omega (\mathbf{x}) = \{\mathbf{a, b, c, ab, ca, bc}\} \) and \( Comp(\mathbf{x}) = 6 \), as shown in Table 6; a short code sketch reproducing this parse follows the table.

Table 6 Parsing string x = ‘abcabcabc’ by LZW algorithm
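
The parsing rule illustrated in Table 6 can be reproduced with a few lines of code. The following Python sketch is a minimal implementation of the variant described above (the dictionary starts empty and the working string \( w \) is reset whenever a new phrase is stored); it is an illustrative sketch rather than the authors' code, and it does not count a possible unfinished phrase left over at the end of the string.

```python
def lzw_parse(x: str):
    """LZW-style parsing as in Appendix 1.1: count how many distinct phrases
    are added to an initially empty dictionary while scanning x."""
    dictionary = set()   # Omega
    w = ""               # current phrase being extended
    comp = 0             # Comp(x): number of phrases added so far
    for c in x:
        wc = w + c
        if wc in dictionary:
            w = wc                 # keep extending the known phrase
        else:
            dictionary.add(wc)     # new phrase: store it and count it
            comp += 1
            w = ""                 # restart from the empty phrase
    return dictionary, comp

phrases, comp = lzw_parse("abcabcabc")
print(sorted(phrases, key=lambda p: (len(p), p)))  # ['a', 'b', 'c', 'ab', 'bc', 'ca']
print(comp)                                        # 6
```

Running the sketch on x = ‘abcabcabc’ yields the dictionary {a, b, c, ab, ca, bc} and \( Comp(x) = 6 \), matching Table 6.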

1.2 Summary of related theorems

In this section, we summarize theorems from other works that are used in this paper.

  1.

    Definition 3.2 of [16]: Given a compressor \( Comp \) and two strings \( \mathbf{x} \) and \( \mathbf{y} \), the conditional compressed information, \( Comp(\mathbf{y}|\mathbf{x}) \), is defined as follows:

    $$\begin{aligned} {Comp(\mathbf{y}|\mathbf{x})} = {Comp(\mathbf{xy})} - {Comp(\mathbf{x})} \end{aligned}$$
    (9)

    \( Comp(\mathbf{y}|\mathbf{x}) \) can be interpreted as the number of bits of information in \( \mathbf{y} \) relative to \( \mathbf{x} \); a compressor-based sketch of this quantity and of the NCD appears after this list.

  2.

    Theorem 3.3 of [55]:

    $$\begin{aligned} {K(\mathbf{xy})} \le {K(\mathbf{x})} + {K(\mathbf{y})} + \mathcal{O}(1) \end{aligned}$$
    (10)

    where \( K \) is the Kolmogorov complexity, and \( K(\mathbf{x}) \), \( K(\mathbf{y}) \), and \( K(\mathbf{xy}) \) are the lengths of the shortest programs that output \( \mathbf{x} \), \( \mathbf{y} \), and \( \mathbf{x} \) concatenated with \( \mathbf{y} \), respectively. Proof. Refer to Theorem 3.3, page 43 of [55].

  3.

    Remark II.1 of [34]: Given a string \( \mathbf{x} \), \( K(\mathbf{x}) \) is the length of the shortest binary program \( x^* \) that computes \( \mathbf{x} \) on an appropriate universal computer, such as a universal Turing machine. We require that \( \mathbf{x} \) can be decompressed from its compressed version \( x^* \) by a general decompressor program, but we do not require that \( \mathbf{x} \) can be compressed to \( x^* \) by a general compressor program. In fact, it is easy to prove that no such compressor program exists, since \( K(\mathbf{x}) \) is a non-computable function. Thus, \( K(\mathbf{x}) \) serves as the ultimate lower bound of what a real-world compressor can possibly achieve.

  4.

    Theorem VI.2 of [34]: The normalized information distance \( d_k(\mathbf{x}, \mathbf{y}) \) is a lower bound of every normalized compression distance \( NCD(\mathbf{x}, \mathbf{y}) \), in the sense that \( d_k(\mathbf{x}, \mathbf{y}) \le NCD(\mathbf{x}, \mathbf{y}) + O(1/K) \), where \( K = \min\{K(\mathbf{x}), K(\mathbf{y})\} \). Proof. Refer to Theorem VI.2 on page 6 of [34]. Note that in [34], the authors use \( d(x, y) \) and \( f(x, y) \) to refer to \( d_k(x, y) \) and \( NCD(x, y) \), respectively.
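
For concreteness, the quantities above can be approximated by letting any off-the-shelf compressor stand in for \( Comp \). The Python sketch below uses zlib together with Eq. (9) and the standard NCD formula of [16]; the choice of compressor, the helper names, and the test strings are illustrative assumptions, not the implementation used in the paper.

```python
import zlib

def comp(s: bytes, level: int = 9) -> int:
    """Approximate Comp(s) as the zlib-compressed length in bytes.
    Any real-world compressor could be substituted here."""
    return len(zlib.compress(s, level))

def conditional_comp(y: bytes, x: bytes) -> int:
    """Conditional compressed information Comp(y|x) = Comp(xy) - Comp(x), Eq. (9)."""
    return comp(x + y) - comp(x)

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance of [16]:
    NCD(x, y) = (Comp(xy) - min(Comp(x), Comp(y))) / max(Comp(x), Comp(y))."""
    cx, cy, cxy = comp(x), comp(y), comp(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

x = b"the quick brown fox jumps over the lazy dog. " * 20
y = b"the quick brown fox leaps over the lazy dog. " * 20
z = b"lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20
print(conditional_comp(y, x))     # small: y adds little information given x
print(ncd(x, y), ncd(x, z))       # ncd(x, y) is expected to be smaller than ncd(x, z)
```

Because zlib only approximates the Kolmogorov complexity \( K \), the resulting values are upper bounds in the sense of Remark II.1 and may occasionally fall slightly outside [0, 1].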

 

1.3 Pseudocode of the k-medoids algorithm

The pseudocode of the k-medoids algorithm is given in Algorithm 3; a Python sketch of the same procedure follows the algorithm. It consists of three steps. First, each data instance is assigned to the closest medoid based on the dissimilarity values given in the distance matrix \(D\) (Lines 4–5). Then, the algorithm replaces the current medoid of each cluster with the instance that minimizes the total distance to the other instances in that cluster (Lines 9–11). This process is repeated until the medoids no longer change.

Algorithm 3 Pseudocode of the k-medoids algorithm
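
Since the algorithm listing may not display in this preview, the following Python sketch reproduces the three steps described above for a precomputed dissimilarity matrix \(D\). The random initialization, the handling of empty clusters, and the max_iter safeguard are illustrative assumptions and not necessarily the details of Algorithm 3.

```python
import random
import numpy as np

def k_medoids(D: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """A minimal k-medoids sketch over a precomputed n x n dissimilarity matrix D."""
    rng = random.Random(seed)
    n = D.shape[0]
    medoids = rng.sample(range(n), k)        # assumed random initialization
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        # Step 1: assign each instance to its closest medoid (Lines 4-5).
        labels = np.argmin(D[:, medoids], axis=1)
        # Step 2: within each cluster, pick as the new medoid the instance
        # minimizing the total distance to the other members (Lines 9-11).
        new_medoids = []
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:            # keep the old medoid if its cluster emptied
                new_medoids.append(medoids[j])
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[np.argmin(costs)]))
        # Step 3: stop when the medoids no longer change.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels, medoids
```

In the framework of this paper, \(D\) would hold the compression-based dissimilarities between instances, so the clustering step itself requires no feature extraction.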


About this article

Cite this article

Huy, T.N., Shao, H., Tong, B. et al. A feature-free and parameter-light multi-task clustering framework. Knowl Inf Syst 36, 251–276 (2013). https://doi.org/10.1007/s10115-012-0550-5
