Subspace and projected clustering: experimental evaluation and analysis | Knowledge and Information Systems Skip to main content

Advertisement

Log in

Subspace and projected clustering: experimental evaluation and analysis

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Subspace and projected clustering have emerged as a possible solution to the challenges associated with clustering in high-dimensional data. Numerous subspace and projected clustering techniques have been proposed in the literature. A comprehensive evaluation of their advantages and disadvantages is urgently needed. In this paper, we evaluate systematically state-of-the-art subspace and projected clustering techniques under a wide range of experimental settings. We discuss the observed performance of the compared techniques, and we make recommendations regarding what type of techniques are suitable for what kind of problems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Achtert E, Böhm C, Kriegel H-P, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Ramamohanarao K, Krishna P, Mohania M, Nantajeewarawat E (eds) Proceedings of the 12th international conference on database systems for advanced applications (DASFAA), Bangkok, Thailand, 2007, pp 152–163

  2. Achtert E, Kriegel H-P, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, China

  3. Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Proceedings of the eighth international conference on database theory (ICDT), London, UK, 2001, pp 420–434

  4. Aggarwal C, Procopiuc C, Wolf J, Yu P, Park J (1999) Fast algorithms for projected clustering. In: Delis A, Faloutsos C, Ghandeharizadeh, S (eds) Proceedings of the ACM SIGMOD international conference on management of data, Philadelphia, PA, USA, 1999, pp 61–72

  5. Aggarwal C, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD international conference on management of data, Dallas, TX, USA, 2000, pp 70–81

  6. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high-dimensional data for data mining applications. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data, Seattle, WA, USA, 1998, pp 94–105

  7. Agrawal R, Srikan R (1994) Fast algorithms for mining association rules in large databases. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the international conference on very large data bases VLDB, Santiago de Chile, Chile, 1994, pp 487–499

  8. Ankerst M, Breunig M, Kriegel H-P, Sander J (1999) OPTICS: Ordering points to identify the clustering structure. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, USA, 1999, pp 49–60

  9. Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the seventh international conference on data mining (ICDM), Omaha, NE, USA, 2007, pp 409–414

  10. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?. In: Beeri C, Buneman P (eds) Proceedings of the seventh international conference on database theory (ICDT), Jerusalem, Israel, 1999, pp 217–235

  11. Böhm C, Kailing K, Kriegel H-P, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the fourth international conference on data mining (ICDM), Brighton, UK, 2004, pp 27–34

  12. Cheng C, Fu A, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, USA, 1999, pp 84–93

  13. Dempster A, Laird N, Rubin D (1977) Maximum likelihood for incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1): 1–38

    MATH  MathSciNet  Google Scholar 

  14. Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Barbara D, Kamath C (eds) Proceedings of the third SIAM international conference on data mining (SDM), San Francisco, CA, USA, 2003

  15. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second ACM international conference on knowledge discovery and data mining (KDD), Portland, OR, USA, 1996, pp 226–231

  16. Gondek D, Hofmann T (2007) Non-redundant data clustering. Knowl Inf Syst 12(1): 1–116

    Article  Google Scholar 

  17. Han J, Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Academic Press, London

    Google Scholar 

  18. Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Abbadi A, Brodie M, Chakravarthy, Dayal U, Kamel N, Schlageter G, Whang K (eds) Proceedings of the 26th international conference on very large data bases (VLDB), Cairo, Egypt, 2000, pp 506–515

  19. Hinneburg A, Keim D (2003) A general approach to clustering in large databases with noise. Knowl Inf Syst 5(4): 387–415

    Article  Google Scholar 

  20. Kailing K, Kriegel H-P, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: Berry M, Dayal U, Kamath C, Skilicorn D (eds) Proceedings of the fourth SIAM international conference on data mining (SDM), Orlando, FL, USA, 2004, pp 1–11

  21. Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analyis. Wiley, New York

    Google Scholar 

  22. Kriegel H-P, Kröger P, Renz M, Wurst S (2005) A generic framework for efficient subspace clustering of high-dimensional data. In: Proceedings of the fifth international conference on data mining, Houston, TX, USA, 2005, pp 250–257

  23. Kriegel H-P, Kröger P, Zimek A (2009) Clustering high dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1)

  24. Li T (2008) Clustering based on matrix approximation: a unifying view. Knowl Inf Syst 17(1): 1–133

    Article  Google Scholar 

  25. Liu G, Li J, Sim K, Wong L (2007) Distance based subspace clustering with flexible dimension partitioning. In: Proceedings of the 23rd international conference on data engineering (ICDE), Istanbul, Turkey, 2007, pp 1250–1254

  26. McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematics, statistics, and probabilistics, vol 1, 1967, pp 281–297

  27. Moise G, Sander J, Ester M (2006) P3C: A robust projected clustering algorithm. In: Proceedings of the sixth international conference on data mining (ICDM), Hong Kong, China, 2006, pp 414–425

  28. Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3): 273–298

    Article  MATH  Google Scholar 

  29. Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD), Las Vegas, NV, USA, 2008, pp 533–541

  30. Nagesh H, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the first SIAM international conference on data mining (SDM), Chicago, IL, USA, 2001, pp 1–17

  31. Newman D, Hettich S, Blake C, Merz C (1998) UCI repository of machine learning databases 1998

  32. Ng K, Fu A, Wong C (2005) Projective clustering by histograms. IEEE Trans Knowl Data Eng 17(3): 369–383

    Article  Google Scholar 

  33. Ng R, Han J(1994) Efficient and effective clustering methods for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the international conference on very large data bases VLDB, Santiago de Chile, Chile, 1994, pp 487–499

  34. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1): 90–105

    Article  Google Scholar 

  35. Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18(7): 902–916

    Article  Google Scholar 

  36. Procopiuc C, Jones M, Agarwal P, Murali T (2002) A Monte Carlo algorithm for fast projective clustering. In: Franklin M, Moon B, Ailamaki, A (eds) Proceedings of the ACM international conference on management of data (SIGMOD, Madison, WI, USA, 2002, pp 418–427

  37. R project http://www.r-project.org/

  38. UCI Machine Learning Repository http://archive.ics.uci.edu/ml/

  39. Sequeira K, Zaki M (2004) SCHISM: a new approach for interesting subspace mining. In: Proceedings of the fourth international conference on data mining (ICDM), Brighton, UK, 2004, pp 186–193

  40. Tang J, Chen J, Fu A, Cheung W (2007) Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inf Syst 11(1): 45–84

    Article  Google Scholar 

  41. Yip K, Cheung D, Ng M (2004) HARP: a practical projected clustering algorithm. IEEE Trans Knowl Data Eng 16(11): 1387–1397

    Article  Google Scholar 

  42. Yip K, Cheung D, Ng M (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the 21st international conference on data engineering (ICDE), Tokyo, Japan, 2005, pp 329–340

  43. Yip K, Qi P, Schultz M, Cheung D, Cheung K (2006) SemBiosphere: a semantic web approach to recommending microarray clustering services. In: Proceedings of the 11th pacific symposium on biocomputing (PSB), Maui, HI, USA, 2006

  44. Yiu M, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Trans Knowl Data Eng 17(2): 176–189

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Gabriela Moise.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Moise, G., Zimek, A., Kröger, P. et al. Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21, 299–326 (2009). https://doi.org/10.1007/s10115-009-0226-y

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-009-0226-y

Keywords

Navigation