Abstract
Subspace and projected clustering have emerged as a possible solution to the challenges associated with clustering in high-dimensional data. Numerous subspace and projected clustering techniques have been proposed in the literature. A comprehensive evaluation of their advantages and disadvantages is urgently needed. In this paper, we evaluate systematically state-of-the-art subspace and projected clustering techniques under a wide range of experimental settings. We discuss the observed performance of the compared techniques, and we make recommendations regarding what type of techniques are suitable for what kind of problems.
Similar content being viewed by others
References
Achtert E, Böhm C, Kriegel H-P, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Ramamohanarao K, Krishna P, Mohania M, Nantajeewarawat E (eds) Proceedings of the 12th international conference on database systems for advanced applications (DASFAA), Bangkok, Thailand, 2007, pp 152–163
Achtert E, Kriegel H-P, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, China
Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Proceedings of the eighth international conference on database theory (ICDT), London, UK, 2001, pp 420–434
Aggarwal C, Procopiuc C, Wolf J, Yu P, Park J (1999) Fast algorithms for projected clustering. In: Delis A, Faloutsos C, Ghandeharizadeh, S (eds) Proceedings of the ACM SIGMOD international conference on management of data, Philadelphia, PA, USA, 1999, pp 61–72
Aggarwal C, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD international conference on management of data, Dallas, TX, USA, 2000, pp 70–81
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high-dimensional data for data mining applications. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data, Seattle, WA, USA, 1998, pp 94–105
Agrawal R, Srikan R (1994) Fast algorithms for mining association rules in large databases. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the international conference on very large data bases VLDB, Santiago de Chile, Chile, 1994, pp 487–499
Ankerst M, Breunig M, Kriegel H-P, Sander J (1999) OPTICS: Ordering points to identify the clustering structure. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, USA, 1999, pp 49–60
Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the seventh international conference on data mining (ICDM), Omaha, NE, USA, 2007, pp 409–414
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?. In: Beeri C, Buneman P (eds) Proceedings of the seventh international conference on database theory (ICDT), Jerusalem, Israel, 1999, pp 217–235
Böhm C, Kailing K, Kriegel H-P, Kröger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the fourth international conference on data mining (ICDM), Brighton, UK, 2004, pp 27–34
Cheng C, Fu A, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, USA, 1999, pp 84–93
Dempster A, Laird N, Rubin D (1977) Maximum likelihood for incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1): 1–38
Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Barbara D, Kamath C (eds) Proceedings of the third SIAM international conference on data mining (SDM), San Francisco, CA, USA, 2003
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second ACM international conference on knowledge discovery and data mining (KDD), Portland, OR, USA, 1996, pp 226–231
Gondek D, Hofmann T (2007) Non-redundant data clustering. Knowl Inf Syst 12(1): 1–116
Han J, Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Academic Press, London
Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Abbadi A, Brodie M, Chakravarthy, Dayal U, Kamel N, Schlageter G, Whang K (eds) Proceedings of the 26th international conference on very large data bases (VLDB), Cairo, Egypt, 2000, pp 506–515
Hinneburg A, Keim D (2003) A general approach to clustering in large databases with noise. Knowl Inf Syst 5(4): 387–415
Kailing K, Kriegel H-P, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: Berry M, Dayal U, Kamath C, Skilicorn D (eds) Proceedings of the fourth SIAM international conference on data mining (SDM), Orlando, FL, USA, 2004, pp 1–11
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analyis. Wiley, New York
Kriegel H-P, Kröger P, Renz M, Wurst S (2005) A generic framework for efficient subspace clustering of high-dimensional data. In: Proceedings of the fifth international conference on data mining, Houston, TX, USA, 2005, pp 250–257
Kriegel H-P, Kröger P, Zimek A (2009) Clustering high dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1)
Li T (2008) Clustering based on matrix approximation: a unifying view. Knowl Inf Syst 17(1): 1–133
Liu G, Li J, Sim K, Wong L (2007) Distance based subspace clustering with flexible dimension partitioning. In: Proceedings of the 23rd international conference on data engineering (ICDE), Istanbul, Turkey, 2007, pp 1250–1254
McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematics, statistics, and probabilistics, vol 1, 1967, pp 281–297
Moise G, Sander J, Ester M (2006) P3C: A robust projected clustering algorithm. In: Proceedings of the sixth international conference on data mining (ICDM), Hong Kong, China, 2006, pp 414–425
Moise G, Sander J, Ester M (2008) Robust projected clustering. Knowl Inf Syst 14(3): 273–298
Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD), Las Vegas, NV, USA, 2008, pp 533–541
Nagesh H, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the first SIAM international conference on data mining (SDM), Chicago, IL, USA, 2001, pp 1–17
Newman D, Hettich S, Blake C, Merz C (1998) UCI repository of machine learning databases 1998
Ng K, Fu A, Wong C (2005) Projective clustering by histograms. IEEE Trans Knowl Data Eng 17(3): 369–383
Ng R, Han J(1994) Efficient and effective clustering methods for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the international conference on very large data bases VLDB, Santiago de Chile, Chile, 1994, pp 487–499
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1): 90–105
Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18(7): 902–916
Procopiuc C, Jones M, Agarwal P, Murali T (2002) A Monte Carlo algorithm for fast projective clustering. In: Franklin M, Moon B, Ailamaki, A (eds) Proceedings of the ACM international conference on management of data (SIGMOD, Madison, WI, USA, 2002, pp 418–427
R project http://www.r-project.org/
UCI Machine Learning Repository http://archive.ics.uci.edu/ml/
Sequeira K, Zaki M (2004) SCHISM: a new approach for interesting subspace mining. In: Proceedings of the fourth international conference on data mining (ICDM), Brighton, UK, 2004, pp 186–193
Tang J, Chen J, Fu A, Cheung W (2007) Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inf Syst 11(1): 45–84
Yip K, Cheung D, Ng M (2004) HARP: a practical projected clustering algorithm. IEEE Trans Knowl Data Eng 16(11): 1387–1397
Yip K, Cheung D, Ng M (2005) On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: Proceedings of the 21st international conference on data engineering (ICDE), Tokyo, Japan, 2005, pp 329–340
Yip K, Qi P, Schultz M, Cheung D, Cheung K (2006) SemBiosphere: a semantic web approach to recommending microarray clustering services. In: Proceedings of the 11th pacific symposium on biocomputing (PSB), Maui, HI, USA, 2006
Yiu M, Mamoulis N (2005) Iterative projected clustering by subspace mining. IEEE Trans Knowl Data Eng 17(2): 176–189
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Moise, G., Zimek, A., Kröger, P. et al. Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21, 299–326 (2009). https://doi.org/10.1007/s10115-009-0226-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-009-0226-y