Abstract
Clustering is a representative grouping process used to uncover hidden information and to understand the characteristics of a dataset as a basis for further analysis. The similarity and dissimilarity of objects is the fundamental deciding factor in clustering, and how it is measured largely determines the quality of the results. When the attributes are categorical, it is not simple to quantify the dissimilarity of data objects that have unimportant attributes or synonymous values. We propose a new way to quantify the dissimilarity of objects using the distribution information of the data correlated with each categorical value. Our method discovers intrinsic relationships among values and measures the dissimilarity of objects effectively. The approach is not tightly coupled to any clustering algorithm, so it can be applied flexibly to various algorithms. Experiments on both synthetic and real datasets show the validity and effectiveness of the method. When our method is applied to traditional clustering algorithms alone, the results are considerably better than those of previous methods.
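The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that two values of one attribute are similar when they co-occur with similar distributions of values in the other attributes. The function names, the use of total variation distance, and the plain averaging over attributes are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def value_dissimilarity(rows, attr):
    """Return {(a, b): d} for each pair of values a < b in column `attr`.

    d is the total variation distance between the distributions of the
    other columns' values among rows with attr == a and rows with
    attr == b, averaged over those columns (an assumed, illustrative choice).
    """
    others = [j for j in range(len(rows[0])) if j != attr]
    # counts[v][j] counts the column-j values seen in rows whose `attr` value is v.
    counts = defaultdict(lambda: defaultdict(Counter))
    totals = Counter()
    for row in rows:
        v = row[attr]
        totals[v] += 1
        for j in others:
            counts[v][j][row[j]] += 1

    values = sorted(totals)
    result = {}
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            per_column = []
            for j in others:
                keys = set(counts[a][j]) | set(counts[b][j])
                tv = 0.5 * sum(
                    abs(counts[a][j][k] / totals[a] - counts[b][j][k] / totals[b])
                    for k in keys
                )
                per_column.append(tv)
            # With no other columns there is no evidence; treat values as distinct.
            result[(a, b)] = sum(per_column) / len(per_column) if per_column else 1.0
    return result


def object_dissimilarity(x, y, tables):
    """Average the per-attribute value dissimilarities of two objects.

    `tables[j]` is the dict returned by value_dissimilarity(rows, j).
    Equal values contribute 0.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a != b:
            key = (a, b) if a <= b else (b, a)
            total += tables[j][key]
    return total / len(x)


# Hypothetical toy data: "NYC" and "New York" act as synonymous values.
rows = [
    ("NYC", "USA", "en"),
    ("New York", "USA", "en"),
    ("NYC", "USA", "en"),
    ("Paris", "France", "fr"),
    ("Paris", "France", "fr"),
]
tables = [value_dissimilarity(rows, j) for j in range(len(rows[0]))]
print(tables[0])
print(object_dissimilarity(rows[0], rows[1], tables))
```

On this toy data the synonymous pair ("NYC", "New York") gets a dissimilarity of 0 because both values co-occur with identical values in the other columns, whereas simple value matching would count them as a mismatch. Because the measure yields an ordinary pairwise dissimilarity, it can be passed to any distance-based clustering algorithm, which matches the loose coupling the abstract describes.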
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, J., Lee, YJ., Park, M. (2009). Clustering with Domain Value Dissimilarity for Categorical Data. In: Perner, P. (eds) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2009. Lecture Notes in Computer Science, vol 5633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03067-3_25
DOI: https://doi.org/10.1007/978-3-642-03067-3_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03066-6
Online ISBN: 978-3-642-03067-3
eBook Packages: Computer Science (R0)