Abstract
Regression modeling and clustering have each been studied extensively as separate techniques. There has been some work on using regression-based algorithms to partition a classical data set into clusters; we propose one such algorithm for clustering interval-valued data. The new algorithm is based on the k-means algorithm of MacQueen (1967) and the dynamical partitioning method of Diday and Simon (1976), with the partitioning criteria based on fitting a regression model to each sub-cluster. The partitioning also depends on distance measures between the underlying regression models for the sub-clusters. Simulated data sets are generated for several different data structures, and on these the proposed k-regressions algorithm consistently outperforms the k-means algorithm. Elbow plots are used to identify the total number of clusters K in the partition. The new method is also applied to real data.
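To make the idea concrete, the following is a minimal sketch of a k-means-style regression partition, not the authors' algorithm. It rests on several assumptions that are not in the abstract: each interval is reduced to its midpoint, observations are reassigned by squared residual rather than by the paper's distance measures between regression models, and the function name `k_regressions`, the demo data, and all parameters are hypothetical.

```python
import numpy as np

def k_regressions(X, y, K, n_iter=50, seed=0):
    """Alternate between fitting one linear model per cluster and
    reassigning each observation to the model that fits it best."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    labels = rng.integers(K, size=n)          # random initial partition
    coefs = np.zeros((K, A.shape[1]))
    for _ in range(n_iter):
        for k in range(K):
            members = labels == k
            if members.sum() >= A.shape[1]:   # refit only if enough points
                coefs[k], *_ = np.linalg.lstsq(A[members], y[members], rcond=None)
        resid2 = (y[:, None] - A @ coefs.T) ** 2   # n x K squared residuals
        new_labels = resid2.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # partition is stable
            break
        labels = new_labels
    sse = resid2[np.arange(n), labels].sum()       # within-cluster residual SS
    return labels, coefs, sse

# Hypothetical interval-valued data: two groups following different lines.
rng = np.random.default_rng(1)
n = 200
group = np.repeat([0, 1], n // 2)
xc = rng.uniform(0, 10, n)
yc = np.where(group == 0, 1 + 2 * xc, 20 - 1.5 * xc) + rng.normal(0, 0.3, n)
x_lo, x_hi = xc - 0.5, xc + 0.5   # interval end points for the predictor
y_lo, y_hi = yc - 0.5, yc + 0.5   # interval end points for the response

# Assumption: reduce each interval to its midpoint before fitting.
Xc = ((x_lo + x_hi) / 2).reshape(-1, 1)
Yc = (y_lo + y_hi) / 2

# Residual sum of squares for several K; plotting these against K
# gives an elbow plot for choosing the number of clusters.
sse_by_K = {K: k_regressions(Xc, Yc, K)[2] for K in (1, 2, 3)}
print(sse_by_K)
```

Because the initial partition is random, a single run can settle in a local optimum; in practice one would probably compare several seeds and keep the partition with the smallest residual sum of squares.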
References
Anderberg, M.R. (1973). Cluster analysis for applications. New York: Academic Press.
Batagelj, V., Kejžar, N., & Korenjak-Černe, S. (2015). Clustering of modal valued symbolic data. Machine Learning. arXiv:1507.06683.
Bertrand, P., & Goupil, F. (2000). Descriptive statistics for symbolic data. In H.-H. Bock & E. Diday (Eds.) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data (pp. 103–124). Berlin: Springer.
Billard, L. (2011). Brief overview of symbolic data and analytic issues. Statistical Analysis and Data Mining, 4, 149–156.
Billard, L. (2014). The past’s present is now. What will the present’s future bring? In X. Lin, C. Genest, D.L. Banks, G. Molenberghs, D.W. Scott, & J.-L. Wang (Eds.) Past, present, and future of statistical science (pp. 323–334). New York: Chapman and Hall.
Billard, L., & Diday, E. (2000). Regression analysis for interval-valued data. In H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, & M. Schader (Eds.) Data analysis, classification, and related methods (pp. 369–374). Springer.
Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: Symbolic data analysis. Journal of the American Statistical Association, 98, 470–487.
Billard, L., & Diday, E. (2006). Symbolic data analysis: conceptual statistics and data mining. Chichester: Wiley.
Bock, H.-H. (2007). Clustering methods: A history of k-means algorithms. In P. Brito, P. Bertrand, G. Cucumel, & F. de Carvalho (Eds.) Selected contributions in data analysis and classification (pp. 161–172). Berlin: Springer.
Bock, H.-H. (2008). Origins and extensions of the k-means algorithm in cluster analysis. Journal Électronique d’Histoire des Probabilités et de la Statistique, 4, 1–18.
Bock, H.-H., & Diday, E. (2000). Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data. Berlin: Springer.
Bougeard, S., Abdi, H., Saporta, G., & Niang, N. (2018). Clusterwise analysis for multiblock component methods. Advances in Data Analysis and Classification, 12, 285–313.
Bougeard, S., Cariou, V., Saporta, G., & Niang, N. (2017). Prediction for regularized clusterwise multiblock regression. Applied Stochastic Models in Business and Industry, 34, 852–867.
Brusco, M.J., Cradit, J.D., Steinley, D., & Fox, G.L. (2008). Cautionary remarks on the use of clusterwise regression. Multivariate Behavioral Research, 43, 29–49.
Charles, C. (1977). Régression typologique et reconnaissance des formes. Thèse de 3ème cycle, Université de Paris-Dauphine.
Chavent, M., & Lechevallier, Y. (2002). Dynamical clustering of interval data: Optimization of an adequacy criterion based on Hausdorff distance. In K. Jajuga, A. Sokolowski, & H.-H. Bock (Eds.) Classification, clustering, and data analysis (pp. 53–60). Berlin: Springer.
Cormack, R.M. (1971). A review of classification. Journal of the Royal Statistical Society A, 134, 321–367.
de Carvalho, F.A.T., Lima Neto, E.A., & Tenorio, C.P. (2004a). A new method to fit a linear regression model for interval-valued data. In Lecture notes in computer science, KI2004 advances in artificial intelligence (pp. 295–306). Springer.
de Carvalho, F.A.T., de Souza, R.M.C.R., & Silva, F.C.D. (2004b). A clustering method for symbolic interval-type data using adaptive Chebyshev distances. In A.L.C. Bazzan & S. Labidi (Eds.) LNAI 3171 (pp. 266–275). Berlin: Springer.
de Carvalho, F.A.T., Brito, M.P., & Bock, H.-H. (2006). Dynamic clustering for interval data based on L2 distance. Computational Statistics, 21, 231–250.
de Carvalho, F.A.T., & Lechevallier, Y. (2009). Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42, 1223–1236.
de Carvalho, F.A.T., Saporta, G., & Queiroz, D.N. (2010). A clusterwise center and range regression model for interval-valued data. In Y. Lechevallier & G. Saporta (Eds.) Proceedings in computational statistics COMPSTAT 2010 (pp. 461–468). Berlin: Springer.
DeSarbo, W.S., & Cron, W.L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5, 249–282.
de Souza, R.M.C.R., & de Carvalho, F.A.T. (2004). Clustering of interval data based on city-block distances. Pattern Recognition Letters, 25, 353–365.
de Souza, R.M.C.R., de Carvalho, F.A.T., Tenório, C.P., & Lechevallier, Y. (2004). Dynamic cluster methods for interval data based on Mahalanobis distances. In D. Banks, L. House, F.R. McMorris, P. Arabie, & W. Gaul (Eds.) Classification, clustering, and data analysis (pp. 251–360). Berlin: Springer.
Diday, E. (1971a). Une nouvelle méthode de classification automatique et reconnaissance des formes: la méthode des nuées dynamiques. Revue de Statistique Appliquée, 2, 19–33.
Diday, E. (1971b). La méthode des nuées dynamiques. Revue de Statistique Appliquée, 19, 19–34.
Diday, E. (1987). Introduction à l’approche symbolique en analyse des données. Premières Journées Symbolique-Numérique, CEREMADE, Université Paris-Dauphine, 21–56.
Diday, E. (2016). Thinking by classes in data science: The symbolic data analysis paradigm. WIREs Computational Statistics, 8, 172–205.
Diday, E., & Noirhomme-Fraiture, M. (2008). Symbolic data analysis and the SODAS software. Chichester: Wiley.
Diday, E., & Simon, J.C. (1976). Clustering analysis. In K.S. Fu (Ed.) Digital pattern recognition (pp. 47–94). Berlin: Springer.
Draper, N.R., & Smith, H. (1966). Applied regression analysis. New York: Wiley.
Hausdorff, F. (1937). Set theory (translated into English by J.R. Aumann 1957). New York: Chelsea.
Irpino, A., Verde, R., & Lechevallier, Y. (2006). Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi & M. Vichi (Eds.) COMPSTAT 2006 (pp. 869–876). Berlin: Physica-Verlag.
Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651–666.
Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 263–323.
Johnson, R.A., & Wichern, D.W. (2007). Applied multivariate statistical analysis, 6th edn. New Jersey: Prentice-Hall.
Korenjak-Černe, S., Batagelj, V., & Pavešić, B.J. (2011). Clustering large data sets described with discrete distributions and its application on TIMSS data set. Statistical Analysis and Data Mining, 4, 199–215.
Košmelj, K., & Billard, L. (2012). Mallows’ L2 distance in some multivariate methods and its application to histogram-type data. Metodološki Zvezki, 9, 107–118.
Leroy, B., Chouakria, A., Herlin, I., & Diday, E. (1996). Approche géométrique et classification pour la reconnaissance de visage. Reconnaissance des Formes et Intelligence Artificielle, INRIA and IRISA and CNRS, France, 548–557.
Lima Neto, E.A., & de Carvalho, F.A.T. (2008). Centre and range method for fitting a linear regression model to symbolic interval data. Computational Statistics and Data Analysis, 52, 1500–1515.
Lima Neto, E.A., de Carvalho, F.A.T., & Freire, E.S. (2005). Applying constrained linear regression models to predict interval-valued data. In U. Furbach (Ed.) Lecture notes in computer science, KI: advances in artificial intelligence (pp. 92–106). Berlin: Springer.
Lima Neto, E.A., de Carvalho, F.A.T., & Tenorio, C.P. (2004). Univariate and multivariate linear regression methods to predict interval-valued features. In Lecture notes in computer science, AI 2004, advances in artificial intelligence (pp. 526–537). Berlin: Springer.
Liu, F. (2016). Cluster analysis for symbolic interval data using linear regression method. Doctoral Dissertation, University of Georgia.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L.M. LeCam & J. Neyman (Eds.) Proceedings of the 5th Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–299). Berkeley: University of California Press.
Noirhomme-Fraiture, M., & Brito, M.P. (2011). Far beyond the classical data models: Symbolic data analysis. Statistical Analysis and Data Mining, 4, 157–170.
Qian, G., & Wu, Y. (2011). Estimation and selection in regression clustering. European Journal of Pure and Applied Mathematics, 4, 455–466.
Rao, C.R., Wu, Y., & Shao, Q. (2007). An M-estimation-based procedure for determining the number of regression models in regression clustering. Journal of Applied Mathematics and Decision Sciences, Article ID 37475.
Shao, Q., & Wu, Y. (2005). A consistent procedure for determining the number of clusters in regression clustering. Journal of Statistical Planning and Inference, 135, 461–476.
Späth, H. (1979). Algorithm 39 clusterwise linear regression. Computing, 22, 367–373.
Späth, H. (1981). Correction to algorithm 39: clusterwise linear regression. Computing, 26, 275.
Späth, H. (1982). A fast algorithm for clusterwise linear regression. Computing, 29, 175–181.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.
Verde, R., & Irpino, A. (2007). Dynamic clustering of histogram data: Using the right metric. In P. Brito, P. Bertrand, G. Cucumel, & F. de Carvalho (Eds.) Selected contributions in data analysis and classification (pp. 123–134). Berlin: Springer.
Wedel, M., & Kistemaker, C. (1989). Consumer benefit segmentation using clusterwise linear regression. International Journal of Research in Marketing, 6, 45–59.
Xu, W. (2010). Symbolic data analysis: interval-valued data regression. Doctoral Dissertation, University of Georgia.
Zhang, B. (2003). Regression clustering. In X. Wu, A. Tuzhilin, & J. Shavlik (Eds.) Proceedings third IEEE international conference on data mining (pp. 451–458). California: IEEE Computer Society Publishers.
Cite this article
Liu, F., Billard, L. Partition of Interval-Valued Observations Using Regression. J Classif 39, 55–77 (2022). https://doi.org/10.1007/s00357-021-09394-5
DOI: https://doi.org/10.1007/s00357-021-09394-5