Partition of Interval-Valued Observations Using Regression

Journal of Classification

Abstract

Regression modeling and clustering have each been studied extensively as separate techniques. Some work has used regression-based algorithms to partition classical data into clusters; we propose one such algorithm for clustering interval-valued data. The new algorithm builds on the k-means algorithm of MacQueen (1967) and the dynamical partitioning method of Diday and Simon (1976), with a partitioning criterion based on fitting a regression model to each sub-cluster. The partitioning also depends on distance measures between the regression models fitted to the sub-clusters. Simulated data sets are generated for several different data structures. The proposed k-regressions algorithm consistently outperforms the k-means algorithm. Elbow plots are used to identify the total number of clusters K in the partition. The new method is also applied to real data.
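
The abstract gives no formulas, so the following is only a minimal sketch of the general clusterwise-regression idea it describes, not the authors' algorithm. It assumes a single interval-valued predictor and response, reduces each interval to its midpoint, fits ordinary least squares within each cluster, and reassigns each observation to the cluster whose fitted line gives the smallest squared residual. The paper's criterion also involves distances between regression models, which this sketch does not reproduce. The function name `k_regressions` and all of its parameters are hypothetical.

```python
import numpy as np


def k_regressions(x_lo, x_hi, y_lo, y_hi, k, n_iter=50, seed=0):
    """Illustrative clusterwise least squares on interval midpoints.

    ASSUMPTIONS (not taken from the paper): intervals are summarized by
    their midpoints, there is one predictor, and the reassignment rule is
    the squared residual under each cluster's fitted line.
    """
    rng = np.random.default_rng(seed)
    x = (np.asarray(x_lo, dtype=float) + np.asarray(x_hi, dtype=float)) / 2.0
    y = (np.asarray(y_lo, dtype=float) + np.asarray(y_hi, dtype=float)) / 2.0
    n = len(y)
    X = np.column_stack([np.ones(n), x])  # design matrix: intercept, slope
    labels = rng.integers(0, k, size=n)   # random initial partition

    for _ in range(n_iter):
        coefs = np.zeros((k, 2))
        for j in range(k):
            members = labels == j
            if members.sum() < 2:
                # Too few points to fit a line: refit on two random points.
                members = np.zeros(n, dtype=bool)
                members[rng.choice(n, size=2, replace=False)] = True
            coefs[j] = np.linalg.lstsq(X[members], y[members], rcond=None)[0]

        # Squared residual of every observation under every cluster's line.
        resid = (y[:, None] - X @ coefs.T) ** 2  # shape (n, k)
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # partition is stable
        labels = new_labels

    # Total squared residual under the last fitted models.
    sse = resid[np.arange(n), labels].sum()
    return labels, coefs, sse


if __name__ == "__main__":
    # Synthetic intervals scattered around two different lines.
    rng = np.random.default_rng(1)
    centre = rng.uniform(0, 10, 60)
    group = np.arange(60) % 2
    resp = np.where(group == 0, 1 + 2 * centre, 30 - 3 * centre)
    resp = resp + rng.normal(0, 0.1, 60)
    for K in (1, 2, 3):
        _, _, sse = k_regressions(centre - 0.5, centre + 0.5,
                                  resp - 0.5, resp + 0.5, k=K)
        print(K, sse)
```

Plotting the returned `sse` against K, as the loop at the bottom does in text form, is one simple way to draw the kind of elbow plot the abstract mentions. Because the initial partition is random, the procedure can stop at a local optimum, so several seeds are usually worth trying.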

References

  • Anderberg, M.R. (1973). Cluster analysis for applications. New York: Academic Press.

  • Batagelj, V., Kejžar, N., & Korenjak-Černe, S. (2015). Clustering of modal valued symbolic data. Machine Learning. arXiv:1507.06683.

  • Bertrand, P., & Goupil, F. (2000). Descriptive statistics for symbolic data. In H.-H. Bock & E. Diday (Eds.) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data (pp. 103–124). Berlin: Springer.

  • Billard, L. (2011). Brief overview of symbolic data and analytic issues. Statistical Analysis and Data Mining, 4, 149–156.

  • Billard, L. (2014). The past’s present is now. What will the present’s future bring? In X. Lin, C. Genest, D.L. Banks, G. Molenberghs, D.W. Scott, & J.-L. Wang (Eds.) Past, present, and future of statistical science (pp. 323–334). New York: Chapman and Hall.

  • Billard, L., & Diday, E. (2000). Regression analysis for interval-valued data. In H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, & M. Schader (Eds.) Data analysis, classification, and related methods (pp. 369–374). Springer.

  • Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: Symbolic data analysis. Journal of the American Statistical Association, 98, 470–487.

  • Billard, L., & Diday, E. (2006). Symbolic data analysis: conceptual statistics and data mining. Chichester: Wiley.

  • Bock, H.-H. (2007). Clustering methods: A history of k-means algorithms. In P. Brito, P. Bertrand, G. Cucumel, & F. de Carvalho (Eds.) Selected contributions in data analysis and classification (pp. 161–172). Berlin: Springer.

  • Bock, H.-H. (2008). Origins and extensions of the k-means algorithm in cluster analysis. Journal Électronique d’Histoire des Probabilités et de la Statistique, 4, 1–18.

  • Bock, H.-H., & Diday, E. (2000). Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data. Berlin: Springer.

  • Bougeard, S., Abdi, H., Saporta, G., & Niang, N. (2018). Clusterwise analysis for multiblock component methods. Advances in Data Analysis and Classification, 12, 285–313.

  • Bougeard, S., Cariou, V., Saporta, G., & Niang, N. (2017). Prediction for regularized clusterwise multiblock regression. Applied Stochastic Models in Business and Industry, 34, 852–867.

  • Brusco, M.J., Cradit, J.D., Steinley, D., & Fox, G.L. (2008). Cautionary remarks on the use of clusterwise regression. Multivariate Behavioral Research, 43, 29–49.

  • Charles, C. (1977). Régression typologique et reconnaissance des formes. Thèse de 3ème cycle, Université de Paris-Dauphine.

  • Chavent, M., & Lechevallier, Y. (2002). Dynamical clustering of interval data: Optimization of an adequacy criterion based on Hausdorff distance. In K. Jajuga, A. Sokolowski, & H.-H. Bock (Eds.) Classification, clustering, and data analysis (pp. 53–60). Berlin: Springer.

  • Cormack, R.M. (1971). A review of classification. Journal of the Royal Statistical Society A, 134, 321–367.

  • de Carvalho, F.A.T., Lima Neto, E.A., & Tenorio, C.P. (2004a). A new method to fit a linear regression model for interval-valued data. In Lecture notes in computer science, KI2004 advances in artificial intelligence (pp. 295–306). Springer.

  • de Carvalho, F.A.T., de Souza, R.M.C.R., & Silva, F.C.D. (2004b). A clustering method for symbolic interval-type data using adaptive Chebyshev distances. In A.L.C. Bazzan & S. Labidi (Eds.) LNAI 3171 (pp. 266–275). Berlin: Springer.

  • de Carvalho, F.A.T., Brito, M.P., & Bock, H.-H. (2006). Dynamic clustering for interval data based on L2 distance. Computational Statistics, 21, 231–250.

  • de Carvalho, F.A.T., & Lechevallier, Y. (2009). Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42, 1223–1236.

  • de Carvalho, F.A.T., Saporta, G., & Queiroz, D.N. (2010). A clusterwise center and range regression model for interval-valued data. In Y. Lechevallier & G. Saporta (Eds.) Proceedings in computational statistics COMPSTAT 2010 (pp. 461–468). Berlin: Springer.

  • DeSarbo, W.S., & Cron, W.L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5, 249–282.

  • de Souza, R.M.C.R., & de Carvalho, F.A.T. (2004). Clustering of interval data based on city-block distances. Pattern Recognition Letters, 25, 353–365.

  • de Souza, R.M.C.R., de Carvalho, F.A.T., Tenório, C.P., & Lechevallier, Y. (2004). Dynamic cluster methods for interval data based on Mahalanobis distances. In D. Banks, L. House, F.R. McMorris, P. Arabie, & W. Gaul (Eds.) Classification, clustering, and data analysis (pp. 251–360). Berlin: Springer.

  • Diday, E. (1971a). Une nouvelle méthode de classification automatique et reconnaissance des formes: la méthode des nuées dynamiques. Revue de Statistique Appliquée, 2, 19–33.

  • Diday, E. (1971b). La méthode des nuées dynamiques. Revue de Statistique Appliquée, 19, 19–34.

  • Diday, E. (1987). Introduction à l’approche symbolique en analyse des données. Premières Journées Symbolique-Numérique, CEREMADE, Université Paris-Dauphine, 21–56.

  • Diday, E. (2016). Thinking by classes in data science: The symbolic data analysis paradigm. WIREs Computational Statistics, 8, 172–205.

  • Diday, E., & Noirhomme-Fraiture, M. (2008). Symbolic data analysis and the SODAS software. Chichester: Wiley.

  • Diday, E., & Simon, J.C. (1976). Clustering analysis. In K.S. Fu (Ed.) Digital pattern recognition (pp. 47–94). Berlin: Springer.

  • Draper, N.R., & Smith, H. (1966). Applied regression analysis. New York: Wiley.

  • Hausdorff, F. (1937). Set theory (translated into English by J.R. Aumann, 1957). New York: Chelsea.

  • Irpino, A., Verde, R., & Lechevallier, Y. (2006). Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi & M. Vichi (Eds.) COMPSTAT 2006 (pp. 869–876). Berlin: Physica-Verlag.

  • Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651–666.

  • Jain, A.K., Murty, M.N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 263–323.

  • Johnson, R.A., & Wichern, D.W. (2007). Applied multivariate statistical analysis, 6th edn. New Jersey: Prentice-Hall.

  • Korenjak-Černe, S., Batagelj, V., & Pavešić, B.J. (2011). Clustering large data sets described with discrete distributions and its application on TIMSS data set. Statistical Analysis and Data Mining, 4, 199–215.

  • Košmelj, K., & Billard, L. (2012). Mallows’ L2 distance in some multivariate methods and its application to histogram-type data. Metodološki Zvezki, 9, 107–118.

  • Leroy, B., Chouakria, A., Herlin, I., & Diday, E. (1996). Approche géométrique et classification pour la reconnaissance de visage. Reconnaissance des Formes et Intelligence Artificielle, INRIA and IRISA and CNRS, France, 548–557.

  • Lima Neto, E.A., & de Carvalho, F.A.T. (2008). Centre and range method for fitting a linear regression model to symbolic interval data. Computational Statistics and Data Analysis, 52, 1500–1515.

  • Lima Neto, E.A., de Carvalho, F.A.T., & Freire, E.S. (2005). Applying constrained linear regression models to predict interval-valued data. In U. Furbach (Ed.) Lecture notes in computer science, KI: advances in artificial intelligence (pp. 92–106). Berlin: Springer.

  • Lima Neto, E.A., de Carvalho, F.A.T., & Tenorio, C.P. (2004). Univariate and multivariate linear regression methods to predict interval-valued features. In Lecture notes in computer science, AI 2004, advances in artificial intelligence (pp. 526–537). Berlin: Springer.

  • Liu, F. (2016). Cluster analysis for symbolic interval data using linear regression method. Doctoral Dissertation, University of Georgia.

  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L.M. LeCam & J. Neyman (Eds.) Proceedings of the 5th Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–299). Berkeley: University of California Press.

  • Noirhomme-Fraiture, M., & Brito, M.P. (2011). Far beyond the classical data models: Symbolic data analysis. Statistical Analysis and Data Mining, 4, 157–170.

  • Qian, G., & Wu, Y. (2011). Estimation and selection in regression clustering. European Journal of Pure and Applied Mathematics, 4, 455–466.

  • Rao, C.R., Wu, Y., & Shao, Q. (2007). An M-estimation-based procedure for determining the number of regression models in regression clustering. Journal of Applied Mathematics and Decision Sciences, Article ID 37475.

  • Shao, Q., & Wu, Y. (2005). A consistent procedure for determining the number of clusters in regression clustering. Journal of Statistical Planning and Inference, 135, 461–476.

  • Späth, H. (1979). Algorithm 39: Clusterwise linear regression. Computing, 22, 367–373.

  • Späth, H. (1981). Correction to algorithm 39: clusterwise linear regression. Computing, 26, 275.

  • Späth, H. (1982). A fast algorithm for clusterwise linear regression. Computing, 29, 175–181.

  • Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.

  • Verde, R., & Irpino, A. (2007). Dynamic clustering of histogram data: Using the right metric. In P. Brito, P. Bertrand, G. Cucumel, & F. de Carvalho (Eds.) Selected contributions in data analysis and classification (pp. 123–134). Berlin: Springer.

  • Wedel, M., & Kistemaker, C. (1989). Consumer benefit segmentation using clusterwise linear regression. International Journal of Research in Marketing, 6, 45–59.

  • Xu, W. (2010). Symbolic data analysis: interval-valued data regression. Doctoral Dissertation, University of Georgia.

  • Zhang, B. (2003). Regression clustering. In X. Wu, A. Tuzhilin, & J. Shavlik (Eds.) Proceedings third IEEE international conference on data mining (pp. 451–458). California: IEEE Computer Society Publishers.

Author information

Correspondence to Fei Liu.

About this article

Cite this article

Liu, F., Billard, L. Partition of Interval-Valued Observations Using Regression. J Classif 39, 55–77 (2022). https://doi.org/10.1007/s00357-021-09394-5

  • DOI: https://doi.org/10.1007/s00357-021-09394-5
