Abstract
This paper introduces a principal component methodology for analysing histogram-valued data within the symbolic data domain. Currently, no comparable method exists for this type of data. The proposed method uses a symbolic covariance matrix to determine the principal component space. The resulting observations in the principal component space are presented as polytopes for visualization. A numerical representation of the resulting polytopes via histogram-valued output is also presented, and the necessary algorithms are included. The technique is illustrated on a weather data set.
References
Anderson TW (1963) Asymptotic theory for principal components analysis. Ann Math Stat 34:122–148
Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. John Wiley, New York
Bertrand P, Goupil F (2000) Descriptive statistics for symbolic data. In: Bock H-H, Diday E (eds) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin, pp 103–124
Billard L (2008) Sample covariance functions for complex quantitative data. In: Mizuta M, Nakano J (eds) Proceedings World Congress, International Association of Statistical Computing. Japanese Society of Computational Statistics, Japan, pp 157–163
Billard L (2011) Brief overview of symbolic data and analytic issues. Stat Anal Data Min 4:149–156
Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98:470–487
Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. John Wiley, Chichester
Billard L, Guo JH, Xu W (2011) Maximum likelihood estimators for bivariate interval-valued data. Technical Report, University of Georgia, Athens, GA, under revision
Billard L, Le-Rademacher J (2013) Symbolic principal components for interval-valued data. Revue des Nouvelles Technologies de l’Information 25:31–40
Bock HH, Diday E (2000) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin
Cazes P (2002) Analyse Factorielle d’un Tableau de Lois de Probabilité. Rev Stat Appl 50:5–24
Cazes P, Chouakria A, Diday E, Schecktman Y (1997) Extension de l'analyse en composantes principales à des données de type intervalle. Rev Stat Appl 45:5–24
Chouakria A (1998) Extension des méthodes d'analyse factorielle à des données de type intervalle. Thèse de doctorat, Université Paris Dauphine, Paris
Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4:229–246
Ichino M (2011) The quantile method for symbolic principal component analysis. Stat Anal Data Min 4:184–198
Irpino A, Lauro C, Verde R (2003) Visualizing symbolic data by closed shapes. In: Schader M, Gaul W, Vichi M (eds) Between Data Science and Applied Data Analysis. Springer, Berlin. pp 244–251
Johnson RA, Wichern DW (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, New Jersey
Jolliffe IT (2004) Principal component analysis, 2nd edn. Springer, New York
Lauro NC, Palumbo F (2000) Principal component analysis of interval data: a symbolic data analysis approach. Comput Stat 15:73–87
Lauro NC, Verde R, Irpino A (2008) Principal component analysis of symbolic data described by intervals. In: Diday E, Noirhomme-Fraiture M (eds) Symbolic Data Analysis and the SODAS Software. Wiley, Chichester, pp 279–311
Le-Rademacher J (2008) Principal Component Analysis for Interval-Valued and Histogram-Valued Data and Likelihood Functions and Some Maximum Likelihood Estimators for Symbolic Data. Doctoral Dissertation. University of Georgia
Le-Rademacher J, Billard L (2011) Likelihood functions and some maximum likelihood estimators for symbolic data. J Stat Plan Inference 141:1593–1602
Le-Rademacher J, Billard L (2012) Symbolic-covariance principal component analysis and visualization for interval-valued data. J Comput Graph Stat 21:413–432
Le-Rademacher J, Billard L (2013) Principal component histograms from interval-valued observations. Comput Stat 28:2117–2138
Makosso-Kallyth S, Diday E (2012) Adaptation of interval PCA to symbolic histogram variables. Adv Data Anal Classif 6:147–159
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, New York
Palumbo F, Lauro NC (2003) A PCA for interval-valued data based on midpoints and radii. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman J (eds) New Developments in Psychometrics. Springer, Tokyo. pp 641–648
Shapiro AF (2009) Fuzzy random variables. Insur Math Econ 44:307–314
Xu W (2010) Symbolic Data Analysis: Interval-Valued Data Regression. PhD thesis, University of Georgia
Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23:421–427
Appendix: Algorithm
The algorithm to construct the polytope representation of the observations on principal component space has two main parts. The first part (“Constructing the matrix of vertices” in the Appendix) constructs the matrices of vertices needed to build the polytopes; the second (“Constructing the polytopes” in the Appendix) describes the construction of the polytopes themselves. Extensions to two- and three-dimensional polytope plots are given in “Constructing two and three dimensional plots” in the Appendix, and the algorithm to compute the histograms from the resulting polytopes is given in “Constructing the PC histograms” in the Appendix. The indexing notation used in these algorithms is similar to that of the R language: the position of an element of a vector, matrix, or array is specified in square brackets, [ ]. The index of an element of a vector is enclosed in the brackets. An element of a matrix is specified by two numbers separated by a comma, the first giving the row and the second the column. An element of an array is specified by three numbers separated by commas, corresponding to row, column, and matrix, respectively. Also, we use lower case for an observed data matrix [e.g., \(\mathbf{x}_i^v\)] to distinguish it from the random data matrix \(\mathbf{{X}}_i^v\) of Eq. (8).
1.1 Constructing the matrix of vertices
First, assume that the observed data vector \(\mathbf{x}_i\) has been separated into a vector of subinterval endpoints and a vector of relative frequencies. That is, let \(\mathbf{x}_{ep}\) be the vector of subinterval endpoints and \(\mathbf{x}_{rf}\) be the vector of subinterval relative frequencies. Then, \(\mathbf{x}_{ep}\) has \(\sum ^p_{j=1}{(s_{ij}+1)}\) elements and has the form
\[ \mathbf{x}_{ep} = (a_{i11}, \ldots , a_{i1,s_{i1}+1},\; a_{i21}, \ldots , a_{i2,s_{i2}+1},\; \ldots ,\; a_{ip1}, \ldots , a_{ip,s_{ip}+1}), \]
where \(a_{ijk}\), for \(k=1,\ldots , s_{ij}+1\) and \(j = 1, \ldots , p\), are elements of the set \(E_{ij}\). The vector \(\mathbf{x}_{rf}\) has \(\sum ^p_{j=1}{s_{ij}}\) elements and has the form
\[ \mathbf{x}_{rf} = (p_{i11}, \ldots , p_{i1s_{i1}},\; p_{i21}, \ldots , p_{i2s_{i2}},\; \ldots ,\; p_{ip1}, \ldots , p_{ips_{ip}}), \]
where \(p_{ijk}\) is the relative frequency of the \(k\)th subinterval of the observed histogram \(x_{ij}\). Before creating the matrix of vertices for observation i, a p-vector whose elements are the numbers of subintervals for \(X_{ij}\) is also needed. Let \(\mathbf{ns}\) denote this vector. Then,
\[ \mathbf{ns} = (s_{i1}, \ldots , s_{ip}). \]
With the information in \(\mathbf{x}_{ep}\), \(\mathbf{x}_{rf}\), and \(\mathbf{ns}\), we can proceed with constructing the matrix of vertices \(\mathbf{x}^v_i\) using the following five steps:
\(\underline{\mathrm{Step~1:}}\) Create a (\(p+1\))-vector \(\mathbf{nr}\) whose \((j+1)^{\mathrm{th}}\) element, for \(j=1,\ldots , p\), is the number of times that points \(a_{ijk}\), for \(k = 1, \ldots , s_{ij}+1\), must be repeated in Step 5 below. The first element of \(\mathbf{nr}\) is the number of rows of the matrix of observed vertices, \(\mathbf{x}^v_i\).
1. For \(j=1, \ldots , p\), set \(\mathbf{nr}[p-j+1] = \prod _{l=p-j+1}^{p}{(s_{il}+1)}\).
2. Set \(\mathbf{nr}[p+1] = 1\).
\(\underline{\mathrm{Step~2:}}\) Create a (\(p+1\))-vector \(\mathbf{nr}_p\) whose \((j+1)^{\mathrm{th}}\) element, for \(j=1,\ldots , p\), is the number of sub-hyperrectangles present in observation i when all variables up to j are excluded.
1. For \(j=1, \ldots , p\), set \(\mathbf{nr}_p[p-j+1] = \prod _{l=p-j+1}^{p}{s_{il}}\).
2. Set \(\mathbf{nr}_p[p+1] = 1\).
\(\underline{\mathrm{Step~3:}}\) Create a p-vector \(\mathbf{sp}\) whose \(j\)th element is the position in \(\mathbf{x}_{ep}\) of the first subinterval endpoint for variable j.
1. Set \(\mathbf{sp}[1] = 1\).
2. For \(j=1, \ldots , p-1\), set \(\mathbf{sp}[j+1] = \sum _{l=1}^j{(s_{il}+1)}+1\).
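Steps 1–3 can be sketched in a few lines of Python; this is an illustrative translation (the function name is ours, the 1-based vector values are stored in 0-based lists, and the start positions \(\mathbf{sp}\) are written directly as cumulative sums):

```python
def helper_vectors(s):
    """Steps 1-3 for one observation: s = [s_i1, ..., s_ip] holds the
    number of subintervals of each of the p variables.  Returns the
    repetition counts nr, the sub-hyperrectangle counts nr_p, and the
    start positions sp (1-based values, stored in 0-based lists)."""
    p = len(s)
    nr = [1] * (p + 1)       # nr[p] = 1 (the appendix's nr[p+1])
    nr_p = [1] * (p + 1)
    for j in range(p - 1, -1, -1):
        nr[j] = nr[j + 1] * (s[j] + 1)   # product of (s_il + 1), l = j..p
        nr_p[j] = nr_p[j + 1] * s[j]     # product of s_il,       l = j..p
    sp = [1]                             # first endpoint of variable 1
    for j in range(p - 1):
        sp.append(sp[-1] + s[j] + 1)     # skip the s_ij + 1 endpoints of var j
    return nr, nr_p, sp
```

For \(p=2\) with \(s_{i1}=2\) and \(s_{i2}=1\), this gives \(\mathbf{nr}=(6,2,1)\), \(\mathbf{nr}_p=(2,1,1)\), and \(\mathbf{sp}=(1,4)\).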
\(\underline{\mathrm{Step~4:}}\) Initialize the matrix of observed vertices \(\mathbf{x}^v_i\) by letting \(\mathbf{x}^v_i\) be an (\(N_i \times p\)) matrix of zeros where \(N_i=\prod _{j=1}^p{(s_{ij}+1)}\).
\(\underline{\mathrm{Step~5:}}\) Update the elements of \(\mathbf{x}^v_i\) by
1. For \(j=1, \ldots , p\), do
(a) Let \(nj = \mathbf{ns}[j]\).
(b) Let \(rj = \mathbf{nr}[j+1]\).
(c) Let \(sj = \mathbf{sp}[j]\).
(d) For \(l= 0,\ldots , nj\), for \(k = 1, \ldots , rj\), set \(\mathbf{x}^v_i[l(rj) + k,j]=\mathbf{x}_{ep}[sj+l]\).
2. For \(j=2, \ldots , p\), do
(a) Let \(tj = \frac{\mathbf{nr}[1]}{\mathbf{nr}[j]}-1\).
(b) Let \(rj = \mathbf{nr}[j]\).
(c) For \(l= 1,\ldots , tj\), for \(k = 1, \ldots , rj\), set \(\mathbf{x}^v_i[l(rj) + k,j]=\mathbf{x}^v_i[k,j]\).
End of Step 5. At the end of Step 5, we obtain the matrix \(\mathbf{x}^v_i\) whose rows are the coordinates of the vertices of observation i.
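Steps 4–5 fill \(\mathbf{x}^v_i\) so that later variables cycle fastest. As a cross-check, the same matrix is the Cartesian product of the per-variable endpoint sets; a minimal Python sketch (the helper name is ours, and itertools.product stands in for the explicit repetition counts \(\mathbf{nr}\)):

```python
from itertools import product

def vertex_matrix(endpoints):
    """Rows are the vertex coordinates of one histogram-valued observation.
    `endpoints` is a list of p lists; the j-th holds the s_ij + 1
    subinterval endpoints of variable j.  The Cartesian product, with the
    last variable varying fastest, reproduces the row order of Step 5."""
    return [list(v) for v in product(*endpoints)]

# p = 2 variables with s_i1 = 2 and s_i2 = 1 subintervals:
verts = vertex_matrix([[0.0, 1.0, 2.0], [10.0, 20.0]])
# N_i = (2 + 1)(1 + 1) = 6 vertices
```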
1.2 Constructing the polytopes
The following algorithm includes seven steps:
\(\underline{\mathrm{Step~1:}}\) First, compute the matrix of transformed vertices, \(\mathbf{y}_i^v (=\mathbf{x}^v_i \mathbf{e})\), for the polytope representing observation i in a principal components space.
\(\underline{\mathrm{Step~2:}}\) Next, create a three-dimensional array \(\mathbf{y}_i\) to store together the transformed vertices that belong to the same sub-polytope. The array \(\mathbf{y}_i\) is the result of combining \(r_i= \prod ^p_{j=1}{s_{ij}}\) matrices \(\mathbf{y}^h_i\), \(h = 1, \ldots , r_i\). Each matrix \(\mathbf{y}^h_i\), of dimension (\(2^p \times p\)), contains the coordinates of all vertices that belong to sub-polytope h.
1. Initialize \(\mathbf{y}_i\) as an array of zeros with dimension (\(2^p \times p \times r_i\)).
2. Update the elements of \(\mathbf{y}_i\) by running the following nested loop:
(a) Set \(kr_0 = 0\) and \(ni_0 = 0\).
(b) For \(j = 1, \ldots , p-1\), for \(l_j = 0, \ldots , s_{ij}\),
i. Let \(kr_j = kr_{j-1}+(\mathbf{nr}[j+1])l_j\).
ii. Let \(ni_j = ni_{j-1}+(\mathbf{nr}_p[j+1])l_j\).
iii. For \(k = 1, \ldots , \mathbf{ns}[p]\),
A. Set \(kr = kr_{p-1} + k\).
B. Set \(ni = ni_{p-1} + k\).
C. Set \(\mathbf{y}_i[1,,ni]=\mathbf{y}_i^v[kr,]\).
D. For \(o = 1, \ldots , p\) and \(r = 1, \ldots , 2^{(o-1)}\), set \(\mathbf{y}_i[2^{(o-1)}+r,,ni] = \mathbf{y}_i^v[kr[r] + \mathbf{nr}[p-o+2],]\) and set \(kr = (kr,kr[r] + \mathbf{nr}[p-o+2])\).
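The bookkeeping in Step 2 groups \(2^p\) vertex rows per sub-hyperrectangle. An equivalent sketch in Python (function name ours; working on the original, untransformed endpoints) builds each group directly by choosing one subinterval per variable:

```python
from itertools import product

def sub_polytope_vertices(endpoints):
    """For each sub-hyperrectangle h (one subinterval chosen per variable),
    return its 2^p corner coordinates -- the rows grouped together in the
    array y_i of Step 2, before projection onto the PC space."""
    groups = []
    # choose one subinterval (a pair of adjacent endpoints) per variable
    for choice in product(*(range(len(e) - 1) for e in endpoints)):
        corners = product(*((e[c], e[c + 1]) for e, c in zip(endpoints, choice)))
        groups.append([list(v) for v in corners])
    return groups

groups = sub_polytope_vertices([[0.0, 1.0, 2.0], [10.0, 20.0]])
# r_i = 2 * 1 = 2 sub-hyperrectangles, each with 2^2 = 4 corners
```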
\(\underline{\mathrm{Step~3:}}\) Next, reconstruct polytopes corresponding to sub-hyperrectangles of observation i by following the next two sub-steps.
\(\underline{\mathrm{Step~3-A:}}\) Construct the matrix of connected vertices \(\mathbf{C}\) associated with \(\mathbf{y}_i^v\) as follows:
1. Initialize \(\mathbf{C}\) as a (\(2^p \times p\)) matrix of zeros.
2. Update \(\mathbf{C}\) by doing the following for \(j = 1, \dots ,p\) and \(j_1 = 0, \dots , 2^{(j-1)}-1\):
(a) for \(j_2 = ((2^{(p-j+1)})j_1 + 1),\dots ,((2^{(p-j+1)})j_1 + 2^{(p-j)})\), set \(\mathbf{C}[j_2,j] = j_2 + 2^{(p-j)}\);
(b) for \(j_2 = ((2^{(p-j+1)})j_1 + 2^{(p-j)} + 1),\dots ,((2^{(p-j+1)})j_1 + 2^{(p-j+1)})\), set \(\mathbf{C}[j_2,j] = j_2 - 2^{(p-j)}\).
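Step 3-A can be checked with a direct translation; a sketch (function name ours; entries are 1-based as in the appendix, stored in 0-based Python lists):

```python
def connected_vertices(p):
    """Matrix C of Step 3-A: entry (v, j) is the 1-based index of the vertex
    joined to vertex v by an edge parallel to axis j of the hypercube."""
    C = [[0] * p for _ in range(2 ** p)]
    for j in range(1, p + 1):
        half = 2 ** (p - j)                    # the offset 2^(p-j)
        for j1 in range(2 ** (j - 1)):
            base = 2 ** (p - j + 1) * j1
            for j2 in range(base + 1, base + half + 1):
                C[j2 - 1][j - 1] = j2 + half   # first half of the block
            for j2 in range(base + half + 1, base + 2 * half + 1):
                C[j2 - 1][j - 1] = j2 - half   # second half of the block
    return C

# p = 2: the four vertices of a rectangle, each connected to two neighbours
C2 = connected_vertices(2)
```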
\(\underline{\mathrm{Step~3-B:}}\) A p-dimensional plot of the polytopes is constructed in the principal component space by the following two steps:
1. Make a scatter plot of \(\mathbf{y}_i^v\).
2. Construct the edges of each sub-polytope as follows: for each \(h = 1,\dots ,r_i\), for \(v_1 = 1,\dots ,2^p\), and for \(j_1 = 1,\dots ,p\), set \(v_2 = \mathbf{C}[v_1,j_1]\) and connect the points \(\mathbf{y}_i[v_1,,h]\) and \(\mathbf{y}_i[v_2,,h]\) with a line.
End of Step 3. We now have a plot of the \(r_i\) polytopes representing observation \(\mathbf{x}_i\), \(i = 1,\dots ,n,\) in PC space. This step is an adaptation of Steps 3–4 for obtaining the polytope for interval-valued data; see Le-Rademacher (2008) and the Supplemental Material of Le-Rademacher and Billard (2012).
At the end of Step 3, polytopes representing observation i in a principal component space are plotted. While these polytopes are now constructed, we recall that the densities of a histogram observation vary across the hyperrectangles. To create the vector of densities for these polytopes, follow the next four steps.
\(\underline{\mathrm{Step~4:}}\) Create a p-vector \(\mathbf{sp}_p\) whose \(j\)th element is the position in \(\mathbf{x}_{rf}\) of the first subinterval relative frequency for variable j.
1. Set \(\mathbf{sp}_p[1] = 1\).
2. For \(j=1, \ldots , p-1\), set \(\mathbf{sp}_p[j+1] = \sum _{l=1}^j{s_{il}}+1\).
\(\underline{\mathrm{Step~5:}}\) Let \(\mathbf{x}^v_p\) be an (\(r_i \times p\)) matrix of relative frequencies. Row h of \(\mathbf{x}^v_p\) contains the relative frequencies of the subintervals making up sub-hyperrectangle h. Initialize \(\mathbf{x}^v_p\) by setting all its elements to zero.
\(\underline{\mathrm{Step~6:}}\) Update the elements of \(\mathbf{x}^v_p\) by
1. For \(j=1, \ldots , p\), do
(a) Let \(nj = \mathbf{ns}[j]\).
(b) Let \(rj = \mathbf{nr}_p[j+1]\).
(c) Let \(sj = \mathbf{sp}_p[j]\).
(d) For \(l= 0,\ldots , nj-1\), for \(k = 1, \ldots , rj\), set \(\mathbf{x}^v_p[l(rj) + k,j]=\mathbf{x}_{rf}[sj+l]\).
2. For \(j=2, \ldots , p\), do
(a) Let \(tj = \frac{\mathbf{nr}_p[1]}{\mathbf{nr}_p[j]}-1\).
(b) Let \(rj = \mathbf{nr}_p[j]\).
(c) For \(l= 1,\ldots , tj\), for \(k = 1, \ldots , rj\), set \(\mathbf{x}^v_p[l(rj) + k,j]=\mathbf{x}^v_p[k,j]\).
\(\underline{\mathrm{Step~7:}}\) Let \(\mathbf{d}_i\) be an \(r_i\)-vector whose elements are densities of the sub-hyperrectangles belonging to observation i. The density for each sub-hyperrectangle is the product of relative frequencies of the p subintervals making up that sub-hyperrectangle. That is, for \(h=1, \ldots , r_i\), \(\mathbf{d}_i[h] = \prod ^p_{j=1}{\mathbf{x}^v_p[h,j]}\).
At the end of Step 7, we obtain a vector of densities \(\mathbf{d}_i\) whose h th element is the density of sub-hyperrectangle h of observation i.
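Steps 4–7 reduce to forming all products of one relative frequency per variable, in the same enumeration order as the sub-hyperrectangles. A brief sketch (function name ours):

```python
from itertools import product
from math import prod

def sub_rectangle_densities(rel_freqs):
    """Vector d_i of Step 7: the density of each sub-hyperrectangle is the
    product of the p subinterval relative frequencies forming it.
    `rel_freqs` is a list of p lists; the j-th holds the s_ij relative
    frequencies of variable j (each list sums to 1)."""
    return [prod(c) for c in product(*rel_freqs)]

d = sub_rectangle_densities([[0.25, 0.75], [0.4, 0.6]])
# the r_i densities sum to 1, since each variable's frequencies sum to 1
```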
1.3 Constructing two and three dimensional plots
Usually, visualization of the projections of observations onto the principal component space is limited to two dimensions, \(PC_{\nu _1} \times PC_{\nu _2}\). This is achieved by replacing substeps 1 and 2 in Step 3-B of the polytope algorithm of “Constructing the polytopes” by the following three substeps:
1. Let \(\mathbf{y}^{(2)}_i\) be the (\(r_i2^p \times 2\)) matrix whose first and second columns are, respectively, columns \(\nu _1\) and \(\nu _2\) of \(\mathbf{y}_i^v\).
2. Make a scatter plot of \(\mathbf{y}^{(2)}_i\).
3. Connect corresponding points of \(\mathbf{y}^{(2)}_i\) by using substep 2 of Step 3-B in “Constructing the polytopes”, with \(\mathbf{y}_i^v\) replaced by \(\mathbf{y}^{(2)}_i\); now \(p=2\).
To construct a three-dimensional plot of \(PC_{\nu _1} \times PC_{\nu _2} \times PC_{\nu _3}\), follow the same three substeps as for the two-dimensional plot, except that \(\mathbf{y}^{(2)}_i\) is replaced by \(\mathbf{y}^{(3)}_i\), where now, in substep 1, \(\mathbf{y}^{(3)}_i\) is an (\(r_i2^p \times 3\)) matrix with columns \(\nu _1\), \(\nu _2\), and \(\nu _3\) of \(\mathbf{y}_i^v\). In substep 3, \(p=3\).
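For the two- or three-dimensional plot, substep 3 reduces to drawing, for each sub-polytope, the edges encoded in \(\mathbf{C}\). A plotting-library-agnostic sketch (function name ours) that simply lists the segments to draw:

```python
def edge_segments(y2, C):
    """Substeps 2-3 above for one sub-polytope: y2 holds its 2^p projected
    vertices (rows), C is the connected-vertices matrix of Step 3-A
    (1-based entries).  Returns each edge once, as a pair of point rows."""
    segments = []
    for v1, row in enumerate(C, start=1):
        for v2 in row:
            if v1 < v2:                  # avoid drawing each edge twice
                segments.append((y2[v1 - 1], y2[v2 - 1]))
    return segments

# unit square in the PC_1 x PC_2 plane, with C for p = 2
square = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
segs = edge_segments(square, [[3, 2], [4, 1], [1, 4], [2, 3]])
```

Each returned pair can be passed directly to a line-drawing routine of the chosen plotting library.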
1.4 Constructing the PC histograms
The following algorithm constructs a histogram representing the \(\nu ^{th}\) principal component for observation i by first computing the PC histograms corresponding to the sub-polytopes of observation i, then combining the \(r_i\) histograms into one \({PC}_{\nu }\) histogram representing observation i.
\(\underline{\mathrm{Step~1:}}\) Follow the algorithm of Le-Rademacher and Billard (2013) to create the (\(r_i \times 3s\)) matrix \(\mathbf {z}_{i \nu }\) whose h th row contains the subinterval endpoints and the relative frequencies for sub-polytope h as specified in Eq. (14). Here, elements \(3k\), for \(k=1, \ldots , s\), of \(\mathbf{{z}}_{i \nu }[h,]\) are the unadjusted relative frequencies of the histogram representing sub-polytope h.
\(\underline{\mathrm{Step~2:}}\) Update the relative frequencies from Step 1 by setting \(\mathbf{{z}}_{i \nu }[h, 3k] = \mathbf{{d}}_i[h]\mathbf{{z}}_{i \nu }[h, 3k]\), for \(h=1, \ldots , r_i\) and \(k=1, \ldots , s\).
\(\underline{\mathrm{Step~3:}}\) This next step combines the \(r_i\) histograms in \(\mathbf{{z}}_{i \nu }\) into one histogram with subintervals of equal width.
1. Let lo and hi be the lowest and highest endpoints of the \(r_i\) histograms of observation i. Then, \(lo = \min (\mathbf{{z}}_{i \nu }[,1])\) and \(hi = \max (\mathbf{{z}}_{i \nu }[,3s-1])\).
2. Let sn denote the desired number of subintervals for the combined histogram. Then, the width of each subinterval is \(sw = (hi - lo)/sn\).
3. Let \(\mathbf {hm}\) be an (\(sn \times 3\)) transition matrix whose columns 1 and 2 contain the subinterval endpoints and whose column 3 contains the relative frequencies of the combined \({PC}_{\nu }\) histogram for observation i. Initialize \(\mathbf {hm}\) by setting its elements to zero.
4. Update \(\mathbf {hm}\) as follows. For \(t = 1, \ldots , sn\), do
(a) Set the endpoints of subinterval t by letting \(\mathbf{{hm}}[t,1] = lo + (sw)(t-1)\) and \(\mathbf{{hm}}[t,2] = lo + (sw)t\).
(b) Let \(\mathbf {fr}\) be an (\(r_i \times s\)) matrix whose (h, q) element is the proportion of the relative frequency of subinterval q of sub-polytope h that falls within subinterval t. Initialize \(\mathbf {fr}\) by setting its elements to zero. Then, for \(h = 1,\ldots , r_i\) and \(q = 1,\ldots , s\):
\(\underline{\mathrm{Case~a:}}\) If \(\mathbf{{z}}_{i \nu }[h,3q-2] \ge \mathbf{{hm}}[t,1]\) and \(\mathbf{{z}}_{i \nu }[h, 3q-1] \le \mathbf{{hm}}[t,2]\), set \(\mathbf{{fr}}[h,q] = \mathbf{{z}}_{i \nu }[h,3q]\).
\(\underline{\mathrm{Case~b:}}\) If \(\mathbf{{z}}_{i \nu }[h,3q-2] \ge \mathbf{{hm}}[t,1]\) and \(\mathbf{{z}}_{i \nu }[h, 3q-2] < \mathbf{{hm}}[t,2]\) and \(\mathbf{{z}}_{i \nu }[h, 3q-1] > \mathbf{{hm}}[t,2]\), set \(\mathbf{{fr}}[h,q] = \frac{(\mathbf{{z}}_{i \nu }[h,3q])(\mathbf{{hm}}[t,2]-\mathbf{{z}}_{i \nu }[h,3q-2])}{\mathbf{{z}}_{i \nu }[h,3q-1]-\mathbf{{z}}_{i \nu }[h,3q-2]}\).
\(\underline{\mathrm{Case~c:}}\) If \(\mathbf{{z}}_{i \nu }[h,3q-2] < \mathbf{{hm}}[t,1]\) and \(\mathbf{{z}}_{i \nu }[h, 3q-1] > \mathbf{{hm}}[t,1]\) and \(\mathbf{{z}}_{i \nu }[h, 3q-1] \le \mathbf{{hm}}[t,2]\), set \(\mathbf{{fr}}[h,q] = \frac{(\mathbf{{z}}_{i \nu }[h,3q])(\mathbf{{z}}_{i \nu }[h,3q-1]-\mathbf{{hm}}[t,1])}{\mathbf{{z}}_{i \nu }[h,3q-1]-\mathbf{{z}}_{i \nu }[h,3q-2]}\).
\(\underline{\mathrm{Case~d:}}\) If \(\mathbf{{z}}_{i \nu }[h,3q-2] < \mathbf{{hm}}[t,1]\) and \(\mathbf{{z}}_{i \nu }[h, 3q-1] > \mathbf{{hm}}[t,2]\), set \(\mathbf{{fr}}[h,q] = \frac{(\mathbf{{z}}_{i \nu }[h,3q])(\mathbf{{hm}}[t,2]-\mathbf{{hm}}[t,1])}{\mathbf{{z}}_{i \nu }[h,3q-1]-\mathbf{{z}}_{i \nu }[h,3q-2]}\).
(c) Let \(\mathbf{{hm}}[t,3] = \sum _{h=1}^{r_i}{\sum _{q=1}^s{\mathbf{{fr}}[h,q]}}\).
5. Let \(sh = \sum ^{sn}_{t=1}{\mathbf{{hm}}[t,3]}\).
6. For \(t = 1, \ldots , sn\), update \(\mathbf{{hm}}[t,3] =\mathbf{{hm}}[t,3]/sh\).
At the end of this step, we have the subinterval endpoints and the relative frequencies for the combined histogram. Let \(\mathbf {pc}_{\nu }\) be the (\(n \times 3sn\)) matrix whose i th row contains the \({PC}_{\nu }\) histogram for observation i. Then, for \(t = 1,\ldots , sn\), do
1. Let \(\mathbf{{pc}}_{\nu }[i,3t-2] = \mathbf{{hm}}[t, 1]\).
2. Let \(\mathbf{{pc}}_{\nu }[i,3t-1] = \mathbf{{hm}}[t, 2]\).
3. Let \(\mathbf{{pc}}_{\nu }[i,3t] = \mathbf{{hm}}[t, 3]\).
This step concludes the histogram algorithm. Repeat these steps for all observations.
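The case analysis in Step 3 computes, for each target bin, the overlap of each weighted subinterval with that bin; since mass is uniform within a subinterval, Cases a–d collapse to a single overlap-fraction formula. A sketch of the whole combination step (function and variable names ours):

```python
def combine_histograms(subhists, densities, sn):
    """Merge the r_i weighted sub-polytope histograms into one PC_nu
    histogram with sn equal-width subintervals (Steps 2-3 above).
    Each sub-histogram is a list of (a, b, rel_freq) triples sorted by a;
    `densities` is the vector d_i weighting each sub-histogram."""
    lo = min(h[0][0] for h in subhists)
    hi = max(h[-1][1] for h in subhists)
    sw = (hi - lo) / sn
    freqs = [0.0] * sn
    for t in range(sn):
        blo, bhi = lo + sw * t, lo + sw * (t + 1)
        for d, hist in zip(densities, subhists):
            for a, b, p in hist:
                # fraction of subinterval [a, b] lying inside bin [blo, bhi]
                overlap = max(0.0, min(b, bhi) - max(a, blo))
                freqs[t] += d * p * overlap / (b - a)
    total = sum(freqs)   # renormalize, as in substeps 5-6
    return [(lo + sw * t, lo + sw * (t + 1), f / total)
            for t, f in enumerate(freqs)]

# two sub-histograms, equally weighted, merged into sn = 3 bins
pc_hist = combine_histograms([[(0.0, 1.0, 1.0)], [(1.0, 3.0, 1.0)]],
                             [0.5, 0.5], sn=3)
```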
Le-Rademacher, J., Billard, L. Principal component analysis for histogram-valued data. Adv Data Anal Classif 11, 327–351 (2017). https://doi.org/10.1007/s11634-016-0255-9