Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference F-type test

  • Original Paper
Computational Statistics

Abstract

The problem of testing the equality of mean vectors for high-dimensional data has been intensively investigated in the literature. However, most of the existing tests impose strong assumptions on the underlying group covariance matrices that may not be satisfied, or can hardly be checked, in practice. In this article, an F-type test for two-sample Behrens–Fisher problems for high-dimensional data is proposed and studied. When the two samples are normally distributed and the null hypothesis is valid, the proposed F-type test statistic is shown to be an F-type mixture, that is, a ratio of two independent \(\chi ^2\)-type mixtures. Under some regularity conditions and the null hypothesis, it is shown that the proposed F-type test statistic and the above F-type mixture have the same normal and non-normal limits. It is then justified to approximate the null distribution of the proposed F-type test statistic by that of the F-type mixture, resulting in the so-called normal reference F-type test. Since the F-type mixture is a ratio of two independent \(\chi ^2\)-type mixtures, we apply the Welch–Satterthwaite \(\chi ^2\)-approximation to the distributions of the numerator and the denominator of the F-type mixture separately, resulting in an approximating F-distribution whose degrees of freedom can be consistently estimated from the data. The asymptotic power of the proposed F-type test is established. Two simulation studies show that, in terms of size control, the proposed F-type test outperforms two existing competitors. The good performance of the proposed F-type test is also illustrated by a COVID-19 data example.
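
As a rough illustration of the two-moment matching behind the Welch–Satterthwaite \(\chi ^2\)-approximation mentioned above, the sketch below approximates a \(\chi ^2\)-type mixture \(\sum _r\lambda _rA_r\), with the \(A_r\) independent \(\chi ^2_1\) variables, by \(\beta \chi ^2_d\). The function name and the example weights are hypothetical and are not taken from the article; the article's own data-based estimators of the degrees of freedom are not reproduced here.

```python
def welch_satterthwaite(weights):
    """Approximate sum_r w_r * chi2_1 by beta * chi2_d via two-moment matching.

    Mixture mean: sum(w); mixture variance: 2 * sum(w^2).
    Mean of beta * chi2_d: beta * d; variance: 2 * beta^2 * d.
    Solving the two equations gives beta and d below.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    s1 = sum(weights)                  # sum of the weights
    s2 = sum(w * w for w in weights)   # sum of the squared weights
    if s1 == 0:
        raise ValueError("at least one weight must be positive")
    beta = s2 / s1
    d = s1 * s1 / s2
    return beta, d


# Hypothetical weights: one dominant component and two small ones.
print(welch_satterthwaite([4.0, 1.0, 1.0]))
```

With equal weights the match is exact and `d` equals the number of terms; unequal weights give a smaller `d`. Applying the same matching to the numerator and to the denominator of an F-type mixture yields the two degrees of freedom of the approximating F-distribution.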

References

  • Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley series in probability and statistics. Wiley, Hoboken

  • Bai ZD, Saranadasa H (1996) Effect of high dimension: by an example of a two sample problem. Stat Sin 6(2):311–329

  • Chen SX, Qin Y-L (2010) A two-sample test for high-dimensional data with applications to gene-set testing. Ann Stat 38(2):808–835

  • Dempster AP (1958) A high dimensional two sample significance test. Ann Math Stat 29(4):995–1010

  • Dempster AP (1960) A significance test for the separation of two highly multivariate small samples. Biometrics 16(1):41–50

  • Fisher RA (1935) The fiducial argument in statistical inference. Ann Eugen 6(4):391–398

  • Fisher RA (1939) The comparison of samples with possibly unequal variances. Ann Eugen 9(2):174–180

  • James G (1954) Tests of linear hypotheses in univariate and multivariate analysis when the ratios of the population variances are unknown. Biometrika 41(1/2):19–43

  • Johansen S (1980) The Welch-James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika 67(1):85–92

  • Liu X, Guo J, Zhou B, Zhang J-T (2016) Two simple tests for heteroscedastic two-way ANOVA. Stat Res Lett 5(6):6–16

  • Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biom Bull 2(6):110–114

  • Scheffé H (1970) Practical solutions of the Behrens-Fisher problem. J Am Stat Assoc 65(332):1501–1508

  • Srivastava MS, Fujikoshi Y (2006) Multivariate analysis of variance with fewer observations than the dimension. J Multivar Anal 97(9):1927–1940. https://doi.org/10.1016/j.jmva.2005.08.010

  • Tang S, Tsui K-W (2007) Distributional properties for the generalized p-value for the Behrens-Fisher problem. Stat Probab Lett 77(1):1–8. https://doi.org/10.1016/j.spl.2006.05.005

  • Thair SA, He YD, Hasin-Brumshtein Y, Sakaram S, Pandya R, Toh J, Rawling D, Remmel M, Coyle S, Dalekos GN (2021) Transcriptomic similarities and differences in host response between SARS-CoV-2 and other viral infections. iScience 24(1):101947

  • Welch BL (1947) The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34(1/2):28–35

  • Yao Y (1965) An approximate degrees of freedom solution to the multivariate Behrens-Fisher problem. Biometrika 52(1/2):139–147

  • Zhang J-T (2005) Approximate and asymptotic distributions of chi-squared-type mixtures with applications. J Am Stat Assoc 100(469):273–285

  • Zhang J-T (2011) Two-way MANOVA with unequal cell sizes and unequal cell covariance matrices. Technometrics 53(4):426–439

  • Zhang J-T (2012) An approximate Hotelling \(T^2\)-test for heteroscedastic one-way MANOVA. Open J Stat 2(1):1–11

  • Zhang J-T (2013) Analysis of variance for functional data. Chapman and Hall/CRC, New York

  • Zhang J-T (2013) Tests of linear hypotheses in the ANOVA under heteroscedasticity. Int J Adv Stat Probab 1(2):9–24

  • Zhang J-T, Zhu T (2022) A further study on Chen-Qin’s test for two-sample Behrens-Fisher problems for high-dimensional data. J Stat Theory Pract 16(1):1

  • Zhang J-T, Guo J, Zhou B, Liu X (2016) A modified Bartlett test for heteroscedastic two-way MANOVA. J Adv Stat 1(2):94–108

  • Zhang J-T, Guo J, Zhou B, Cheng M-Y (2020) A simple two-sample test in high dimensions based on \(L^2\)-norm. J Am Stat Assoc 115(530):1011–1027

  • Zhang J-T, Zhou B, Guo J, Zhu T (2021) Two-sample Behrens-Fisher problems for high-dimensional data: a normal reference approach. J Stat Plan Inference 213:142–161

Acknowledgements

The research of Zhang and Zhu was partially supported by the National University of Singapore academic research grant (22-5699-A0001) and the National Institute of Education (NIE) start-up grant (NIE-SUG 6-22 ZTM), respectively. The authors thank the Editor in Chief and the anonymous reviewers for their constructive comments and suggestions, which helped improve the article substantially.

Author information

Corresponding author

Correspondence to Tianming Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (zip 3 KB)

Appendix A. Technical proofs

Proof of Theorem 1

We shall apply Theorems 1 and 2 of Zhang et al. (2021) for the proof of this theorem. Notice that under Conditions C1–C3, by applying Theorem 2 of Zhang et al. (2021), \({\text {tr}}(\hat{{{\varvec{\Sigma }}}}_n)\) is ratio-consistent for \({\text {tr}}({{\varvec{\Sigma }}}_n)\) uniformly for all p. We can write

$$\begin{aligned} F_{n,p,0}=\frac{T_{n,p,0}}{{\text {tr}}({{\varvec{\Sigma }}}_n)}[1+o_p(1)],\; \text{ and } \;\tilde{F}_{n,p,0}=\frac{T_{n,p,0}-{\text {tr}}({{\varvec{\Sigma }}}_n)}{\sqrt{2{\text {tr}}({{\varvec{\Sigma }}}_n^2)}}[1+o_p(1)]. \end{aligned}$$

From (13), we have \({\text {E}}(S_{n,p,0}^*)={\text {tr}}({{\varvec{\Sigma }}}_n)\) and

$$\begin{aligned} {\text {Var}}\left[ S_{n,p,0}^*/{\text {tr}}({{\varvec{\Sigma }}}_n)\right] =\frac{2[n_2^{2}(n_1-1)^{-1}{\text {tr}}({{\varvec{\Sigma }}}_1^2)+n_1^{2}(n_2-1)^{-1}{\text {tr}}({{\varvec{\Sigma }}}_2^2)]}{n_2^{2}{\text {tr}}^2({{\varvec{\Sigma }}}_1)+n_1^{2}{\text {tr}}^2({{\varvec{\Sigma }}}_2)+2n_1n_2{\text {tr}}({{\varvec{\Sigma }}}_1){\text {tr}}({{\varvec{\Sigma }}}_2)}. \end{aligned}$$

Under Condition C3, as \(n\rightarrow \infty\), \({\text {Var}}\left[ S_{n,p,0}^*/{\text {tr}}({{\varvec{\Sigma }}}_n)\right] \rightarrow 0\) uniformly for all p. That is, \(S_{n,p,0}^*/{\text {tr}}({{\varvec{\Sigma }}}_n){\mathop {\longrightarrow }\limits ^{P}}1\) uniformly for all p. Therefore, we can write
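
The displayed variance can be evaluated numerically to see how it shrinks as the sample sizes grow. The sketch below is only an illustration: the function name, the argument names, and the identity-covariance example are hypothetical. It uses the fact that the denominator of the display is the perfect square \([n_2{\text {tr}}({{\varvec{\Sigma }}}_1)+n_1{\text {tr}}({{\varvec{\Sigma }}}_2)]^2\). The printed value `2 / v` is the degrees of freedom that a two-moment \(\chi ^2\) match would assign to a variable with mean 1 and variance `v`; treating it as the article's \(d_2\) is an assumption.

```python
def relative_variance(n1, n2, tr1, tr1_sq, tr2, tr2_sq):
    """Var[S*/tr(Sigma_n)] evaluated from the displayed formula.

    tr1, tr2       : tr(Sigma_1), tr(Sigma_2)
    tr1_sq, tr2_sq : tr(Sigma_1^2), tr(Sigma_2^2)
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    numerator = 2.0 * (n2**2 * tr1_sq / (n1 - 1) + n1**2 * tr2_sq / (n2 - 1))
    # The three-term denominator in the display is this perfect square.
    denominator = (n2 * tr1 + n1 * tr2) ** 2
    return numerator / denominator


# Hypothetical example: identity covariances in dimension 5,
# so every trace passed in equals 5, with balanced sample sizes m.
for m in (11, 101, 1001):
    v = relative_variance(m, m, 5.0, 5.0, 5.0, 5.0)
    print(m, v, 2.0 / v)
```

In this balanced identity-covariance example the variance reduces to \(1/[5(m-1)]\), so it decreases at the rate \(1/m\).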

$$\begin{aligned} F^*_{n,p,0}=\frac{T^*_{n,p,0}}{{\text {tr}}({{\varvec{\Sigma }}}_n)}[1+o_p(1)],\; \text{ and } \;\tilde{F}^*_{n,p,0}=\frac{T^*_{n,p,0}-{\text {tr}}({{\varvec{\Sigma }}}_n)}{\sqrt{2{\text {tr}}({{\varvec{\Sigma }}}_n^2)}}[1+o_p(1)]. \end{aligned}$$

Then under Conditions C1–C4, as \(n,p\rightarrow \infty\), Theorem 1(a) and (17) follow directly from Theorem 1(a) of Zhang et al. (2021), and under Conditions C1–C3 and C5, as \(n,p\rightarrow \infty\), Theorem 1(b) and (17) follow directly from Theorem 1(b) of Zhang et al. (2021). \(\square\)

Proof of Theorem 2

We first prove (a). Under Conditions C1–C4, Theorem 1(a) indicates that as \(n,p\rightarrow \infty\) we have \((F_{n,p,0}-1)/\sqrt{2/d_1}{\mathop {\longrightarrow }\limits ^{L}}\zeta\). In addition, under Conditions C1–C3, as \(n\rightarrow \infty\), we have \(\hat{d}_1/d_1{\mathop {\longrightarrow }\limits ^{P}}1\) and \(\hat{d}_2/d_2{\mathop {\longrightarrow }\limits ^{P}}1\) uniformly for all p. Therefore, under the given conditions, we have

$$\begin{aligned} \begin{aligned}&\quad \Pr \left[ F_{n,p}\ge F_{\hat{d}_1,\hat{d}_2}(\alpha )\right] \\&=\Pr \left[ F_{n,p,0}\ge F_{\hat{d}_1,\hat{d}_2}(\alpha )-\frac{n_1n_2n^{-1}\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2}{{\text {tr}}(\hat{{{\varvec{\Sigma }}}}_n)}\right] [1+o(1)]\\&=\Pr \left[ \frac{F_{n,p,0}-1}{\sqrt{2/d_1}}\ge \frac{F_{\hat{d}_1,\hat{d}_2}(\alpha )-1}{\sqrt{2/d_1}}-\frac{n_1n_2n^{-1}\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2}{\sqrt{2/d_1}{\text {tr}}(\hat{{{\varvec{\Sigma }}}}_n)}\right] [1+o(1)]\\&=\Pr \left\{ \zeta \ge \frac{F_{d_1,d_2}(\alpha )-1}{\sqrt{2/d_1}}-\frac{n\tau (1-\tau )\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2}{\left[ 2{\text {tr}}({{\varvec{\Sigma }}}^2)\right] ^{1/2}}\right\} [1+o(1)].\\ \end{aligned} \end{aligned}$$

Next we prove (b). Under Conditions C1–C3 and C5, Theorem 1(b) indicates that as \(n,p\rightarrow \infty\), we have \((F_{n,p,0}-1)/\sqrt{2/d_1}{\mathop {\longrightarrow }\limits ^{L}}\mathcal {N}(0,1)\). By Remark 2, we have \([F_{d_1,d_2}(\alpha )-1]/\sqrt{2/d_1}\rightarrow z_\alpha\) when \(d_2\rightarrow \infty\). Therefore, under the given conditions, we have

$$\begin{aligned} \begin{aligned}&\quad \Pr \left[ F_{n,p}\ge F_{\hat{d}_1,\hat{d}_2}(\alpha )\right] \\&=\Pr \left[ F_{n,p,0}\ge F_{\hat{d}_1,\hat{d}_2}(\alpha )-\frac{n_1n_2n^{-1}\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2}{{\text {tr}}(\hat{{{\varvec{\Sigma }}}}_n)}\right] [1+o(1)]\\&=\Pr \left[ \frac{F_{n,p,0}-1}{\sqrt{2/d_1}}\ge \frac{F_{\hat{d}_1,\hat{d}_2}(\alpha )-1}{\sqrt{2/d_1}}-\frac{n_1n_2n^{-1}\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2}{\sqrt{2/d_1}{\text {tr}}(\hat{{{\varvec{\Sigma }}}}_n)}\right] [1+o(1)]\\&=\Phi \left\{ -z_{\alpha }+\frac{n\tau (1-\tau )\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2}{\left[ 2{\text {tr}}({{\varvec{\Sigma }}}^2)\right] ^{1/2}}\right\} [1+o(1)],\\ \end{aligned} \end{aligned}$$

where \(\Phi (\cdot )\) denotes the cumulative distribution function of \(\mathcal {N}(0,1)\).
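
The leading term of the power expression in part (b) is straightforward to evaluate. The following is a minimal sketch using only the Python standard library; the function and argument names are hypothetical, \(z_\alpha\) is assumed to be the upper \(100\alpha\) percentile of \(\mathcal {N}(0,1)\), and \({\text {tr}}({{\varvec{\Sigma }}}^2)\) is passed in as a number rather than computed from data.

```python
from statistics import NormalDist


def asymptotic_power(alpha, n, tau, delta_sq, tr_sigma_sq):
    """Leading term Phi(-z_alpha + n*tau*(1-tau)*delta_sq / sqrt(2*tr_sigma_sq)).

    delta_sq    : squared Euclidean norm of mu_1 - mu_2
    tr_sigma_sq : tr(Sigma^2) as it appears in the display
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if tr_sigma_sq <= 0.0:
        raise ValueError("tr_sigma_sq must be positive")
    std_normal = NormalDist()
    # Assumption: z_alpha is the upper 100*alpha percentile of N(0, 1).
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    shift = n * tau * (1.0 - tau) * delta_sq / (2.0 * tr_sigma_sq) ** 0.5
    return std_normal.cdf(-z_alpha + shift)


# Hypothetical inputs: the power grows with the squared mean difference.
for delta_sq in (0.0, 0.1, 0.5):
    print(delta_sq, asymptotic_power(0.05, 100, 0.5, delta_sq, 10.0))
```

When the mean difference is zero the expression reduces to \(\Phi (-z_\alpha )=\alpha\), the nominal size, which gives a quick sanity check on the inputs.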

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhu, T., Wang, P. & Zhang, JT. Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference F-type test. Comput Stat 39, 3207–3230 (2024). https://doi.org/10.1007/s00180-023-01433-6

  • DOI: https://doi.org/10.1007/s00180-023-01433-6
