Abstract
The p value is the probability under the null hypothesis of obtaining an experimental result that is at least as extreme as the one that we have actually obtained. That probability plays a crucial role in frequentist statistical inferences. But if we take the word ‘extreme’ to mean ‘improbable’, then we can show that this type of inference can be very problematic. In this paper, I argue that it is a mistake to adopt such an interpretation. Under minimal assumptions about the alternative hypothesis, I explain why ‘extreme’ means ‘outside the most precise predicted range of experimental outcomes for a given upper bound probability of error’. In doing so, I rebut recent formulations of recurrent criticisms against the frequentist approach in statistics and underscore the importance of random variables.
Notes
There are two main schools of thought in frequentist testing: the Fisherian and the Neyman–Pearson. The decision rule presented here is better suited to a Neyman–Pearson framework, according to which the rejection of H0 implies the acceptance of an alternative hypothesis (H1). The Neyman–Pearson approach accordingly aims to minimise the probability of rejecting H0 when H0 is true (the type-I error) and to minimise the probability of rejecting H1 when H1 is true (the type-II error). Fisher, on the other hand, was against a formal treatment of the type-II error. He also criticised the ‘accept/reject’ procedure and preferred to interpret the p value as providing degrees of evidence against H0. I will alert the reader when the differences matter.
The null hypothesis is the default hypothesis. It is the one that we accept unless the evidence suggests that we should reject it.
What I mean by ‘entrenched’ is that they are recurrent and appear in high-profile publications.
Elliott Sober coined the expression ‘probabilistic modus tollens’. I shall also explain why he claims that it is invalid.
Ian Hacking traces the origin of that fallacy back to John Arbuthnot (1710) (Hacking 1965, p. 75).
Sober actually discusses an experiment involving a coin. But the point is essentially the same.
Wagenmakers’ article also provides references to other scientific work in which we can find the same argument.
The significance level of a test (\(\alpha \) for short) is the threshold that determines if a p value is low enough to reject H0.
A critical region is a set of extreme outcomes such that we would reject H0 if our test statistic belonged to it. If every possible outcome is as extreme as any other, then the critical region includes (or excludes) all of them, which is unreasonable.
When we are dealing with discrete variables, we talk about the distribution of mass, and when we are working with continuous variables, we talk about the distribution of density.
That definition is more precise since there might be more than one variable involved in a statistical test.
I would like to point out that the puzzle is not very convincing. The only difference between the two distributions in Fig. 1 should be a difference of parameters. It is not obvious what kind of parameter would produce both distributions as its value changes.
Here I make 50 rolls instead of ten because it makes the following chi-square test valid and it makes every possible vector very improbable.
A computer simulation of a fair die generated the latter (see Appendix).
We define the distribution with 5 degrees of freedom because once we have counted the observed frequencies for 5 dimensions of our random vector, we can simply deduce the frequency associated with the remaining dimension.
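The notes on the significance level, the critical region and the degrees of freedom can be tied together in a short R sketch. The value of \(\alpha \) used here (0.05) is an assumption made only for illustration, as is the conventional rule of thumb that every expected count should be at least 5 for the chi-square approximation to be reliable; neither is stated in the notes above. The variable names are hypothetical.

```r
# Assumed significance level; the notes do not fix a value for alpha.
alpha <- 0.05

# A six-sided die gives 6 categories, hence 6 - 1 = 5 degrees of freedom:
# once five frequencies are known, the sixth follows from the total.
k <- 6
df <- k - 1

# Boundary of the critical region for the chi-square statistic:
# H0 is rejected when the statistic exceeds this value.
critical_value <- qchisq(1 - alpha, df = df)

# Expected count per face under a fair die, for 50 rolls and for 10 rolls.
expected_50 <- 50 / k
expected_10 <- 10 / k

# Rule-of-thumb check (an assumption, see lead-in): expected counts >= 5.
rule_ok_50 <- expected_50 >= 5
rule_ok_10 <- expected_10 >= 5

print(c(critical_value = critical_value,
        expected_50 = expected_50,
        expected_10 = expected_10))
print(c(rule_ok_50 = rule_ok_50, rule_ok_10 = rule_ok_10))
```

Under these assumptions, 50 rolls meet the rule of thumb while ten rolls do not, which is one plausible reading of why the larger sample is needed for the test.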
References
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.
Greco, D. (2011). Significance testing in theory and practice. British Journal for the Philosophy of Science, 62, 607–637.
Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.
Hines, W. W., et al. (2003). Probability and statistics in engineering (4th ed.). New York: Wiley.
Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical statistics (5th ed.). Englewood Cliffs, NJ: Prentice Hall.
Jeffreys, H. (1961). Theory of probability. Oxford: Oxford University Press.
Sober, E. (2008). Evidence and evolution. Cambridge: Cambridge University Press.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of \(P\) values. Psychonomic Bulletin & Review, 14, 779–804.
Acknowledgments
I am grateful to the anonymous referees for their very helpful comments. I would also like to thank the participants at the ‘Journée de travail en philosophie analytique’ at Laval University.
Appendix
Here is the program that I used to obtain S with R:
library(TeachingDemos)
dice(rolls=50, ndice=1, sides=6, plot.it=TRUE)
Here is the program that I used to perform a ‘Goodness of Fit’ test with R:
vect=c(9,7,7,7,13,7)
vectprob=c(1/6,1/6,1/6,1/6,1/6,1/6)
chisq.test(vect, p=vectprob)
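As a cross-check on the call above, the chi-square statistic and its p value can be recomputed by hand. This is a minimal sketch, not the author's own code; the names `observed`, `expected`, `chi_sq` and `p_value` are introduced here for illustration only.

```r
# Observed face counts from the appendix (faces 1 to 6, 50 rolls in total)
observed <- c(9, 7, 7, 7, 13, 7)
n <- sum(observed)

# Expected counts under the null hypothesis of a fair die
expected <- n * rep(1/6, 6)

# Pearson chi-square statistic: sum of (O - E)^2 / E over the six faces
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(observed) - 1

# p value: upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(c(statistic = chi_sq, df = df, p_value = p_value))
```

The statistic should agree with the one reported by `chisq.test`, since both apply the same formula to the same counts.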
Here is the program that I used to maximise the multinomial mass function with R:
a=factorial(50)
b=factorial(8)
c=factorial(9)
denom=(c^2)*(b^4)
d=(a/denom)
frac=1/(6^50)
d*frac
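The factorial computation above can also be checked against R's built-in `dmultinom` function from the `stats` package. The sketch below assumes that the maximum is taken at the most balanced count vector for 50 rolls (two faces observed 9 times and four faces observed 8 times), which is what the factorials 9! and 8! in the program suggest. The variable names are hypothetical.

```r
# Face counts that are as balanced as 50 rolls of a six-sided die allow
# (two faces appear 9 times, four faces appear 8 times).
mode_counts <- c(9, 9, 8, 8, 8, 8)
fair <- rep(1/6, 6)

# Multinomial probability of the balanced vector under a fair die
p_mode <- dmultinom(mode_counts, size = 50, prob = fair)

# Same quantity written out with factorials, as in the appendix program
p_manual <- factorial(50) / (factorial(9)^2 * factorial(8)^4) / 6^50

# Probability of the count vector used in the goodness-of-fit test
p_observed <- dmultinom(c(9, 7, 7, 7, 13, 7), size = 50, prob = fair)

print(c(mode = p_mode, manual = p_manual, observed = p_observed))
```

Even the most probable vector has a very small probability, and the vector used in the goodness-of-fit test is less probable still, which illustrates why improbability alone cannot serve as the criterion of extremeness.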
Cite this article
Rochefort-Maranda, G. On the correct interpretation of p values and the importance of random variables. Synthese 193, 1777–1793 (2016). https://doi.org/10.1007/s11229-015-0807-0
DOI: https://doi.org/10.1007/s11229-015-0807-0