When we work with null hypothesis significance testing and the null hypothesis is true, the distribution of observed p-value is asymptotically uniform. However, the distribution shape is not always uniform in the finite case. For example, when we work with rank-based tests like the Mann–Whitney U test, the distribution of the p-values is discrete with a limited set of possible values. This should be taken into account when we design a testing procedure for small samples and choose the significance level.
Previously, we already discussed the minimum reasonable significance level of the Mann-Whitney U test for small samples. In this post, we explore the full distribution of the p-values for this case.
Student’s t-test
We start with the Student’s t-test to check the p-value distribution in the “simple” case. Let’s generate \(10\,000\) pairs of samples of size \(5\) from the standard normal distribution, calculate the p-value using the two-sided Student’s t-test, and build the density plot for the observed p-values:
As we can see, the distribution looks uniform. And this is the desired property of a statistical test. Indeed, the specified significant level \(\alpha\) is used to specify the desired false-positive rate. Mathematically, it can be expressed as \(\mathbb{P}(p \leq \alpha) = \alpha\), which is a definition of the uniform distribution. Now let us see what would happen if we switch to the Mann–Whitney U test.
Mann–Whitney U test
Now we generate \(10\,000\) pairs of samples of size \(n\) from the standard normal distribution, calculate the p-value using the two-sided Mann–Whitney U test, and build the density plot for the observed p-values. Here is the result for \(n=3\):
As we can see, if both samples contain exactly three elements each, the p-value always belongs to the following set (assuming the distribution is continuous, the samples do not contain ties): \(\{ 0.1, 0.2, 0.4, 0.7, 1.0 \}\). Based on the above plot, we can even guess the probability of observing each p-value:
\[\mathbb{P}(p = 0.1) = 0.1, \]
\[\mathbb{P}(p = 0.2) = 0.1, \]
\[\mathbb{P}(p = 0.4) = 0.2, \]
\[\mathbb{P}(p = 0.7) = 0.3, \]
\[\mathbb{P}(p = 1.0) = 0.3. \]
Thus, \(\mathbb{P}(p \leq \alpha) = \alpha\) is true only for \(\alpha\) values from the same set. However, it is not true for other \(\alpha\) values. Thus,
\[\mathbb{P}(p \leq \alpha) = 0,\quad\textrm{for}\quad \alpha \in [0;0.1), \]
\[\mathbb{P}(p \leq \alpha) = 0.1,\quad\textrm{for}\quad \alpha \in [0.1;0.2), \]
\[\mathbb{P}(p \leq \alpha) = 0.2,\quad\textrm{for}\quad \alpha \in [0.2;0.4), \]
\[\mathbb{P}(p \leq \alpha) = 0.4,\quad\textrm{for}\quad \alpha \in [0.4;0.7), \]
\[\mathbb{P}(p \leq \alpha) = 0.7,\quad\textrm{for}\quad \alpha \in [0.7;1.0), \]
If changes of \(\alpha\) within any of these intervals (e.g., from \(\alpha = 0.19\) to \(\alpha = 0.11\)) will not affect the test result.
Now let us look at the same distribution for \(n=5\), \(n=7\), and \(n=15\):
As we can see, as \(n\) grows, we get more distinct values in the observed distribution of p-values, but the list of the exact values is always limited. It can also be easily shown that when we compare two samples of sizes \(n\) and \(m\) using the two-sided Mann–Whitney U test, all possible p-values can be expressed as \(2k/C_{n+m}^n,\;k\in \mathbb{N}\), and \(2/C_{n+m}^n\) is the minimum possible value.