p-value distribution of the Brunner–Munzel test in the finite case

In one of the previous posts, I explored the distribution of observed p-values for the Mann–Whitney U test in the finite case when the null hypothesis is true. It is time to repeat the experiment for the Brunner–Munzel test.

We generate $100\,000$ pairs of samples of size $n$ from the standard normal distribution, calculate the p-value using the two-sided Brunner–Munzel test, and build the density plot for the observed p-values. We use a test implementation that is extended for the corner case with values $0$ and $1$. Here is the result for $n=3$:

As with other rank-based tests, we get a discrete distribution with a limited set of distinct p-values. The probability of each p-value is as follows:

$$ \mathbb{P}(p = 0.0000) = 0.1, $$
$$ \mathbb{P}(p \approx 0.0686) = 0.1, $$
$$ \mathbb{P}(p \approx 0.3465) = 0.2, $$
$$ \mathbb{P}(p \approx 0.5734) = 0.1, $$
$$ \mathbb{P}(p \approx 0.6667) = 0.2, $$
$$ \mathbb{P}(p \approx 0.8683) = 0.1, $$
$$ \mathbb{P}(p \approx 0.8727) = 0.2. $$
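These probabilities can also be obtained without simulation. For continuous data under the null hypothesis, all $\binom{6}{3} = 20$ ways to assign pooled ranks to the first sample are equally likely, so enumerating them gives the exact distribution. Here is a minimal Python sketch under a few assumptions: the samples have no ties, the p-value comes from the usual t-approximation with Welch-style degrees of freedom, and completely separated samples are mapped to $p = 0$ (my reading of the "extended" corner case mentioned above). The function names are made up for illustration, and the t-distribution tail is computed via plain Simpson integration only to avoid external dependencies.

```python
import math
from collections import Counter
from itertools import combinations


def t_two_sided_p(w, df, steps=2000):
    """Two-sided Student-t p-value: 1 - 2 * integral of the density over [0, |w|]."""
    a = abs(w)
    if a == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    # Composite Simpson rule (steps must be even).
    h = a / steps
    s = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * s * h / 3.0)


def placement_variance(sample, pooled_rank):
    """Mean pooled rank and the Brunner-Munzel variance estimate of one sample."""
    n = len(sample)
    overall = [pooled_rank[v] for v in sorted(sample)]
    mean_rank = sum(overall) / n
    # Internal ranks of a sorted, tie-free sample are simply 1..n.
    dev = [r - (i + 1) - mean_rank + (n + 1) / 2 for i, r in enumerate(overall)]
    return mean_rank, sum(d * d for d in dev) / (n - 1)


def brunner_munzel_p(x, y):
    """Two-sided Brunner-Munzel p-value for tie-free samples (illustrative sketch)."""
    n1, n2 = len(x), len(y)
    pooled_rank = {v: i + 1 for i, v in enumerate(sorted(list(x) + list(y)))}
    m1, s1 = placement_variance(x, pooled_rank)
    m2, s2 = placement_variance(y, pooled_rank)
    v1, v2 = n1 * s1, n2 * s2
    if v1 + v2 < 1e-12:
        return 0.0  # ASSUMPTION: complete separation is treated as p = 0
    w = n1 * n2 * (m2 - m1) / ((n1 + n2) * math.sqrt(v1 + v2))
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_two_sided_p(w, df)


def null_distribution(n):
    """Exact distribution of the p-value over all equally likely rank splits."""
    values = list(range(1, 2 * n + 1))
    counts = Counter()
    for xs in combinations(values, n):
        ys = [v for v in values if v not in xs]
        counts[round(brunner_munzel_p(list(xs), ys), 4)] += 1
    total = sum(counts.values())
    return {p: c / total for p, c in sorted(counts.items())}


if __name__ == "__main__":
    for p, prob in null_distribution(3).items():
        print(f"P(p = {p:.4f}) = {prob:.1f}")
```

The enumeration is feasible only for small $n$ because the number of rank splits grows as $\binom{2n}{n}$; for larger samples, the Monte Carlo approach described above is the practical option.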

As we can see, the observed distribution is not as nice as in the Mann–Whitney U test case. In particular, the usual expectation that the chosen statistical significance level matches the actual false-positive rate ($\mathbb{P}(p \leq \alpha) = \alpha$) is violated:

$$ \mathbb{P}(p \leq \alpha) = 0.1 \quad\textrm{for}\quad \alpha \in [0.0000;0.0686), $$
$$ \mathbb{P}(p \leq \alpha) = 0.2 \quad\textrm{for}\quad \alpha \in [0.0686;0.3465), $$
$$ \mathbb{P}(p \leq \alpha) = 0.4 \quad\textrm{for}\quad \alpha \in [0.3465;0.5734), $$
$$ \mathbb{P}(p \leq \alpha) = 0.5 \quad\textrm{for}\quad \alpha \in [0.5734;0.6667), $$
$$ \mathbb{P}(p \leq \alpha) = 0.7 \quad\textrm{for}\quad \alpha \in [0.6667;0.8683), $$
$$ \mathbb{P}(p \leq \alpha) = 0.8 \quad\textrm{for}\quad \alpha \in [0.8683;0.8727), $$
$$ \mathbb{P}(p \leq \alpha) = 1.0 \quad\textrm{for}\quad \alpha \in [0.8727;1.0000]. $$

Therefore, the test should be used cautiously when considering small samples. For example, if we set the statistical significance level $\alpha = 0.07$, the actual false-positive rate will be $\mathbb{P}(p \leq \alpha) = 0.2$.
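To check such numbers for an arbitrary $\alpha$, we can sum the probabilities of all p-values that do not exceed it. The following Python sketch does exactly that for $n=3$; the table is copied from the probabilities listed earlier, and the function name is hypothetical. Since the p-values are rounded to four digits, results for $\alpha$ lying within rounding distance of a listed value should not be trusted.

```python
from fractions import Fraction

# (p-value, probability) pairs for n = 3, copied from the list above.
PMF = [
    (0.0000, Fraction(1, 10)),
    (0.0686, Fraction(1, 10)),
    (0.3465, Fraction(2, 10)),
    (0.5734, Fraction(1, 10)),
    (0.6667, Fraction(2, 10)),
    (0.8683, Fraction(1, 10)),
    (0.8727, Fraction(2, 10)),
]


def actual_false_positive_rate(alpha):
    """Return P(p <= alpha) for the discrete null distribution above."""
    # Exact fractions avoid floating-point drift when adding tenths.
    return float(sum(prob for p, prob in PMF if p <= alpha))


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.07, 0.10):
        rate = actual_false_positive_rate(alpha)
        print(f"alpha = {alpha:.2f} -> actual rate = {rate:.1f}")
```

For the common choice $\alpha = 0.05$, this gives an actual rate of $0.1$, twice the nominal level.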

Now let us look at the same distribution for $n=5$, $n=7$, and $n=15$:

Asymptotically, it becomes uniform, as for other statistical tests. However, for small samples, it has a strange sawtooth-like shape that distorts our expectations of the false-positive rate.
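One way to get a feel for how quickly the discreteness fades is to count the possible rank arrangements. Assuming tie-free data, the p-value depends only on which pooled ranks fall into the first sample, so there are $\binom{2n}{n}$ equally likely arrangements; swapping the two samples does not change a two-sided p-value, so at most half of that many distinct p-values can occur. The sketch below (with a made-up function name) prints this upper bound for the sample sizes considered here. The actual number of distinct values can be smaller, as in the $n=3$ case, where only $7$ of the $10$ possible values appear.

```python
import math


def max_distinct_p_values(n):
    """Upper bound on distinct two-sided p-values for two tie-free samples of size n."""
    arrangements = math.comb(2 * n, n)  # equally likely rank splits under the null
    return arrangements // 2            # swapping samples keeps the p-value


if __name__ == "__main__":
    for n in (3, 5, 7, 15):
        print(f"n = {n:2d}: at most {max_distinct_p_values(n)} distinct p-values")
```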