p-value distribution of the Brunner–Munzel test in the finite case


In one of the previous posts, I explored the distribution of observed p-values for the Mann–Whitney U test in the finite case when the null hypothesis is true. It is time to repeat the experiment for the Brunner–Munzel test.

We generate \(100\,000\) pairs of samples of size \(n\) from the standard normal distribution, calculate the p-value using the two-sided Brunner–Munzel test, and build the density plot for the observed p-values. We use a test implementation that is extended for the corner case with values \(0\) and \(1\). Here is the result for \(n=3\):

Similarly to other rank-based tests, we get a discrete distribution with a limited set of different p-values. The probabilities of each p-value are the following:

\[\mathbb{P}(p = 0.0000) = 0.1, \]

\[\mathbb{P}(p \approx 0.0686) = 0.1, \]

\[\mathbb{P}(p \approx 0.3465) = 0.2, \]

\[\mathbb{P}(p \approx 0.5734) = 0.1, \]

\[\mathbb{P}(p \approx 0.6667) = 0.2, \]

\[\mathbb{P}(p \approx 0.8683) = 0.1, \]

\[\mathbb{P}(p \approx 0.8727) = 0.2. \]
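These probabilities can be cross-checked without simulation. The test depends only on ranks, and normal samples have no ties almost surely, so under the null hypothesis all \(\binom{6}{3} = 20\) ways to assign the pooled ranks to the first sample are equally likely. Below is a minimal sketch in Python that enumerates them using only the standard library. All function names are illustrative, the tail of the t-distribution is computed via a continued fraction for the regularized incomplete beta function, and returning \(p = 0\) for completely separated samples is an assumption about how the corner-case extension behaves, not a description of the implementation used in this post.

```python
import math
from collections import Counter
from itertools import combinations


def _betacf(a, b, x, max_iter=300, eps=3e-16, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def t_two_sided_p(t, df):
    """P(|T| > |t|) for Student's t with possibly fractional df."""
    return _betainc(df / 2.0, 0.5, df / (df + t * t))


def brunner_munzel_p(x, y):
    """Two-sided Brunner-Munzel p-value for samples without ties."""
    nx, ny = len(x), len(y)
    # Placements: how many values of the other sample lie below each value.
    px = [sum(v < u for v in y) for u in x]
    py = [sum(u < v for u in x) for v in y]
    mx, my = sum(px) / nx, sum(py) / ny
    sx = sum((p - mx) ** 2 for p in px) / (nx - 1)
    sy = sum((p - my) ** 2 for p in py) / (ny - 1)
    pooled = nx * sx + ny * sy
    if pooled == 0.0:
        # Completely separated samples; ASSUMED corner-case result p = 0.
        return 0.0
    # Difference of mean pooled ranks, expressed through placements.
    diff = (ny + 1) / 2 + my - (nx + 1) / 2 - mx
    w = nx * ny * diff / ((nx + ny) * math.sqrt(pooled))
    df = pooled ** 2 / ((nx * sx) ** 2 / (nx - 1) + (ny * sy) ** 2 / (ny - 1))
    return t_two_sided_p(w, df)


def p_value_distribution(n):
    """Exact null distribution of the p-value for two tie-free samples of size n."""
    ranks = range(1, 2 * n + 1)
    counts = Counter()
    for xs in combinations(ranks, n):
        ys = [r for r in ranks if r not in xs]
        counts[round(brunner_munzel_p(xs, ys), 4)] += 1
    total = sum(counts.values())
    return {p: c / total for p, c in sorted(counts.items())}


for p, prob in p_value_distribution(3).items():
    print(f"P(p = {p:.4f}) = {prob:.2f}")
```

The enumeration grows as \(\binom{2n}{n}\), so it is practical only for small \(n\); for larger samples, the Monte Carlo approach described above is the more reasonable choice.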

As we can see, the observed distribution is not as nice as in the Mann–Whitney U test case. In particular, the usual expectation that the specified statistical significance level matches the actual false-positive rate (\(\mathbb{P}(p \leq \alpha) = \alpha\)) does not hold:

\[\mathbb{P}(p \leq \alpha)= 0.1 \quad\textrm{for}\quad \alpha \in [0.0000;0.0686), \]

\[\mathbb{P}(p \leq \alpha) = 0.2 \quad\textrm{for}\quad \alpha \in [0.0686;0.3465), \]

\[\mathbb{P}(p \leq \alpha) = 0.4 \quad\textrm{for}\quad \alpha \in [0.3465;0.5734), \]

\[\mathbb{P}(p \leq \alpha) = 0.5 \quad\textrm{for}\quad \alpha \in [0.5734;0.6667), \]

\[\mathbb{P}(p \leq \alpha) = 0.7 \quad\textrm{for}\quad \alpha \in [0.6667;0.8683), \]

\[\mathbb{P}(p \leq \alpha) = 0.8 \quad\textrm{for}\quad \alpha \in [0.8683;0.8727), \]

\[\mathbb{P}(p \leq \alpha) = 1.0 \quad\textrm{for}\quad \alpha \in [0.8727;1.0000]. \]

Therefore, the test should be used cautiously when considering small samples. For example, if we set the statistical significance level \(\alpha = 0.07\), the actual false-positive rate will be \(\mathbb{P}(p \leq \alpha) = 0.2\).
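To make this lookup concrete, here is a small sketch that maps a chosen \(\alpha\) to the actual false-positive rate for \(n=3\). It simply accumulates the probabilities listed earlier; the names `P_VALUES`, `PROBS`, and `actual_false_positive_rate` are illustrative, and since the p-values are rounded to four decimals, results for \(\alpha\) extremely close to a boundary are only approximate.

```python
from bisect import bisect_right

# Distinct p-values for n = 3 and their probabilities, as listed in the post.
P_VALUES = [0.0000, 0.0686, 0.3465, 0.5734, 0.6667, 0.8683, 0.8727]
PROBS = [0.1, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2]


def actual_false_positive_rate(alpha):
    """Return P(p <= alpha) under the null hypothesis for n = 3."""
    k = bisect_right(P_VALUES, alpha)  # number of p-values that are <= alpha
    return round(sum(PROBS[:k]), 10)   # rounding hides floating-point noise


for alpha in (0.01, 0.05, 0.07, 0.10):
    print(f"alpha = {alpha:.2f} -> actual rate = {actual_false_positive_rate(alpha):.1f}")
# alpha = 0.01 -> actual rate = 0.1
# alpha = 0.05 -> actual rate = 0.1
# alpha = 0.07 -> actual rate = 0.2
# alpha = 0.10 -> actual rate = 0.2
```

Note that for any \(\alpha\) below \(0.0686\), including the popular \(0.05\) and \(0.01\), the actual rate stays at \(0.1\).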

Now let us look at the same distribution for \(n=5\), \(n=7\), and \(n=15\):

As the sample size grows, the distribution approaches the uniform one, as it does for other statistical tests. However, for small samples, it has a strange sawtooth-like shape that distorts our expectations of the false-positive rate.