Notes / Mann-Whitney U Test

Classic Mann–Whitney $U$ test

We consider the one-sided Mann-Whitney U test that compares two samples $\mathbf{x}$ and $\mathbf{y}$:

$$\mathbf{x} = ( x_1, x_2, \ldots, x_n ), \quad \mathbf{y} = ( y_1, y_2, \ldots, y_m ).$$

The $U$ statistic for this test is defined as follows:

$$U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad S(a,b) = \begin{cases} 1, & \text{if } a > b, \\ 0.5, & \text{if } a = b, \\ 0, & \text{if } a < b. \end{cases}$$

The $p$-value is obtained based on the distribution of the $U$ statistic. First, we discuss the non-weighted case, and only then continue with the fast implementation.

Classic Mann–Whitney $U$ test: implementation

Now, let’s discuss the implementation. First of all, I want to share my preferences for implementing the classic version. There are two ways to estimate the $p$-value: the exact one and the approximate one.

The classic approach based on the recurrence equation $p_{n,m}(u) = p_{n-1,m}(u - m) + p_{n,m-1}(u)$ is very slow and requires $\mathcal{O}(n m)$ memory. While it’s used in almost all standard implementations of the Mann–Whitney $U$ test, it’s too inefficient in practice. Fortunately, now we the Andreas Löffler’s implementation (see Mann-Whitney U Test: Löffler's Implementation) that allows us to calculate the exact $p$-value much faster using $\mathcal{O}(u)$ memory (worst case $\mathcal{O}(nm)$). It is defined as follows:

$$p_{n,m}(u) = \frac{1}{u} \sum_{i=0}^{u-1} p_{n,m}(i) \cdot \sigma_{n,m}(u - i),$$ $$\sigma_{n,m}(u) = \sum_{u \operatorname{mod} d} \varepsilon_d d,\quad\textrm{where}\; \varepsilon_d = \begin{cases} 1, & \textrm{where}\; 1 \leq d \leq n, \\ 0, & \textrm{else}, \\ -1, & \textrm{where}\; m+1 \leq d \leq m+n. \end{cases}$$

In some cases, we can’t use the exact algorithm because it’s too computationally expensive. It corresponds to two primary cases. The first one is about large sample sizes. The limitation for the classic exact implementation is much stricter than for Löffler’s algorithm, but even Löffler’s algorithm has a reasonable maximum. The second one is about middle values of the $U$ statistic for medium sample sizes (in this case, the computation time may be high, and approximation may provide good accuracy).

The classic normal approximation is too inaccurate: it may produce errors of incredible magnitude (see Mann-Whitney U Test: Error Evaluation). Fortunately, we have the Mann-Whitney U Test: Edgeworth expansion, which greatly increases the approximation accuracy.

The Edgeworth expansion extends the normal approximation approach, which based on the normal distribution $\mathcal{N}(\mu_U, \sigma_U^2)$ defined by the following parameters:

$$\mu_U = \frac{nm}{2},\quad \sigma_U = \sqrt{\frac{nm(n+m+1)}{12}}.$$

The $z$-score is calculated with the continuity correction:

$$z = \frac{U - \mu_U \pm 0.5}{\sigma_U},$$

The $p$-value is defined as follows (assuming the Edgeworth expansion to terms of order $1/m^2$):

$$p_{E7}(z) = \Phi(z) + e^{(3)} \varphi^{(3)}(z) + e^{(5)} \varphi^{(5)}(z) + e^{(7)} \varphi^{(7)}(z),$$

where

$$e^{(3)} = \frac{1}{4!}\left( \frac{\mu_4}{\mu_2^2} - 3 \right),\quad e^{(5)} = \frac{1}{6!}\left( \frac{\mu_6}{\mu_2^3} - 15\frac{\mu_4}{\mu_2^2} + 30 \right),\quad e^{(7)} = \frac{35}{8!}\left( \frac{\mu_4}{\mu_2^2} - 3 \right)^2,$$ $$\mu_2 = \frac{nm(n+m+1)}{12},$$ $$\mu_4 = \frac{mn(m+n+1)}{240} \bigl( 5(m^2 n + m n^2) - 2(m^2 + n^2) + 3mn - (2m + n) \bigr),$$ $$\begin{split} \mu_6 = \frac{mn(m+n+1)}{4032} \bigl( 35m^2 n^2 (m^2 + n^2) + 70 m^3 n^3 - 42 mn (m^3 + n^3) - 14 m^2 n^2 (m + n) +\\ + 16 (m^4 + n^4) - 52 mn (m^2 + n^2) - 43 m^2 n^2 + 32 (m^3 + n^3) +\\ + 14 mn (m + n) + 8 (m^2 + n^2) + 16 mn - 8 (m + n) \bigr), \end{split}$$ $$\varphi^{(k)}(z) = -\varphi(z) H_k(z),$$ $$H_3(z) = z^3 - 3z,$$ $$H_5(z) = z^5 - 10z^3 + 15z,$$ $$H_7(z) = z^7 - 21z^5 + 105z^3 - 105z.$$

The switch between the exact and approximate implementations should acknowledge the business requirements: I recommended using the approximation only if the exact implementation has insufficient performance.

The tie correction may be neglected; I recommended avoiding using the nil hypothesis, and use the minimum-effect approach with a threshold that prevents between-sample tied values.