Mann-Whitney U Test: Error Evaluation

The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

R

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In Mann-Whitney U Test: Minimum meaningful statistical level, I presented the exact equation for such a p-value. If we compare two samples of sizes $n$ and $m$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $1/C_{n+m}^n$. For example, if $n=50$ and $m=5$, we get $1/C_{55}^5 \approx 0.0000002874587$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028

The obtained p-value is $\approx 0.0001337028$, which is $\approx 465$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we now trust this function? Let’s find out!

In fact, the wilcox.test has two implementations of the Mann–Whitney U test: the exact one and the approximated one. The choice of a particular strategy is controlled by the exact parameter of this function. Here is an example from the official documentation:

> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
[1] 0.1272061
> wilcox.test(x, y, alternative = "greater", exact = FALSE)$p.value
[1] 0.1223118

In this example, the approximation looks accurate, with a difference ofhis makes it an impra only about $5\%$. When the exact parameter is not specified, it is determined as follows:

if(is.null(exact))
    exact <- (n.x < 50) && (n.y < 50)

Before we start exploring different extreme cases, let us briefly review both methods to estimate the p-value within this test.

The first step of the Mann–Whitney U test is calculating the U statistic:

$$ U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad

S(a,b) = \begin{cases} 1, & \text{if } a > b, \ 0.5, & \text{if } a = b, \ 0, & \text{if } a < b. \end{cases} $$

It is a simple rank-based statistic, which varies from $0$ to $nm$. Once we get the value of $U$, we should match it with the distribution of $U$ statistics under the valid null hypothesis. The distribution can be obtained in two different ways:

The exact implementation
The straightforward implementation uses dynamic programming to obtain the exact probability of obtaining the given $U$ statistic for the given $n$ and $m$. The only drawback of this solution is its computational complexity (which is $\mathcal{O}(n^2 m^2)$ for both time and memory). This makes this approach an impractical solution for large samples.
The approximated implementation
Fortunately, we have a method of large sample approximation (see hollander1973, pages 68–75). Long story short, we approximate the distribution of $U$ statistics with the following normal distribution (in the case of tied values, a small correction is needed):

$$ \mathcal{N}\left( \frac{nm}{2},\, \frac{nm(n+m+1)}{12} \right) $$

For the middle values of $U$ (around $nm/2$), this approximation provides reasonably accurate results. However, it performs extremely poorly at the tails. To illustrate the problems, we perform a simple numerical study:

Enumerate $n$ from $1$ to $50$, set $m$ to $50$ (since $50$ is a critical value, after which wilcox.text switches to the approximated algorithm);
For each pair of $n,\, m$, we consider two non-overlapped samples $x = \{ 101, 102, \ldots, 100 + n \}$ and $y = \{ 1, 2, \ldots, m \}$;
For each pair of samples $x$ and $y$, we estimate the exact and approximated p-values using wilcox.test (R 4.3.0). We also evaluate the ratio between two obtained p-values.

This simulation can be performed using the following simple script:

options(scipen = 999)
df <- expand.grid(n = 1:50, m = 50)
df$approx <- sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = F)$p.value)
df$exact <-  sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = T)$p.value)
df$ratio <- df$approx / df$exact
df

Here are the results:

n	m	approx	exact	ratio
1	50	0.048011543131958198116216	0.019607843137254901688670827297755750	2.448589
2	50	0.009252304327263300917639	0.000754147812971342414341269222433084	12.268556
3	50	0.002068602594726797437585	0.000042687612054981640980627632941946	48.459084
4	50	0.000507425712710512176983	0.000003162045337406047683961830482846	160.473889
5	50	0.000133702777437926781292	0.000000287458667036913421002024661768	465.120008
6	50	0.000037419721898732403792	0.000000030799142896812156031614014003	1214.959846
7	50	0.000011044391063200054742	0.000000003782350882064650319878251473	2919.980564
8	50	0.000003420097224807733340	0.000000000521703569939951803913405917	6555.633164
9	50	0.000001106806028706227259	0.000000000079581900499314674285699250	13907.760706
10	50	0.000000373114142492364927	0.000000000013263650083219112919478920	28130.577944
11	50	0.000000130667516485322913	0.000000000002391805752711643505365161	54631.324612
12	50	0.000000047426982847987003	0.000000000000462930145686124477766408	102449.545120
13	50	0.000000017803734950873793	0.000000000000095525268157454267935196	186377.230803
14	50	0.000000006899453656698780	0.000000000000020896152409443119138672	330178.184075
15	50	0.000000002755517795088092	0.000000000000004822189017563796906354	571424.675609
16	50	0.000000001132420441204299	0.000000000000001169015519409405310631	968695.814899
17	50	0.000000000478204516517026	0.000000000000000296615878059102926844	1612201.341500
18	50	0.000000000207230566590806	0.000000000000000078515967721527219964	2639342.959203
19	50	0.000000000092045404447794	0.000000000000000021620338937811845233	4257352.519429
20	50	0.000000000041857100940344	0.000000000000000006177239696517670397	6776020.196196
21	50	0.000000000019466963240158	0.000000000000000001827070896153113595	10654738.839717
22	50	0.000000000009250432374139	0.000000000000000000558271662713451387	16569768.791735
23	50	0.000000000004487040794828	0.000000000000000000175893811539854518	25509941.228442
24	50	0.000000000002219810231291	0.000000000000000000057046641580493394	38912198.330882
25	50	0.000000000001119116799357	0.000000000000000000019015547193497798	58852726.559442
26	50	0.000000000000574519790304	0.000000000000000000006505318776722924	88315393.914190
27	50	0.000000000000300116336043	0.000000000000000000002281085804824923	131567315.621484
28	50	0.000000000000159415937793	0.000000000000000000000818851314552536	194682398.330711
29	50	0.000000000000086049778707	0.000000000000000000000300590988886374	286268657.038994
30	50	0.000000000000047171214487	0.000000000000000000000112721620832390	418475303.483840
31	50	0.000000000000026245863540	0.000000000000000000000043140373404989	608382855.044904
32	50	0.000000000000014813624414	0.000000000000000000000016835267670240	879916179.758513
33	50	0.000000000000008477165239	0.000000000000000000000006693540158047	1266469616.742985
34	50	0.000000000000004916004777	0.000000000000000000000002709290063971	1814499245.458048
35	50	0.000000000000002887621675	0.000000000000000000000001115590026341	2588425503.078257
36	50	0.000000000000001717269860	0.000000000000000000000000466991173817	3677306886.886681
37	50	0.000000000000001033523240	0.000000000000000000000000198605441738	5203901920.395980
38	50	0.000000000000000629226919	0.000000000000000000000000085761440751	7336944364.786281
39	50	0.000000000000000387373398	0.000000000000000000000000037580856059	10307732138.774328
40	50	0.000000000000000241060228	0.000000000000000000000000016702602693	14432494906.850683
41	50	0.000000000000000151579231	0.000000000000000000000000007525348466	20142486658.620667
42	50	0.000000000000000096277356	0.000000000000000000000000003435485169	28024384121.614346
43	50	0.000000000000000061750057	0.000000000000000000000000001588450132	38874406735.166344
44	50	0.000000000000000039980098	0.000000000000000000000000000743529849	53770670472.618752
45	50	0.000000000000000026122455	0.000000000000000000000000000352198350	74169725522.991974
46	50	0.000000000000000017219619	0.000000000000000000000000000168761709	102035109626.813110
47	50	0.000000000000000011448630	0.000000000000000000000000000081771137	140008207685.065460
48	50	0.000000000000000007675202	0.000000000000000000000000000040051169	191634915718.995544
49	50	0.000000000000000005187081	0.000000000000000000000000000019823306	261665784427.152771
50	50	0.000000000000000003533036	0.000000000000000000000000000009911653	356452748856.307617

Thus, for $n=m=50$, the ratio between the exact (the correct one) and approximated (the default one) solutions is about $3.5 \cdot 10^{11}$. Of course, both p-values are pretty small in this case. If you are using the notorious $\alpha = 0.05$, you will most likely avoid such a problem (however, for $n=1,\, m = 50$, the exact true p-value is $0.0196$, while the approximated one is $0.0480$, whish is alarmingly close to $0.05$).

If you prefer using extremely low statistical significance levels and there are non-zero chances of getting extreme values of the $U$ statistic, you may consider making exact = TRUE your default way to call wilcox.test. Such a simple trick may save you from misleading research results.

Python

In SciPy¹, the Mann-Whitney U test is available via the function mannwhitneyu. Similarly to R, it has two methods to estimate the value: the exact one and the normally approximated one. The difference between these methods can be shown using the following snippet:

import numpy as np
from scipy.stats import mannwhitneyu

x = np.arange(101, 106) # n = 5
y = np.arange(1, 51)    # m = 50

print(mannwhitneyu(x, y, alternative="greater", method="exact"))
print(mannwhitneyu(x, y, alternative="greater", method="asymptotic"))

This code gives the following output:

MannwhitneyuResult(statistic=250.0, pvalue=2.874586670369134e-07)
MannwhitneyuResult(statistic=250.0, pvalue=0.00013370277743792665)

The results match the situation we observed in R: the approximated value is 465 times larger than the exact value.

However, the default strategy for choosing between exact and asymptotic is different between R and SciPy. In R, wilcox.text uses the exact strategy when n < 50 AND m < 50 (and there are no tied values). In SciPy, mannwhitneyu uses the exact strategy when n <= 8 OR m <= 8 (and there are no tied values). Therefore, SciPy switches to the asymptotic methods when both samples contain at least 9 elements. Here is a snippet that illustrates this transition:

import numpy as np
from scipy.stats import mannwhitneyu

x = np.arange(101, 109) # n = 8
y = np.arange(1, 9)     # m = 8

print("n = m = 8")
print("auto:       " + str(mannwhitneyu(x, y, alternative="greater")))
print("exact:      " + str(mannwhitneyu(x, y, alternative="greater", method="exact")))
print("asymptotic: " + str(mannwhitneyu(x, y, alternative="greater", method="asymptotic")))
print("")

x = np.arange(101, 110) # n = 9
y = np.arange(1, 10)    # m = 9

print("n = m = 9")
print("auto:       " + str(mannwhitneyu(x, y, alternative="greater")))
print("exact:      " + str(mannwhitneyu(x, y, alternative="greater", method="exact")))
print("asymptotic: " + str(mannwhitneyu(x, y, alternative="greater", method="asymptotic")))

Here is the corresponding output:

n = m = 8
auto:       MannwhitneyuResult(statistic=64.0, pvalue=7.77000777000777e-05)
exact:      MannwhitneyuResult(statistic=64.0, pvalue=7.77000777000777e-05)
asymptotic: MannwhitneyuResult(statistic=64.0, pvalue=0.00046955284955859495)

n = m = 9
auto:       MannwhitneyuResult(statistic=81.0, pvalue=0.00020614740103084563)
exact:      MannwhitneyuResult(statistic=81.0, pvalue=2.0567667626491154e-05)
asymptotic: MannwhitneyuResult(statistic=81.0, pvalue=0.00020614740103084563)

Thus, we should be careful when using mannwhitneyu from SciPy: the asymptotic method is enabled by default even for relatively small samples which can lead to significant errors in p-value evaluation for extreme values of the U statistic.

We consider SciPy 1.10.1 ↩︎