Mann-Whitney U Test: Error Evaluation

The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

R

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In Mann-Whitney U Test: Minimum meaningful statistical level, I presented the exact equation for such a p-value. If we compare two samples of sizes $n$ and $m$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $1/C_{n+m}^n$. For example, if $n=50$ and $m=5$, we get $1/C_{55}^5 \approx 0.0000002874587$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028

The obtained p-value is $\approx 0.0001337028$, which is $\approx 465$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we now trust this function? Let’s find out!

In fact, the wilcox.test has two implementations of the Mann–Whitney U test: the exact one and the approximated one. The choice of a particular strategy is controlled by the exact parameter of this function. Here is an example from the official documentation:

> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
[1] 0.1272061
> wilcox.test(x, y, alternative = "greater", exact = FALSE)$p.value
[1] 0.1223118

In this example, the approximation looks accurate, with a difference ofhis makes it an impra only about $5\%$. When the exact parameter is not specified, it is determined as follows:

if(is.null(exact))
    exact <- (n.x < 50) && (n.y < 50)

Before we start exploring different extreme cases, let us briefly review both methods to estimate the p-value within this test.

The first step of the Mann–Whitney U test is calculating the U statistic:

$$ U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad

S(a,b) = \begin{cases} 1, & \text{if } a > b, \ 0.5, & \text{if } a = b, \ 0, & \text{if } a < b. \end{cases} $$

It is a simple rank-based statistic, which varies from $0$ to $nm$. Once we get the value of $U$, we should match it with the distribution of $U$ statistics under the valid null hypothesis. The distribution can be obtained in two different ways:

  • The exact implementation
    The straightforward implementation uses dynamic programming to obtain the exact probability of obtaining the given $U$ statistic for the given $n$ and $m$. The only drawback of this solution is its computational complexity (which is $\mathcal{O}(n^2 m^2)$ for both time and memory). This makes this approach an impractical solution for large samples.
  • The approximated implementation
    Fortunately, we have a method of large sample approximation (see hollander1973, pages 68–75). Long story short, we approximate the distribution of $U$ statistics with the following normal distribution (in the case of tied values, a small correction is needed):
$$ \mathcal{N}\left( \frac{nm}{2},\, \frac{nm(n+m+1)}{12} \right) $$

For the middle values of $U$ (around $nm/2$), this approximation provides reasonably accurate results. However, it performs extremely poorly at the tails. To illustrate the problems, we perform a simple numerical study:

  • Enumerate $n$ from $1$ to $50$, set $m$ to $50$ (since $50$ is a critical value, after which wilcox.text switches to the approximated algorithm);
  • For each pair of $n,\, m$, we consider two non-overlapped samples $x = \{ 101, 102, \ldots, 100 + n \}$ and $y = \{ 1, 2, \ldots, m \}$;
  • For each pair of samples $x$ and $y$, we estimate the exact and approximated p-values using wilcox.test (R 4.3.0). We also evaluate the ratio between two obtained p-values.

This simulation can be performed using the following simple script:

options(scipen = 999)
df <- expand.grid(n = 1:50, m = 50)
df$approx <- sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = F)$p.value)
df$exact <-  sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = T)$p.value)
df$ratio <- df$approx / df$exact
df

Here are the results:

nmapproxexactratio
1500.0480115431319581981162160.0196078431372549016886708272977557502.448589
2500.0092523043272633009176390.00075414781297134241434126922243308412.268556
3500.0020686025947267974375850.00004268761205498164098062763294194648.459084
4500.0005074257127105121769830.000003162045337406047683961830482846160.473889
5500.0001337027774379267812920.000000287458667036913421002024661768465.120008
6500.0000374197218987324037920.0000000307991428968121560316140140031214.959846
7500.0000110443910632000547420.0000000037823508820646503198782514732919.980564
8500.0000034200972248077333400.0000000005217035699399518039134059176555.633164
9500.0000011068060287062272590.00000000007958190049931467428569925013907.760706
10500.0000003731141424923649270.00000000001326365008321911291947892028130.577944
11500.0000001306675164853229130.00000000000239180575271164350536516154631.324612
12500.0000000474269828479870030.000000000000462930145686124477766408102449.545120
13500.0000000178037349508737930.000000000000095525268157454267935196186377.230803
14500.0000000068994536566987800.000000000000020896152409443119138672330178.184075
15500.0000000027555177950880920.000000000000004822189017563796906354571424.675609
16500.0000000011324204412042990.000000000000001169015519409405310631968695.814899
17500.0000000004782045165170260.0000000000000002966158780591029268441612201.341500
18500.0000000002072305665908060.0000000000000000785159677215272199642639342.959203
19500.0000000000920454044477940.0000000000000000216203389378118452334257352.519429
20500.0000000000418571009403440.0000000000000000061772396965176703976776020.196196
21500.0000000000194669632401580.00000000000000000182707089615311359510654738.839717
22500.0000000000092504323741390.00000000000000000055827166271345138716569768.791735
23500.0000000000044870407948280.00000000000000000017589381153985451825509941.228442
24500.0000000000022198102312910.00000000000000000005704664158049339438912198.330882
25500.0000000000011191167993570.00000000000000000001901554719349779858852726.559442
26500.0000000000005745197903040.00000000000000000000650531877672292488315393.914190
27500.0000000000003001163360430.000000000000000000002281085804824923131567315.621484
28500.0000000000001594159377930.000000000000000000000818851314552536194682398.330711
29500.0000000000000860497787070.000000000000000000000300590988886374286268657.038994
30500.0000000000000471712144870.000000000000000000000112721620832390418475303.483840
31500.0000000000000262458635400.000000000000000000000043140373404989608382855.044904
32500.0000000000000148136244140.000000000000000000000016835267670240879916179.758513
33500.0000000000000084771652390.0000000000000000000000066935401580471266469616.742985
34500.0000000000000049160047770.0000000000000000000000027092900639711814499245.458048
35500.0000000000000028876216750.0000000000000000000000011155900263412588425503.078257
36500.0000000000000017172698600.0000000000000000000000004669911738173677306886.886681
37500.0000000000000010335232400.0000000000000000000000001986054417385203901920.395980
38500.0000000000000006292269190.0000000000000000000000000857614407517336944364.786281
39500.0000000000000003873733980.00000000000000000000000003758085605910307732138.774328
40500.0000000000000002410602280.00000000000000000000000001670260269314432494906.850683
41500.0000000000000001515792310.00000000000000000000000000752534846620142486658.620667
42500.0000000000000000962773560.00000000000000000000000000343548516928024384121.614346
43500.0000000000000000617500570.00000000000000000000000000158845013238874406735.166344
44500.0000000000000000399800980.00000000000000000000000000074352984953770670472.618752
45500.0000000000000000261224550.00000000000000000000000000035219835074169725522.991974
46500.0000000000000000172196190.000000000000000000000000000168761709102035109626.813110
47500.0000000000000000114486300.000000000000000000000000000081771137140008207685.065460
48500.0000000000000000076752020.000000000000000000000000000040051169191634915718.995544
49500.0000000000000000051870810.000000000000000000000000000019823306261665784427.152771
50500.0000000000000000035330360.000000000000000000000000000009911653356452748856.307617

Thus, for $n=m=50$, the ratio between the exact (the correct one) and approximated (the default one) solutions is about $3.5 \cdot 10^{11}$. Of course, both p-values are pretty small in this case. If you are using the notorious $\alpha = 0.05$, you will most likely avoid such a problem (however, for $n=1,\, m = 50$, the exact true p-value is $0.0196$, while the approximated one is $0.0480$, whish is alarmingly close to $0.05$).

If you prefer using extremely low statistical significance levels and there are non-zero chances of getting extreme values of the $U$ statistic, you may consider making exact = TRUE your default way to call wilcox.test. Such a simple trick may save you from misleading research results.

Python

In SciPy1, the Mann-Whitney U test is available via the function mannwhitneyu. Similarly to R, it has two methods to estimate the value: the exact one and the normally approximated one. The difference between these methods can be shown using the following snippet:

import numpy as np
from scipy.stats import mannwhitneyu

x = np.arange(101, 106) # n = 5
y = np.arange(1, 51)    # m = 50

print(mannwhitneyu(x, y, alternative="greater", method="exact"))
print(mannwhitneyu(x, y, alternative="greater", method="asymptotic"))

This code gives the following output:

MannwhitneyuResult(statistic=250.0, pvalue=2.874586670369134e-07)
MannwhitneyuResult(statistic=250.0, pvalue=0.00013370277743792665)

The results match the situation we observed in R: the approximated value is 465 times larger than the exact value.

However, the default strategy for choosing between exact and asymptotic is different between R and SciPy. In R, wilcox.text uses the exact strategy when n < 50 AND m < 50 (and there are no tied values). In SciPy, mannwhitneyu uses the exact strategy when n <= 8 OR m <= 8 (and there are no tied values). Therefore, SciPy switches to the asymptotic methods when both samples contain at least 9 elements. Here is a snippet that illustrates this transition:

import numpy as np
from scipy.stats import mannwhitneyu

x = np.arange(101, 109) # n = 8
y = np.arange(1, 9)     # m = 8

print("n = m = 8")
print("auto:       " + str(mannwhitneyu(x, y, alternative="greater")))
print("exact:      " + str(mannwhitneyu(x, y, alternative="greater", method="exact")))
print("asymptotic: " + str(mannwhitneyu(x, y, alternative="greater", method="asymptotic")))
print("")

x = np.arange(101, 110) # n = 9
y = np.arange(1, 10)    # m = 9

print("n = m = 9")
print("auto:       " + str(mannwhitneyu(x, y, alternative="greater")))
print("exact:      " + str(mannwhitneyu(x, y, alternative="greater", method="exact")))
print("asymptotic: " + str(mannwhitneyu(x, y, alternative="greater", method="asymptotic")))

Here is the corresponding output:

n = m = 8
auto:       MannwhitneyuResult(statistic=64.0, pvalue=7.77000777000777e-05)
exact:      MannwhitneyuResult(statistic=64.0, pvalue=7.77000777000777e-05)
asymptotic: MannwhitneyuResult(statistic=64.0, pvalue=0.00046955284955859495)

n = m = 9
auto:       MannwhitneyuResult(statistic=81.0, pvalue=0.00020614740103084563)
exact:      MannwhitneyuResult(statistic=81.0, pvalue=2.0567667626491154e-05)
asymptotic: MannwhitneyuResult(statistic=81.0, pvalue=0.00020614740103084563)

Thus, we should be careful when using mannwhitneyu from SciPy: the asymptotic method is enabled by default even for relatively small samples which can lead to significant errors in p-value evaluation for extreme values of the U statistic.


  1. We consider SciPy 1.10.1 ↩︎